Topic modelling
General information
Course code: | 2400-ZEWW878 |
Erasmus / ISCED code: | 14.3 |
Course name: | Topic modelling |
Unit: | Wydział Nauk Ekonomicznych (Faculty of Economic Sciences) |
Groups: |
English-language course offer of WNE UW; Major courses for Data Science; Elective major courses - second-cycle studies IE - group 1 (6*30h); Major elective courses for bachelor's studies IE; Major elective courses for bachelor's studies MSEM |
ECTS points and other: | 3.00 |
Language of instruction: | English |
Course type: | elective |
Short description: |
The course aims to provide a comprehensive review of topic modelling, i.e. a set of machine learning methods for grouping texts. It first covers the introductory steps that must be performed before the modelling phase, i.e. collecting and processing textual data. Next, basic topic modelling algorithms are discussed. In particular, the most popular algorithm, Latent Dirichlet Allocation (LDA), is presented. More specialised and advanced topic modelling algorithms are then introduced, followed by the most common problems encountered when performing a topic modelling analysis. The course ends with case studies. |
Full description: |
1. Big picture of topic modelling (lab 1). a. What is topic modelling? b. What is the procedure for obtaining topics and drawing conclusions? c. Examples of practical applications of topic modelling.
2. Collecting textual data for topic modelling (lab 2). a. Review of web scraping techniques. b. Crawling. c. Most common technical issues. d. Ethics and possible legal problems. e. Review of Python libraries: Selenium and Beautiful Soup, with example code.
3. Textual data processing (lab 3). a. Tokenization. b. Stemming. c. Lemmatization. d. Stopwords. e. N-grams. f. Term Frequency (TF). g. Inverse Document Frequency (IDF). h. TF-IDF.
4. Basic topic modelling algorithms (labs 4-6). a. Latent Semantic Analysis (LSA). b. Non-Negative Matrix Factorization (NNMF). c. Probabilistic Latent Semantic Analysis (PLSA).
5. Latent Dirichlet Allocation (LDA) (lab 7). a. LDA algorithm. b. Mean Field Variational Method. c. Gibbs sampling.
6. LDA-based topic models (labs 8-9). a. Supervised topic models. b. Correlated Topic Model (CTM). c. Pachinko Allocation Model (PAM). d. Hierarchical Topic Model. e. Spherical topic models. f. Author Topic Model. g. Multilingual Topic Model. h. Dynamic Topic Model. i. Syntactic Topic Model.
7. Measuring model performance (lab 10). a. Perplexity. b. Topic coherence measures.
8. Challenges of topic modelling (lab 11). a. Visualisation issues. b. Interpretation issues. c. Memory efficiency. d. Stability of topics.
9. Case studies (labs 12-13).
10. Students’ presentations (labs 14-15). |
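As a rough illustration of items 3f to 3h in the outline, the sketch below computes TF-IDF weights in plain Python. It is not taken from the course materials. The whitespace tokenizer, the choice of relative term frequency with an unsmoothed logarithmic IDF, the function names `tokenize` and `tf_idf`, and the toy corpus are all assumptions made for this example. Library implementations typically use smoothed variants and will give different numbers.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF  = occurrences of the term in the document / tokens in the document
    IDF = log(number of documents / documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
corpus = [
    "topic models group texts",
    "topic models need clean texts",
    "web scraping collects texts",
]
scores = tf_idf(corpus)
```

With this weighting, a term that occurs in every document (here "texts") gets a weight of zero, while a term unique to one document (here "group") is weighted more heavily than one shared by two documents (here "topic").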
Literature: |
Compulsory:
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of machine learning research, 3(Jan), 993-1022.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50-57).
Kherwa, P., & Bansal, P. (2020). Topic modelling: a comprehensive review. EAI Endorsed transactions on scalable information systems, 7(24).
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse processes, 25(2-3), 259-284.
Ramos, J. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, No. 1, pp. 29-48).
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring topic coherence over many models and many topics. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 952-961).
Additional:
Aletras, N., & Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).
Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE transactions on pattern analysis and machine intelligence, (2), 179-190.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The annals of applied statistics, 1(1), 17-35.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391-407.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2), 211.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 100-108). |
Learning outcomes: |
Students will be able to collect textual data and process it with R or Python tools, following practices that are well established in the literature. They will know the theoretical background of different topic modelling algorithms and will be able to group texts using different approaches. Furthermore, they will know how to measure a model’s performance. They will also be aware of the challenges of topic modelling and the most commonly encountered problems. |
Assessment methods and criteria: |
The final grade is based on the points awarded for a take-home project (80%) and its presentation (20%). |
Classes in the "Winter semester 2023/24" cycle (concluded)
Period: | 2023-10-01 - 2024-01-28 |
Mon Tue Wed Thu KON Fri |
Class type: | Konwersatorium (seminar), 30 hours | |
Coordinators: | Maciej Świtała, Piotr Wójcik | |
Group instructors: | Maciej Świtała | |
Assessment: | Course - graded credit; Konwersatorium - graded credit |
Copyright is held by Uniwersytet Warszawski (University of Warsaw), Wydział Nauk Ekonomicznych (Faculty of Economic Sciences).