University of Warsaw, Faculty of Economic Sciences

Topic modelling

General information

Course code:	2400-ZEWW878
Erasmus / ISCED code:	14.3 / (0311) Economics
Course name:	Topic modelling
Unit:	Faculty of Economic Sciences
Groups:	English-language course offer of WNE UW; Major courses for Data Science; Elective major courses - second-cycle studies IE - group 1 (6*30h); Elective major courses for bachelor's studies IE; Elective major courses for bachelor's studies MSEM
ECTS credits and other:	3.00
Language of instruction:	English
Course type:	elective
Short description:	The course aims to provide a comprehensive review of topic modelling, i.e. a set of machine learning methods for grouping texts. It starts with the steps that must be completed before the modelling phase, i.e. collecting textual data and processing it. Next, basic topic modelling algorithms are discussed, in particular the most popular one, Latent Dirichlet Allocation (LDA). More specialised and advanced topic modelling algorithms are then introduced, followed by the most common problems encountered when performing a topic modelling analysis. The course ends with case studies.
Full description:
1. Big picture of topic modelling (lab 1).
   a. What is topic modelling?
   b. What is the procedure for obtaining topics and drawing conclusions?
   c. Example practical applications of topic modelling.
2. Collecting textual data for topic modelling (lab 2).
   a. Review of web scraping techniques.
   b. Crawling.
   c. Most common technical issues.
   d. Ethics and possible legal problems.
   e. Review of Python libraries: Selenium and Beautiful Soup, with example code.
3. Textual data processing (lab 3).
   a. Tokenization.
   b. Stemming.
   c. Lemmatization.
   d. Stopwords.
   e. N-grams.
   f. Term Frequency (TF).
   g. Inverse Document Frequency (IDF).
   h. TF-IDF.
4. Basic topic modelling algorithms (labs 4-6).
   a. Latent Semantic Analysis (LSA).
   b. Non-Negative Matrix Factorization (NNMF).
   c. Probabilistic Latent Semantic Analysis (PLSA).
5. Latent Dirichlet Allocation (LDA) (lab 7).
   a. LDA algorithm.
   b. Mean Field Variational Method.
   c. Gibbs sampling.
6. LDA-based topic models (labs 8-9).
   a. Supervised topic models.
   b. Correlated Topic Model (CTM).
   c. Pachinko Allocation Topic Model (PAM).
   d. Hierarchical Topic Model.
   e. Spherical topic models.
   f. Author Topic Model.
   g. Multilingual Topic Model.
   h. Dynamic Topic Model.
   i. Syntactic Topic Model.
7. Measuring a model's performance (lab 10).
   a. Perplexity.
   b. Topic coherence measures.
8. Challenges of topic modelling (lab 11).
   a. Visualisation issues.
   b. Interpretation issues.
   c. Memory efficiency.
   d. Stability of topics.
9. Case studies (labs 12-13).
10. Students' presentations (labs 14-15).
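The text-processing steps listed under item 3 can be illustrated with a minimal Python sketch. It is not taken from the course materials: the toy corpus, the stopword list and all function names are hypothetical, and it uses one common TF-IDF variant (relative term frequency multiplied by log(N / document frequency)); libraries often apply different smoothing.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, chosen only for illustration.
DOCS = [
    "The central bank raised interest rates",
    "Interest rates affect the housing market",
    "The football team won the match",
]
STOPWORDS = {"the", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def bigrams(tokens):
    """Return adjacent token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))


def tf(tokens):
    """Term frequency: count of each term divided by the document length."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}


def idf(tokenized_docs):
    """Inverse document frequency: log(N / number of documents with the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in doc_freq.items()}


def tf_idf(tokenized_docs):
    """Return one {term: weight} dictionary per document."""
    idf_weights = idf(tokenized_docs)
    return [
        {term: w * idf_weights[term] for term, w in tf(doc).items()}
        for doc in tokenized_docs
    ]


tokenized = [tokenize(d) for d in DOCS]
weights = tf_idf(tokenized)
```

In this toy corpus "interest" occurs in two of the three documents, so its weight in the first document is lower than that of "bank", which occurs in only one. Stemming and lemmatization are left out because they normally rely on an external library.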
Literature:
Compulsory:
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50-57).
Kherwa, P., & Bansal, P. (2020). Topic modelling: a comprehensive review. EAI Endorsed Transactions on Scalable Information Systems, 7(24).
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3), 259-284.
Ramos, J. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, No. 1, pp. 29-48).
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring topic coherence over many models and many topics. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 952-961).
Additional:
Aletras, N., & Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).
Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2), 179-190.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17-35.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, 22.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 100-108).
Learning outcomes:	Students will be able to collect textual data and process it with R or Python tools, following practices that are well established in the literature. They will know the theoretical background of different topic modelling algorithms and will be able to group texts using different approaches. Furthermore, they will know how to measure a model's performance and will be aware of the challenges and most commonly encountered problems of topic modelling.
Assessment methods and criteria:	The final grade is based on points awarded for a take-home project (80%) and its presentation (20%).
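One of the performance measures named in the course outline is perplexity. As a rough illustration only, the sketch below computes it as the exponential of the negative mean log-likelihood per token. The function name and the sample values are hypothetical, and the sketch assumes that per-token log-probabilities from some already fitted model are available.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability per token."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


# Hypothetical held-out text of 8 tokens, each given probability 1/4 by the model.
uniform_log_probs = [math.log(1 / 4)] * 8
uniform_perplexity = perplexity(uniform_log_probs)
```

A model that spreads probability evenly over four words has a perplexity of about 4 on such text, and lower values indicate a better fit to held-out data.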

Classes in period "Winter semester 2023/24" (ended)

Period:	2023-10-01 - 2024-01-28
Class type:	Seminar (konwersatorium), 30 hours
Coordinators:	Maciej Świtała, Piotr Wójcik
Group instructors:	Maciej Świtała
Credit:	Course - graded credit; Seminar - graded credit

Course descriptions in USOS and USOSweb are protected by copyright.
The copyright holder is the University of Warsaw, Faculty of Economic Sciences.