Course taught as: B031285 - DATA MINING Second Cycle Degree in ARTIFICIAL INTELLIGENCE
Teaching Language
Lectures are in Italian, but all the teaching material is in English
Course Content
Data warehouse, Frequent itemsets, Dimensionality reduction, Clustering, Locality-sensitive hashing, Text mining, Linguistic pre-processing, Probabilistic and neural language models, Word embeddings, Text categorization, POS tagging, NER
Main textbooks:
A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011
P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar, Introduction to Data Mining, Pearson, 2019
D. Jurafsky, J. H. Martin, Speech and Language Processing, 2020
D. Sarkar, Text Analytics with Python, Apress, 2019
Additional books:
I. Witten, Text Mining, 2004
C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
Details on book availability are on the Moodle page of the course.
Learning Objectives
The course first introduces the main Data Mining techniques for modeling large amounts of data and extracting useful information from them.
Second, it considers the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end, it introduces the main models and algorithms in Information Retrieval and Natural Language Processing.
Prerequisites
Knowledge of the topics typically taught in Algorithms and Data Structures classes is essential. Some knowledge of Machine Learning is useful.
Teaching Methods
Classes, homework.
Further information
Oral exams are usually held after the report has been completed.
Type of Assessment
Study and presentation of one research paper to the class. Writing of a short report on the studied topic. Oral exam.
Course program
Data Mining
Data warehouse.
Frequent itemsets: The market-basket model. Association rules. Algorithms for computing frequent itemsets and association rules. Hash-based filtering. PCY algorithm, random sampling, SON algorithm, Apriori with MapReduce. Bloom filters.
Finding similar items: document similarity, shingling, min-hashing.
Locality-sensitive hashing (LSH): families of hash functions. LSH for cosine distance. LSH for Euclidean distance.
Curse of dimensionality. Dimensionality reduction. Principal Component Analysis (PCA). Singular Value Decomposition (SVD)
Text Mining. Information Retrieval. Boolean and Vector Space Model (tf-idf). Inverted Index.
Linguistic pre-processing: tagging, stop-word removal, lemmatization, stemming. Wildcard queries. N-grams, Edit-distance.
Spelling correction. Performance evaluation in Information Retrieval (Precision, Recall).
Probabilistic language models. Text categorization. Word meaning, vector semantics. Dense embeddings. POS tagging. Named-entity recognition (NER).