Course catalogue | Università degli Studi di Firenze

Course year

First year - Second Semester

Department

Information Engineering (DINFO)

Course Type

Single education field course

Scientific Area

ING-INF/05 - INFORMATION PROCESSING SYSTEMS

Credits

6

Teaching Hours

48

Teaching Term

27/02/2023 ⇒ 09/06/2023

Attendance required

No

Type of Evaluation

Final Grade

Lecturer

MARINAI SIMONE

Shared course

Course taught as:
B031285 - DATA MINING
Second Cycle Degree in ARTIFICIAL INTELLIGENCE

Teaching Language

Lectures are in Italian, but all the teaching material is in English

Course Content

Data warehouse, Frequent itemsets, Dimensionality reduction, Clustering, Locality-sensitive hashing, Text mining, Linguistic pre-processing, Probabilistic and neural language models, Word embeddings, Text categorization, POS tagging, NER

Learning Objectives

The course first introduces the main Data Mining techniques for modelling large amounts of data and extracting useful information from them.
It then addresses the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end, it introduces the main models and algorithms of Information Retrieval and Natural Language Processing.

Prerequisites

Familiarity with the topics typically taught in Algorithms and Data Structures courses is essential. Some knowledge of Machine Learning is useful.

Teaching Methods

Classes, homework.

Further information

Oral exams usually take place after the report has been completed.

Type of Assessment

Study and presentation of one research paper to the class. Writing of a short report on the studied topic. Oral exam.

Course program

Data Mining
Data warehouse.
Frequent itemsets: the market-basket model. Association rules. Algorithms for computing frequent itemsets and association rules. Hash-based filtering. PCY algorithm, random sampling, SON algorithm, Apriori with MapReduce. Bloom filters.
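
As an illustration of the frequent-itemset topic above, the following is a minimal sketch of a two-pass Apriori-style count of frequent pairs under the market-basket model. The function name `frequent_pairs`, the sample baskets and the support threshold are assumptions made for this example and are not taken from the course material.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Two-pass Apriori-style count: frequent items first, then pairs of them."""
    # Pass 1: count in how many baskets each single item appears.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only candidate pairs whose members are both frequent
    # (monotonicity: a pair cannot be frequent unless both items are).
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}

# Hypothetical baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
result = frequent_pairs(baskets, min_support=3)
```

With a support threshold of 3, the items "eggs" and "cola" are discarded after the first pass, so no pair containing them is ever counted in the second pass.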

Finding similar items. Document similarity, shingling, min-hashing.
Locality-sensitive hashing (LSH).
Families of hash functions. LSH for cosine distance. LSH for Euclidean distance.
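
The shingling and min-hashing steps listed above can be sketched as follows. This is only an illustrative sketch: the character shingle length, the number of hash functions, the use of `zlib.crc32` to map shingles to integers, and the linear hash family modulo a prime are all assumptions chosen for the example, not choices documented by the course.

```python
import random
import zlib

PRIME = 4294967311  # a prime slightly above 2**32, so every crc32 value is below it

def shingles(text, k=5):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def make_hash_params(n, seed=0):
    """Draw n pairs (a, b) defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, hash_params):
    """For each hash function, keep the minimum hash value over the set."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def estimate_similarity(sig1, sig2):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Hypothetical documents, for illustration only.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over a lazy dog"
params = make_hash_params(200)
set_a, set_b = shingles(doc_a), shingles(doc_b)
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
true_sim = jaccard(set_a, set_b)
est_sim = estimate_similarity(sig_a, sig_b)
```

The signatures have a fixed length regardless of document size. LSH would then split each signature into bands and hash the bands to find candidate pairs.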

Curse of dimensionality. Dimensionality reduction. Principal Component Analysis (PCA). Singular Value Decomposition (SVD).

Clustering. Distance measures. Hierarchical clustering, k-means clustering. SOM. BFR, DBSCAN, cluster validity.

Text Mining. Information Retrieval. Boolean and vector space models (tf-idf). Inverted index.
Linguistic pre-processing: tagging, stop-word removal, lemmatization, stemming. Wildcard queries. N-grams, edit distance.
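
To make the vector space model concrete, here is a small sketch of tf-idf weighting with cosine similarity. The weighting variant (log-scaled term frequency times log inverse document frequency), the whitespace tokenizer and the sample documents are assumptions made for this example. Other tf-idf variants are equally valid.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one sparse tf-idf vector (a dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mini-collection, for illustration only.
docs = [
    "frequent itemsets and association rules",
    "association rules from frequent itemsets",
    "word embeddings for text categorization",
]
vectors = tf_idf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

The first two documents share four terms and obtain a positive similarity, while the third shares no term with the first and scores zero.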

Spelling correction. Performance evaluation in Information Retrieval (Precision, Recall).
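
Precision and recall for a single query can be computed as in the sketch below. The function name `precision_recall` and the document identifiers are hypothetical and serve only to illustrate the definitions.

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / |retrieved|, recall = TP / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments for one query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.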

Probabilistic language models. Text categorization. Word meaning, vector semantics. Dense embeddings. POS tagging. Named entity recognition (NER).

Lab: Python notebooks for clustering and NLP

B031358 - DATA MINING

Academic Year 2022-23
