(I) C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press - 2008
(I) A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, 2011
(L) I. Witten, A. Moffat, T.C. Bell, Managing Gigabytes, Van Nostrand Reinhold – 1999
(L) D. Doermann, K. Tombre (Eds.), Handbook of Document Image Processing and Recognition, 2014
Note:
(L) : Book available in the Engineering library
(I) : Book available on the Internet (authors' version)
Learning Objectives
The course first aims at introducing the main Data Mining techniques that allow you to model large amounts of data and extract useful information.
Secondly, we consider the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end we introduce the main models and algorithms in Information Retrieval and describe the techniques for information extraction from born-digital and digitized documents that are represented as images.
Prerequisites
It is essential to know the topics typically taught in the Databases and the Algorithms and Data Structures classes. Some knowledge of Artificial Intelligence can be useful.
Teaching Methods
Classes, homework and project.
Further information
Oral exams are usually held after completion of the assigned project.
Type of Assessment
Study and presentation of one research paper to the class (15%). Group project (2 people, 65%). Oral exam on a subset of the topics (20%).
Alternatively, it is possible to take an oral exam on all the topics and complete a smaller project.
Course program
Basic concepts of Secondary Storage
Large scale file systems
Map-reduce, algorithms using Map-reduce
Information Retrieval
Document Engineering
Document Image Analysis and Recognition
Data Mining
Finding Similar Items, Frequent itemsets, Clustering, High-dimensional spaces and dimensionality reduction, Web mining, Data warehouses
Homework & project