PHD PROGRAMME IN DATA SCIENCE

Course offering, academic year 2024/2025

First year

Activities

Advanced institutional courses	YES
Seminar or laboratory activities	YES
Research-related activities	YES
Training and research activities independently chosen by the PhD student and approved by the Academic Board (Collegio dei Docenti)	YES

List of courses/activities

course/activity	hours
Indices of Centrality for Complex Networks and their Efficient Computation. We introduce the main centrality/role indices used to rank nodes and/or data in large complex networks, and then describe algorithmic methods to compute them efficiently. Type: other; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Seminar series	15
Mining Massive Data. The course is organized in 3 modules. - Mining Huge Data Sets. One of the key problems in Data Mining is to quickly retrieve all items that are similar according to effective notions of similarity, such as Jaccard similarity. We cover the Locality Sensitive Hashing technique, which can be used to break the n^2-time barrier required to solve the problem in the worst case. We introduce the framework of Streaming Algorithms: algorithms for problems in which the input is so large that it cannot be stored in memory, so the algorithm can look at each input element only once, in an online fashion. We study the problems of counting distinct elements, finding the most frequent elements, and counting the elements in a given query window that meet a certain criterion. - Web Search Engines. The PageRank Algorithm: introduction to the key algorithmic ideas of PageRank and how it computes an effective popularity score that modern Web search engines apply to rank Web sites and pages. - The Bitcoin Lightning Network. Introduction to the fully decentralized system designed to manage the massive data produced by the micropayments that take place over the Bitcoin network. Expected background: undergraduate courses in Algorithms & Data Structures and in Probability. Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Lecturers: Prof. Clementi, Prof. Gualà, Prof. Pasquale ({clementi,guala,pasquale}@mat.uniroma2.it)	12
Web of Data. The course introduces the Web of Data, as outlined by the Semantic Web and Linked Open Data, as an extension of the Web into a global dataspace for the publication, reuse, and integration of data. Best practices and (open) standards will be discussed, with emphasis on machine actionability, a core value of the FAIR paradigm for data custody. Stressing the distributed and decentralized nature of the Web of Data, a prerequisite for autonomy and independence, we will discuss how to avoid data silos through a distributed, as-needed integration process. In this regard, ontology matching and entity linking techniques will be discussed. The dual role of the Web of Data, as a controlled environment for Big Data experimentation and as a source of background knowledge for information extraction and content analytics in general, will also be covered. Various examples of large datasets will be shown throughout the course, including general resources such as DBpedia and Wikidata and more domain-specific GLAM (Galleries, Libraries, Archives, and Museums) resources. Concrete examples of tabular data lifting and modern standards for their semantic annotation will also be shown. Background: foundations of Logic, Databases, and the Java or Python languages. Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Lecturer: Dr. Manuel Fiorelli (fiorelli@info.uniroma2.it)	14
Hands-on Machine Learning for Physics. The course aims to deepen the concepts, techniques, and tools needed to build the Machine Learning (ML) algorithms most commonly used in Physics. The target audience is PhD students who want to learn how to write ML code for the data analysis of physics problems. Special emphasis will be given to how generative algorithms work, such as Variational Auto-Encoders and Generative Adversarial Networks. The course starts with a brief theoretical recap of the problem and then continues with the implementation of an ML algorithm in all its phases, from the construction of the dataset to the validation of the results. Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Lecturer: Michele Buzzicotti (m.buzzicotti@gmail.com)	14

Additional information on the training plan, 1st year
Average total hours/year	20
Total course hours	3
Other teaching activities	Lectures and exercise sessions in Bachelor's and Master's degree courses
Method of choosing the thesis topic	Independent
Assessment for admission to the following year	Information gathering; final seminar on the activities carried out during the year and on progress toward the PhD thesis
Notes

Second year

Activities

Advanced institutional courses	YES
Seminar or laboratory activities	YES
Research-related activities	YES
Training and research activities independently chosen by the PhD student and approved by the Academic Board (Collegio dei Docenti)	YES

List of courses/activities

course/activity	hours
Deep Learning and Structured Inference: Neural Models and Algorithms for Linguistic Recognition and Inference. Modern AI increasingly faces complex problems characterized by heterogeneous forms of structured input evidence and by complex decisions. In medicine, historical data, biological phenomena, and images appear as streams of structured data, usually represented digitally as sequences, trees, or graphs. Machine Learning methods for structured learning have been studied, and several mathematical paradigms (such as dimensionality reduction, structured kernels, or neural embeddings) have been proposed as modeling tools. In Natural Language Processing, Machine Translation and other Natural Language Inference (NLI) tasks, such as Question Answering or Textual Entailment, have been approached via kernels or neural models of the input representation. These have achieved accurate, state-of-the-art classification and prediction capabilities by enabling the exploration of huge spaces of possible solutions (e.g. target sequences or decisions). They thus serve both as enabling technologies and software tools and as models of investigation able to systematically select hypotheses and validate controversial theories about linguistic phenomena. The application of these empirical methodologies to other areas such as biology, medicine, and medical robotics is more than promising, given the similar complexity of the domains targeted by AI and the Life Sciences. The course aims to promote this research perspective in Deep Learning among PhD students, with a specific focus on, but not limited to, Life Science phenomena. Type: other; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Seminar series	12
An Introduction to Score-Based Generative Models. In simple terms, generative modeling consists in learning a map that generates new data instances resembling a given set of observations, starting from a simple prior distribution, most often a standard Gaussian. This course provides a mathematical introduction to generative models and in particular to Score-based Generative Models (SGMs). SGMs have gained prominence for their ability to generate realistic data across diverse domains, making them a popular tool for researchers and practitioners in machine learning. Participants will learn about the methodological and theoretical foundations of these models, as well as some of their practical applications. The first two lectures motivate the use of generative models, introduce their formalism, and present two simple though relevant examples: energy-based models and Generative Adversarial Networks. The third and fourth lectures present score-based diffusion models and explain how they provide an algorithmic framework for the basic idea that sampling from the time-reversal of a diffusion process converts noise into new data instances. Two different approaches are followed: a first, elementary one that relies only on discrete transition probabilities, and a second one based on stochastic calculus. After this introduction, we derive sharp theoretical convergence guarantees for score-based diffusion models, combining ideas from stochastic control, functional inequalities, and regularity theory for Hamilton-Jacobi-Bellman equations. The course ends with an overview of some of the most recent and sophisticated algorithms, such as flow matching and diffusion Schrödinger bridges (DSB), which bring an (entropic) optimal transport insight into generative modeling. Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: humanities; Advanced training: YES; Final assessment: YES; Language: ENGLISH; Mode: attributable to the PhD student's training plan. Lecturers: Profs. Giovanni Conforti (Università di Padova) and Alain Durmus (École Polytechnique, Paris)	12
Quantile Regression. The main techniques of quantile regression, an alternative to classical linear regression, will be introduced. As an example, consider a regression model estimating the association between the Equivalised Disposable Income of a sample of households and various predictors, including an exogenous treatment. Quantile regression makes it possible to estimate the effect of the treatment on the entire income distribution, yielding a potentially different estimated effect at each quantile. Indeed, the treatment effect could be positive for the income of rich households (high quantiles) and negative for the income of poor households (low quantiles). Similarly, the association of predictors with median income can be evaluated without assuming that the response is Gaussian (symmetric, homoscedastic) or that there are no outliers. If time permits, principles of robust statistics will also be discussed, including robust linear regression techniques and robust prediction. Background: use of the R software; undergraduate courses in Statistical Inference and Linear Models. Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Lecturer: Prof. Alessio Farcomeni (alessio.farcomeni@uniroma2.it)	0
Simulation-based Predictive Process Mining. The course introduces the essential elements of process mining (PM) and simulation. These approaches are first presented as tools for analyzing processes from different perspectives and with different objectives. While PM aims to extract knowledge by analyzing a log that records data on past process executions, simulation provides predictions about future or alternative behaviors of the same process. An innovative point of view is then proposed, in which PM and simulation are seen as complementary tools whose joint adoption leads to an effective analysis paradigm. The first part of the course introduces basic concepts of simulation: simulation modeling, discrete event simulation, and local and distributed simulation. The implementation of a Java-based discrete event simulator is also discussed. The second part covers principles, methods, and tools for PM. Finally, the course introduces “Predictive Process Mining” as an innovative paradigm based on the joint use of the two approaches. It outlines how the knowledge extracted from log analysis through PM techniques can guide the development of a simulation model, whose execution provides further insights into the system under study. In this context, the most relevant research challenges, opportunities, and open issues are illustrated. Background: basic skills in software development and knowledge of at least one object-oriented programming language (Java recommended). Type: dedicated training schools; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: YES; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan. Lecturer: Dr. Paolo Bocciarelli (paolo.bocciarelli@uniroma2.it)	12

Additional information on the training plan, 2nd year	During the first two years of the PhD, students have a wide choice among diverse courses related to mathematical-algorithmic paradigms and to methods and technologies for data reuse in various experimental, modeling, and applied (industrial) settings. They are required to attend at least two courses.
Average total hours/year	12
Total course hours	12
Other teaching activities	Laboratory experimentation; classroom teaching in support of Master's degree courses
Thesis preparation method	In collaboration with local or foreign research centers
Assessment for admission to the following year	Research seminars, a summary report, and the Research Plan for the PhD thesis
Notes

Third year

Activities

Advanced institutional courses	NO
Seminar or laboratory activities	YES
Research-related activities	YES
Training and research activities independently chosen by the PhD student and approved by the Academic Board (Collegio dei Docenti)	NO

List of courses/activities

course/activity	hours
Preparation of the PhD thesis. Seminars and experimentation sessions. Type: workshop; Course type: international; Macro-sector: open science; Area: scientific; Advanced training: YES; Final assessment: NO; Language: ITALIAN/ENGLISH; Mode: attributable to the PhD student's training plan	0

Additional information on the training plan, 3rd year	The Research Plan produced in the second year is confirmed at the end of that year and monitored by the Academic Board during the first semester of the third year.
Average total hours/year	25
Total course hours	25
Other teaching activities	Seminars; teaching support for Master's degree courses
Admission to the final examination	Research seminars; updates on progress toward the final PhD thesis
Format of the final examination	Evaluation by external reviewers; evaluation by an internal committee; thesis defense before a panel of three faculty members
Notes

Università degli Studi di Roma "Tor Vergata" - Via Cracovia, 50, 00133 Roma RM