Representation learning on relational data to automate data preparation
Bio: Gaël Varoquaux is a research director working on data science at Inria (the French national institute for research in computer science), where he leads the Soda team on computational and statistical methods to understand health and society with data. Varoquaux is an expert in machine learning, with an eye on applications in health and social science. He develops tools to make machine learning easier and suited for real-life, messy data. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. He currently develops data-intensive approaches for epidemiology and public health, and worked for 10 years on machine learning for brain function and mental health. Varoquaux has a PhD in quantum physics supervised by Alain Aspect and is a graduate of École Normale Supérieure, Paris.

Abstract: In standard data-science practice, a significant effort is spent on preparing the data before statistical learning. One reason is that the data come from various tables, each with its own subject matter and specificities. This is unlike natural images, or even natural text, where universal regularities have enabled representation learning, fueling the deep learning revolution. I will present progress on learning representations with data tables, overcoming the lack of simple regularities. I will show how these representations decrease the need for data preparation: matching entities and aggregating the data across tables. Character-level modeling enables statistical learning without normalized entities, as in the dirty-cat library: https://dirty-cat.github.io. Representation learning across many tables, describing objects of different natures and varying attributes, can aggregate the distributed information, forming vector representations of entities.
As a result, we created general-purpose embeddings that enrich many data analyses by summarizing all the numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books: https://soda-inria.github.io/ken_embeddings/
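The character-level idea in the abstract can be illustrated with a toy sketch in plain Python: a messy category value is represented by its vector of similarities to known categories, computed from character 3-grams, so a misspelling such as "london" still lands close to "London" without any entity normalization. This uses Jaccard similarity of 3-gram sets as a simplification; it is not dirty-cat's actual implementation, and the `encode` helper and category list are made up for illustration.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased, padded string."""
    s = f"  {s.lower()}  "  # padding marks word boundaries
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# Hypothetical reference categories observed in a table column.
categories = ["London", "Londres", "Paris"]

def encode(value):
    """Encode a (possibly misspelled) entry as similarities to each category."""
    return [ngram_similarity(value, c) for c in categories]

print(encode("london"))  # high similarity to "London", zero to "Paris"
```

The resulting vectors can feed a standard estimator directly, which is the point of the approach: learning proceeds on non-normalized entries instead of requiring a manual entity-matching pass first.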