Conference paper. Year: 2019

Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation

Abstract

In many applications, data mining and machine learning methods are used extensively to analyze Web data and discover actionable knowledge. But “dirty data” is a chronic plague that causes incorrect results and misleading conclusions, which are generally followed by inadequate decisions. To ensure the validity of output results and to avoid bias or data snooping, it is necessary to control not only the whole Web data analytics pipeline but, most importantly, the quality of Web data through appropriate data preparation and curation choices. For a given dataset and a given machine learning model, a plethora of data preprocessing techniques and alternative data cleaning strategies may lead to dramatically different outputs of unequal quality. It is therefore crucial to rely on a principled method to select the sequence of data preprocessing steps that leads to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, ML model, and quality performance metric, the optimal sequence of data preprocessing tasks such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
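The idea of using Q-Learning to choose an ordering of preprocessing tasks can be illustrated with a minimal tabular sketch. This is not the Learn2Clean implementation: the task names, the `toy_quality` scoring function (a stand-in for "train the ML model and measure the quality metric"), the fixed sequence length, and all hyperparameters are assumptions made for illustration only.

```python
import itertools
import random

# Hypothetical preprocessing tasks; the paper's actual task set is not given here.
ACTIONS = ["impute", "deduplicate", "remove_outliers", "normalize"]
SEQUENCE_LENGTH = 3  # assumed fixed pipeline length, for simplicity


def before(sequence, first, second):
    """True if both tasks are in the sequence and `first` comes earlier."""
    return (first in sequence and second in sequence
            and sequence.index(first) < sequence.index(second))


def toy_quality(sequence):
    """Hypothetical reward: stands in for the ML model's quality metric."""
    score = 0.5
    if "impute" in sequence:
        score += 0.20
    if before(sequence, "impute", "normalize"):
        score += 0.10
    if before(sequence, "remove_outliers", "normalize"):
        score += 0.08
    if before(sequence, "impute", "remove_outliers"):
        score += 0.05
    if "deduplicate" in sequence:
        score += 0.04
    return score


def train(episodes=10000, alpha=0.5, gamma=1.0, epsilon=0.5, seed=0):
    """Tabular Q-Learning; a state is the tuple of tasks applied so far."""
    rng = random.Random(seed)
    q = {}  # maps (state, action) -> estimated value
    for _ in range(episodes):
        state = ()
        while len(state) < SEQUENCE_LENGTH:
            available = [a for a in ACTIONS if a not in state]
            if rng.random() < epsilon:  # explore
                action = rng.choice(available)
            else:  # exploit the current estimates
                action = max(available, key=lambda a: q.get((state, a), 0.0))
            next_state = state + (action,)
            done = len(next_state) == SEQUENCE_LENGTH
            # The reward is only observed once the full pipeline has run.
            reward = toy_quality(next_state) if done else 0.0
            future = 0.0 if done else max(
                q.get((next_state, a), 0.0)
                for a in ACTIONS if a not in next_state
            )
            old = q.get((state, action), 0.0)
            # Standard Q-Learning update rule.
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = next_state
    return q


def greedy_sequence(q):
    """Read off the task sequence that the learned Q-table prefers."""
    state = ()
    while len(state) < SEQUENCE_LENGTH:
        available = [a for a in ACTIONS if a not in state]
        state += (max(available, key=lambda a: q.get((state, a), 0.0)),)
    return state


q_table = train()
learned = greedy_sequence(q_table)
# Exhaustive search is feasible only because this toy space is tiny.
best = max(itertools.permutations(ACTIONS, SEQUENCE_LENGTH), key=toy_quality)
print(learned, toy_quality(learned))
```

In this toy setting the search space is small enough to enumerate, which makes it easy to compare the learned sequence against the brute-force optimum; with real datasets each reward evaluation would require running the preprocessing pipeline and fitting the model, which is where a learned policy becomes useful.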
No file deposited

Dates and versions

ird-02092548, version 1 (08-04-2019)

Identifiers

Cite

L. Berti-Equille. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. The Web Conf 2019, ACM, May 2019, San Francisco, United States. ⟨10.1145/3308558.3313602⟩. ⟨ird-02092548⟩