Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation

Abstract : In many applications, data mining and machine learning methods are extensively used to analyze Web data and discover actionable knowledge. However, “dirty data” is a chronic plague that causes incorrect results and misleading conclusions, generally followed by inadequate decisions. To ensure the validity of output results and avoid bias or data snooping, it is necessary to control not only the whole Web data analytics pipeline, but most importantly the quality of Web data with appropriate data preparation and curation choices. For a given dataset and a given machine learning model, a plethora of data preprocessing techniques and alternative data cleaning strategies may lead to dramatically different outputs with unequal quality performance. It is therefore crucial to rely on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
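
To make the idea in the abstract concrete, here is a minimal sketch of tabular Q-learning choosing an order of preprocessing tasks. It is not the Learn2Clean implementation: the task names (`impute`, `dedupe`, `normalize`), the `quality` function, and all hyperparameters are hypothetical stand-ins. In a real setting, `quality` would apply the tasks to the dataset, train the ML model, and return the chosen quality metric.

```python
import random

# Hypothetical preprocessing tasks; the real method uses its own task set.
TASKS = ["impute", "dedupe", "normalize"]
STOP = "stop"

def quality(seq):
    """Toy stand-in for the ML quality metric (an assumption, not from the paper)."""
    score = 0.5
    if "impute" in seq:
        score += 0.2
        # Reward normalizing only after missing values were imputed.
        if "normalize" in seq and seq.index("impute") < seq.index("normalize"):
            score += 0.15
    if seq and seq[0] == "dedupe":
        score += 0.1
    return score

def actions(state):
    """Tasks not applied yet, plus the option to stop early."""
    return [t for t in TASKS if t not in state] + [STOP]

def step(state, action):
    """Apply one action; the reward is only given when the sequence ends."""
    nxt = state if action == STOP else state + (action,)
    done = action == STOP or len(nxt) == len(TASKS)
    return nxt, (quality(nxt) if done else 0.0), done

def train(episodes=5000, alpha=0.5, gamma=1.0, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {}  # maps (state, action) to an estimated return
    for _ in range(episodes):
        state, done = (), False
        while not done:
            acts = actions(state)
            if rng.random() < epsilon:  # explore
                action = rng.choice(acts)
            else:  # exploit the current estimates
                action = max(acts, key=lambda a: q.get((state, a), 0.0))
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(
                q.get((nxt, a), 0.0) for a in actions(nxt)
            )
            old = q.get((state, action), 0.0)
            # Standard Q-learning update.
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q

def best_sequence(q):
    """Follow the greedy policy from the empty sequence."""
    state, done = (), False
    while not done:
        action = max(actions(state), key=lambda a: q.get((state, a), 0.0))
        state, _, done = step(state, action)
    return state

q_table = train()
best = best_sequence(q_table)
print(best, quality(best))
```

The state here is simply the tuple of tasks applied so far, which keeps the Q-table tiny. With a real dataset, each call to `quality` would be expensive, so caching scores per sequence would be a natural next step.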

https://hal.ird.fr/ird-02092548
Contributor : Laure Berti-Equille
Submitted on : Monday, April 8, 2019 - 11:20:32 AM
Last modification on : Friday, May 17, 2019 - 1:19:58 AM

Citation

L. Berti-Equille. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. The Web Conf 2019, ACM, May 2019, San Francisco, United States. ⟨10.1145/3308558.3313602⟩. ⟨ird-02092548⟩
