Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation - Archive ouverte HAL
Conference Papers Year : 2019

Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation

Abstract

In many applications, data mining and machine learning methods are extensively used to analyze Web data and discover actionable knowledge. But “dirty data” is a chronic plague that causes incorrect results and misleading conclusions, generally followed by inadequate decisions. To ensure the validity of output results and avoid bias or data snooping, it is necessary to control not only the whole Web data analytics pipeline but, most importantly, the quality of Web data with appropriate data preparation and curation choices. For a given dataset and a given machine learning model, a plethora of data preprocessing techniques and alternative data cleaning strategies may lead to dramatically different outputs with unequal quality performance. It is then crucial to rely on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
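
To make the idea in the abstract concrete, here is a minimal sketch of tabular Q-learning choosing an ordered sequence of preprocessing tasks. It is not the Learn2Clean implementation: the task names, the `toy_quality` scores, and every function name below are hypothetical stand-ins, and the reward that would normally come from training the ML model and measuring its quality is replaced by a made-up scoring function.

```python
import random

# Hypothetical preprocessing tasks (assumption, not from the paper).
TASKS = ["impute", "normalize", "deduplicate"]
FIT = "fit_model"  # terminal action: train the model and observe its quality


def toy_quality(pipeline):
    """Made-up stand-in for 'train the ML model and measure its quality'."""
    score = 0.5
    if "impute" in pipeline:
        score += 0.2
        # Normalizing only helps here if it happens after imputation.
        if "normalize" in pipeline and \
                pipeline.index("impute") < pipeline.index("normalize"):
            score += 0.2
    if "deduplicate" in pipeline:
        score -= 0.1  # pretend deduplication drops valid records
    return score


def allowed(state):
    """Each task may be applied at most once; fitting is always allowed."""
    return [t for t in TASKS if t not in state] + [FIT]


def q_learn(episodes=2000, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # maps (state, action) to an estimated return; state = tasks so far
    for _ in range(episodes):
        state = ()
        while True:
            actions = allowed(state)
            if rng.random() < epsilon:  # explore
                action = rng.choice(actions)
            else:  # exploit the current estimates
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            if action == FIT:
                # The only reward is the model quality at the end.
                target = toy_quality(state)
            else:
                next_state = state + (action,)
                target = gamma * max(
                    q.get((next_state, a), 0.0) for a in allowed(next_state)
                )
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            if action == FIT:
                break
            state = next_state
    return q


def greedy_pipeline(q):
    """Follow the highest-valued action from the empty pipeline until FIT."""
    state = ()
    while True:
        action = max(allowed(state), key=lambda a: q.get((state, a), 0.0))
        if action == FIT:
            return list(state)
        state = state + (action,)


if __name__ == "__main__":
    print(greedy_pipeline(q_learn()))
```

The toy scores are constructed so that imputing and then normalizing, without deduplication, is the single best pipeline; in a real setting `toy_quality` would be replaced by fitting the chosen model on the preprocessed data and returning the chosen quality metric.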
Dates and versions

ird-02092548, version 1 (08-04-2019)

Cite

L. Berti-Equille. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. The Web Conf 2019, ACM, May 2019, San Francisco, United States. ⟨10.1145/3308558.3313602⟩. ⟨ird-02092548⟩