Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields
Conference paper, 2017


Abstract

A large number of digitized lexical resources remain unexploited because their content is unstructured. Manually structuring such resources is a costly task given their manifold complexity. Our goal is to find an approach to automatically structure digitized dictionaries, independently of the language, the lexicographic school, or the style. In this paper we present a first version of GROBID-Dictionaries, an open source machine learning system for lexical information extraction. Our approach is twofold: we perform a cascading structure extraction and, at each level, select specific features for training.

We follow a "divide and conquer" strategy to dismantle the text constructs of a digitized dictionary, based on the observation of their layout. Main pages (see Figure 1) in almost any dictionary share three blocks: a header (green), a footer (blue) and a body (orange). The body, in turn, consists of several entries (red). Each lexical entry can be further decomposed (see Figure 2) into: form (green), etymology (blue), sense (red) and/or related entry. The same logic could be applied further to each extracted block, but within the scope of this paper we focus on the first three levels.

The cascading approach ensures a better understanding of the learning process's output and consequently simplifies feature selection. Having a limited set of mutually exclusive text blocks per level helps significantly in diagnosing the cause of prediction errors, and allows early detection and replacement of irrelevant features that could bias a trained model. With such a segmentation, it becomes straightforward to notice that, for instance, the position of a token on the page is highly relevant for detecting headers and footers, but has almost no relevance for capturing a sense in a lexical entry, which is very often split across two pages.

To implement our approach, we built on the available infrastructure of GROBID [7], a machine learning system for the extraction of bibliographic metadata. GROBID adopts the same cascading approach and uses Conditional Random Fields (CRF) [6] to label text sequences. GROBID-Dictionaries is planned to generate a TEI-compliant encoding [2, 9] in which the various segmentation levels are associated with an appropriate XML tessellation. Collaboration with the COST ENeL action is ongoing to ensure maximal compatibility with existing dictionary projects.

Our experiments so far justify these choices: models for the first two levels, trained on two different dictionary samples, have given high precision and recall with a small amount of annotated data. Relying mainly on the text layout, we diversified the selected features for each model at the token and line levels. We are working on tuning the features and annotating more data in order to maintain these results on new samples and to improve the third segmentation level. While only a few task-specific attempts [1] have applied machine learning in this research direction, the landscape remains dominated by rule-based techniques [4, 3, 8], which are ad hoc and costly, or even impossible, to adapt to new lexical resources.
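
To make the targeted XML tessellation concrete, the following is a minimal, hedged sketch of a TEI-encoded lexical entry built from the blocks named above (form, etymology, sense, related entry). It uses standard elements of the TEI P5 Dictionaries module; the element choice, attribute values and content are illustrative assumptions, not the exact serialization emitted by GROBID-Dictionaries.

    <!-- Illustrative sketch only: standard TEI P5 dictionary elements,
         not the exact output format of GROBID-Dictionaries. -->
    <entry xml:id="example-entry">
      <form type="lemma">                                               <!-- form block -->
        <orth>example</orth>
      </form>
      <etym>From Latin <mentioned>exemplum</mentioned>.</etym>          <!-- etymology block -->
      <sense n="1">                                                     <!-- sense block -->
        <def>A representative instance used for illustration.</def>
      </sense>
      <re type="related">                                               <!-- related entry block -->
        <form><orth>exemplary</orth></form>
      </re>
    </entry>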
Main file
elex2017_Abstract.pdf (682.01 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01508868, version 1 (17-08-2017)
hal-01508868, version 2 (29-08-2017)

Licence

Attribution

Identifiers

  • HAL Id: hal-01508868, version 1

Cite

Mohamed Khemakhem, Luca Foppiano, Laurent Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. Electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands. ⟨hal-01508868v1⟩
