Rute Costa, Ana Salgado, Margarida Ramos, Sara Carvalho, Fahad Khan, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2023. A crossroad between lexicography and terminology work: Knowledge organization and domain labelling. Digital Scholarship in the Humanities, 38, pages i17–i29. Oxford University Press.
The MORDigital project aims to encode selected editions of the Diccionario de Lingua Portugueza by António de Morais Silva, first published in 1789. Our ultimate goals are, on the one hand, to promote accessibility to cultural heritage while fostering reusability and, on the other hand, to contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards. The Morais dictionary represents a significant legacy, since it marks the beginning of Portuguese dictionaries, having served as a model for all subsequent lexicographic production. The team follows a new paradigm in lexicography, which results from the convergence between lexicography, terminology, computational linguistics, and ontologies as an integral part of digital humanities and linked (open) data. In the Portuguese context, this research fills a gap concerning searchable online retrodigitized dictionaries, built on current standards and methodologies which promote data sharing and harmonization, namely TEI Lex-0. The team will further ensure the connection to other existing systems and lexical resources, particularly in the Portuguese-speaking world.
Simon Gabay, Philippe Gambette, Rachel Bawden and Benoît Sagot. 2023. Ancien ou moderne ? Pistes computationnelles pour l'analyse graphématique des textes écrits au XVIIe siècle. Linx, 85. Presses Universitaires de Paris Nanterre.
The use of contemporary spelling rather than old graphic systems in the vast majority of current editions of 17th century French texts has the unfortunate effect of masking their graphematic richness. Such valuable information has remained concealed and therefore under-exploited, despite the potential it holds in terms of analysis. By favouring a practical corpus-based approach, rather than a theoretical one, and by relying on a recategorisation of the various competing systems at that time in French scriptae, we propose the foundations of a scriptometric study of the classical language, focusing on the analysis of specific documents, both manuscripts and old prints.
Thibault Clérice, Malamatenia Vlachou-Efstathiou and Alix Chagué. 2023. CREMMA Medii Aevi: Literary manuscript text recognition in Latin. Journal of Open Humanities Data, 9, pages 1–19. Ubiquity Press.
This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin, from the 11th to the 16th century. It connects with the Medieval French dataset as well as earlier Latin datasets by enforcing common guidelines. We provide our own additions to Ariane Pinche's Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed the compilation of this dataset through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviation marks, we offer new models that outperform the base Old French model on Latin datasets, reaching readability levels on unknown manuscripts.
Conference proceedings
Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux and Abdelrahman Mohamed. 2023. Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). IEEE. Ixia-Ialyssos, Greece.
The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of learned units is studied intrinsically using zero speech metrics and on the downstream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
Maud Bénard, Alexandra Mestivier, Natalie Kubler, Lichao Zhu, Rachel Bawden, Eric De La Clergerie, Laurent Romary, Mathilde Huguin, Jean-François Nominé, Ziqian Peng and François Yvon. 2023. MaTOS: Traduction automatique pour la science ouverte. In Actes de la 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 8–15. ATALA. Paris, France.
This contribution presents the MaTOS project (Machine Translation for Open Science), which aims to develop new methods for the full-document machine translation (MT) of scientific texts between French and English, as well as automatic metrics to evaluate the quality of the translations produced. To this end, MaTOS focuses on (a) collecting open resources for specialized MT; (b) describing textual coherence markers in scientific articles; (c) developing new multilingual document-level processing methods; and (d) metrics measuring progress in the translation of complete documents.
Simon Meoni, Rian Touchent and Eric De La Clergerie. 2023. Passe ta pharma d'abord ! In 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 68–76. ATALA. Paris, France.
We present the three experiments carried out by the ALMAnaCH - Arkhn team and their results for the DÉfi Fouille de Textes (DEFT) 2023 challenge. The scores are encouraging but above all suggest new elements to take into account in order to succeed at this challenge. We explored different approaches with models of various sizes and modelled the task in different ways (multi-label classification, textual entailment, sequence-to-sequence). We did not observe significant performance gains. Our experiments suggest that external knowledge bases are necessary to obtain good results on this type of task.
Lionel Tadonfouet Tadjou, Eric De La Clergerie, Fabrice Bourge and Tiphaine Marie. 2023. Constitution de sous-fils de conversations d'emails. In 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 157–171. ATALA. Paris, France.
Email conversations in the workplace are sometimes difficult for collaborators to follow because they can deal with multiple topics and involve many interlocutors. To improve understanding of key messages, it is helpful to create subthreads within the conversation. In our study, we propose a two-stage pipeline to recognize dialogue acts in email text segments and link them to improve information accessibility. This pipeline creates pairs of text segments across the conversation, making it easier to understand the key messages. To our knowledge, this is the first time this issue of creating conversation threads has been addressed in email conversations. We annotated the BC3 corpus of emails with dialogue acts and linked conversation email text segments.
Lydia Nishimwe. 2023. Normalisation lexicale de contenus générés par les utilisateurs sur les réseaux sociaux. In Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 : 25e Rencontres Etudiants Chercheurs en Informatique pour le TAL (RECITAL), pages 160–183. ATALA. Paris, France.
The rise of natural language processing (NLP) is taking place in a world where more and more content is produced online. On social media in particular, texts published by users are full of "non-standard" phenomena such as spelling mistakes, slang, markers of expressiveness, etc. As a result, NLP models, largely trained on "standard" data, see their performance drop when applied to user-generated content (UGC). One approach to mitigating this degradation is lexical normalization: non-standard words are replaced by their standard forms. In this article, we present a survey of lexical normalization of UGC, as well as a preliminary experimental study showing the advantages and difficulties of this task.
Simon Meoni, Théo Ryffel and Eric De La Clergerie. 2023. Annotation d'entités cliniques en utilisant les Larges Modèles de Langue. In 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 190–203. ATALA. Paris, France.
In the clinical domain, as in other specialized domains, data are scarce due to their confidential nature. This lack of data is a major problem when fine-tuning language models. Moreover, very large language models (LLMs) show promising performance in the medical domain. However, they cannot be used directly within healthcare institutions' infrastructures for data confidentiality reasons. We explore an approach of annotating training data with LLMs in order to train smaller models better suited to our problem. This method yields promising results for information extraction tasks.
You Zuo, Kim Gerdes, Houda Mouzoun, Samir Ghamri Doudane and Benoît Sagot. 2023. Exploring Data-Centric Strategies for French Patent Classification: A Baseline and Comparisons. In 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 349–365. ATALA. Paris, France.
This paper proposes a novel approach to French patent classification leveraging data-centric strategies. We compare different approaches for the two deepest levels of the IPC hierarchy: the IPC group and subgroups. Our experiments show that while simple ensemble strategies work for shallower levels, deeper levels require more sophisticated techniques such as data augmentation, clustering, and negative sampling. Our research highlights the importance of language-specific features and data-centric strategies for accurate and reliable French patent classification. It provides valuable insights and solutions for researchers and practitioners in the field of patent classification, advancing research in French patent classification.
Rian Touchent, Laurent Romary and Eric De La Clergerie. 2023. CamemBERT-bio : Un modèle de langue français savoureux et meilleur pour la santé. In 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 323–334. ATALA. Paris, France.
Clinical data in hospitals are increasingly accessible for research through health data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical reports. Transfer learning using BERT-like models such as CamemBERT has enabled major advances, particularly for named entity recognition. However, these models are trained on everyday language and perform less well on biomedical data. This is why we propose a new public French biomedical dataset on which we continued the pre-training of CamemBERT. We thus present a first version of CamemBERT-bio, a public model specialized for the French biomedical domain that shows an average gain of 2.54 F1 points on various biomedical named entity recognition evaluation sets.
Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot and Rachel Bawden. 2023. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects. In Actes de la 18e Conférence en Recherche d'Information et Applications–16e Rencontres Jeunes Chercheurs en RI–30e Conférence sur le Traitement Automatique des Langues Naturelles–25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 28–42. ATALA. Paris, France.
Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the- shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
Francesca Frontini, Laurent Romary and Anas Fahad Khan. 2023. ISO LMF 24613-6: A Revised Syntax Semantics Module for the Lexical Markup Framework. In LDK 2023–4th Conference on Language, Data and Knowledge. Vienna, Austria.
The Lexical Markup Framework (LMF) is a meta-model for representing data in monolingual and multilingual lexical databases with a view to its use in computer applications. The "new LMF" replaces the old LMF standard, ISO 24613:2008, and is being published as a multi-part standard. This short paper introduces one of these new parts, ISO 24613-6, namely the Syntax and Semantics (SynSem) module. The SynSem module allows for the description of syntactic and semantic properties of lexemes, as well as the complex interactions between them. While the new standard remains faithful to (and backwards compatible with) the syntax and semantics coverage of the previous model, the new standard clarifies and simplifies it in a few places, which will be illustrated.
Alix Chagué, Thibault Clérice, Jade Norindr, Maxime Humeau, Baudoin Davoury, Elsa Van Kote, Anaïs Mazoue, Margaux Faure and Soline Doat. 2023. Manu McFrench, from zero to hero: impact of using a generic handwriting recognition model for smaller datasets. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.
Long paper presentation for ADHO's annual conference on Digital Humanities (2023), discussing the importance of using generic transcription models for HTR and how to create them. We use the case of the CREMMA datasets and the Manu McFrench models as an example.
Thibault Clérice, Alix Chagué and Hugo Scheithauer. 2023. Workshop HTR-United: metadata, quality control and sharing process for HTR training data. In DH 2023 - Digital Humanities Conference: Collaboration as Opportunity. Graz, Austria.
Workshop for ADHO's 2023 conference on Digital Humanities, introducing HTR-United's main features and demonstrating how to use them, on top of presenting essential Continuous Integration principles.
Sonal Sannigrahi and Rachel Bawden. 2023. Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 181–192. Tampere, Finland.
Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati and Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements, and our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences even for relatively low-resource languages. Our code will be made publicly available.
Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 157–170. Tampere, Finland.
The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from overgeneration and generating in the wrong language, but this is greatly improved in the few-shot setting, with very good results for a number of language pairs. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2023. Generative Spoken Dialogue Language Modeling. In SLT-2022 - IEEE Spoken Language Technology Workshop. Doha, Qatar.
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot and Rachel Bawden. 2023. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5394–5413. Toronto, Canada.
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters, a novel guided self-attention mechanism and which is jointly trained on both visually-conditioned masking and MMT. We also introduce CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation set of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks and outperforms baselines and state-of-the-art MMT systems by a large margin on our contrastive test set. Our code and CoMMuTE are freely available.
Wissam Antoun, Benoît Sagot and Djamé Seddah. 2023. Data-Efficient French Language Modeling with CamemBERTa. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5174–5185. Association for Computational Linguistics. Toronto, Canada.
Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and compute, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model's performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective.
Communications
Thibault Clérice and Anthony Glaise. 2023. Twenty-One* Pseudo-Chrysostoms and more: authorship verification in the patristic world. In CHR 2023: Computational Humanities Research Conference. Paris, France.
As the most prolific of the Church Fathers, John Chrysostom (344-407 CE) has a vast textual mass and theological importance that has led to a significant misattribution of texts, resulting in the existence of a second corpus known as the pseudo-Chrysostomian corpus. Like many Greek-language Church Fathers' works, this corpus comprises anonymous texts, which scholars have attempted to reattribute or group together based on factors such as the person's function, biography, ideology, style, etc. One survey conducted by Voicu in 1981 explored potential groupings of such texts and produced a critical list of 21 Pseudo-Chrysostom works identified by scholars, including Montfaucon (1655-1741), one of the first modern editors of Chrysostom's writings. In this paper, we present a novel approach to addressing pseudonymous work in the context of chrysostomian studies. We propose to employ siamese networks within an authorship verification framework, following the methodology commonly used in recent computational linguistic competitions. Our embedding model is trained using commonly used features in the digital humanities landscape, such as the most frequent words, affixes, and POS trigrams, utilizing a signal-to-noise ratio distance and pair mining. The results of our model show high AUC-ROC scores (0.855). Furthermore, the article concludes with an analysis of the pseudo-Chrysostoms proposed by Voicu. We validate a significant portion of the hypotheses found in Voicu's survey while also providing counter-arguments for two Pseudo-Chrysostoms. This research contributes to shedding light on the attribution of ancient texts and enriches the field of chrysostomian studies.
Chahan Vidal-Gorène, Jean-Baptiste Camps and Thibault Clérice. 2023. Synthetic lines from historical manuscripts: an experiment using GAN and style transfer. In Visual Processing of Digital Manuscripts: Workflows, Pipelines, Best Practices (ICIAP 2023 Workshops). Udine, Italy.
Given enough data of sufficient quality, HTR systems can achieve high accuracy, regardless of language, script or medium. Despite the growing pooling of datasets, the question of the required quantity of training material still remains crucial for the transfer of models to out-of-domain documents, or for the recognition of new scripts and under-resourced character classes. We propose a new data augmentation strategy, using generative adversarial networks (GAN). Inspired by synthetic line generation for printed documents, our objective is to generate handwritten lines in order to massively produce data for a given style or under-resourced character class. Our approach, based on a variant of ScrabbleGAN, demonstrates feasibility for various scripts, whether in the presence of a high number and variety of abbreviations (Latin), of spellings or letter forms (Old French), in a situation of data scarcity (Armenian), or in the instance of a very cursive script (Arabic Maghribi). We then study the impact of synthetic line generation on HTR, by evaluating the gain for out-of-domain documents and under-resourced classes.
Ana Salgado, Rute Costa, Sara Carvalho, Anas Fahad Khan, Bruno Almeida, Margarida Ramos, Raquel Silva, Mohamed Khemakhem, Laurent Romary and Toma Tasovac. 2023. Domain labelling in the Morais dictionary: bringing structure to unstructured lexicographic data. In 24th Biennial Dictionary Society of North America Conference (DSNA). Boulder, United States.
This article provides a detailed analysis of the use of domain labels, i.e., special markers identifying a specialised field of knowledge, in successive editions of the Morais dictionary. Morais is a historical Portuguese language dictionary, commonly known by and disseminated under the name of António de Morais Silva. This monolingual dictionary has relevance for the Portuguese lexicographic tradition as it inaugurates modern Portuguese lexicography and serves as a model for all subsequent lexicographic production throughout the 19th and 20th centuries. The domain labels were retrieved from the abbreviation lists of its various editions. This work is part of an ongoing Portuguese national linguistic project. It has two goals: 1) to encode the first three editions of the Morais dictionary to make them available online (as well as publishing them as lexical resources using two different standards for structured lexicographic datasets) and 2) to provide a description of the lexicographic components of these editions following a rigorous linguistic treatment. This project is not merely of a lexicographic nature, but it also explores the convergence between lexicography and other research domains, such as terminology, ontologies, linked data, and digital humanities. This article analyzes the domain labelling system in Morais from an evolutionary and diachronic perspective, in line with previous works that highlight the theoretical assumptions and methodological aspects of the lexicographical tradition around domain labelling. To organize lexicographic content, it is helpful to establish a hierarchical structure in general language dictionaries to systematize the included terminological information. Each table of abbreviations has two distinct columns: one with the abbreviation and the other with the complete domain designations. Given the importance of domain labels, we conducted a survey of all domain labels found.
We identify and demonstrate the previous and newly added domains. After reviewing the flat domain list, we evaluated whether there was a discernible knowledge organizational approach that identified possible generic domains and subdomains. In the organization of domains, we propose three possible levels: superdomain, domain, and subdomain. The superdomain corresponds to the broadest taxonomic grouping followed by a domain, whereas the subdomain is part of a broader domain. To facilitate the analysis and to focus on interoperability issues, we generated a metalabel, a tag that identifies the English equivalent of the corresponding domain. The lists of domains included in general dictionaries’ outside matter follow alphabetical ordering, without any concern for the relationships that can be established between those types of labels. This article describes both onomasiological and semasiological approaches to treating specialized lexicographic content. Following terminological principles and an onomasiological approach, we organize and conceptualize specialized knowledge using structured data formats, such as Text Encoding Initiative, also considering future alignments between different lexicographic resources. The project will contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards.
Alix Chagué and Thibault Clérice. 2023. "I'm here to fight for ground truth": HTR-United, a solution towards a common for HTR training data. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.
Short paper presentation for ADHO's annual conference on the Digital Humanities (DH2023), introducing the HTR-United infrastructure and the stakes of sharing training datasets for HTR of historical documents.
El Haff Karim, Wissam Antoun, Florence Le Ber and Véronique Pitchon. 2023. Reconnaissance des entités nommées pour l'analyse des pharmacopées médiévales. In EGC 2023 - Extraction et Gestion des Connaissances. Lyon, France.
Today, many projects focus on the application of linguistic technologies on modern medical corpora, especially in the field of Named Entity Recognition. Besides, ancient pharmacopoeias are being explored with manual data entry by specialists in history and biology in order to extract knowledge. These analyses are carried out without necessarily going through the automatic recognition of named entities which could accelerate the exploration of the manuscripts. Therefore, we propose here a link between the two practices by: (1) creating a named entity recognition dataset for English translations of medieval Arabic pharmacopoeias and (2) training and evaluating language models that are pre-trained on multiple domains.
Preprints
Tu Anh Nguyen, Maureen De Seyssel, Robin Algayres, Patricia Rozé, Ewan Dunbar and Emmanuel Dupoux. 2023. Are word boundaries useful for unsupervised language learning? Preprint.
Benjamin Muller. 2022. How Can We Make Language Models Better at Handling the Diversity and Variability of Natural Languages? PhD thesis. Sorbonne Université.
Deep Learning for NLP has led to impressive empirical progress in recent years. In essence, this progress is based on better contextualized representations that can be easily used for a wide variety of tasks. However, these models usually require substantial computing power and large amounts of raw textual data. This makes language's inherent diversity and variability a vivid challenge in NLP. We focus on the following question: how can we make language models better at handling the variability and diversity of natural languages? First, we explore the generalizability of language models by building and analyzing one of the first large-scale replications of a BERT model for a non-English language. Our results raise the question of using these language models on highly variable domains such as those found online. Focusing on lexical normalization, we show that this task can be approached with BERT-like models. However, we show that it only partially helps downstream performance. In consequence, we focus on adaptation techniques using what we refer to as representation transfer and explore challenging settings such as the zero-shot setting and low-resource languages. We show that multilingual language models can be adapted and used efficiently with low-resource languages, even ones unseen during pretraining, and that the script is a critical component in this adaptation.
Clémentine Fourrier. 2022. Neural Approaches to Historical Word Reconstruction. PhD thesis. Université PSL (Paris Sciences & Lettres).
In historical linguistics, cognates are words that descend in direct line from a common ancestor, called their proto-form, and therefore are representative of their respective languages' evolutions through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manners of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek to methodically study the applicability of machine translation inspired neural networks to historical word prediction, relying on the surface similarity of both tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allows us to vary task complexity and data size in a controlled environment, therefore identifying if and under which conditions neural networks are applicable. We then extend our work to real datasets (after having updated an etymological database to gather a sufficient amount of data), study the transferability of our conclusions to real data, then the applicability of a number of data augmentation techniques to the task, to try to mitigate low-resource situations. We finally investigate in more detail our best models, multilingual neural networks. We first confirm that, on the surface, they seem to capture language relatedness information and phonetic similarity, confirming prior work. We then discover, by probing them, that the information they store is actually more complex: our multilingual models actually encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well as or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in the previous works.
Pedro Ortiz Suarez. 2022. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. PhD thesis. Sorbonne Université.
In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data that is needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of tasks in NLP for contemporary, medieval and early modern French as well as for six other contemporary languages. Furthermore, we are able to determine, not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but we also show that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.
Journal articles
Alix Chagué. 2022. eScriptorium : une application libre pour la transcription automatique des manuscrits. Arabesques, page 25. ABES.
Alix Chagué and Laurent Romary. 2022. L'intelligence artificielle, une ouverture du champ des possibles. Arabesques, pages 4–5. ABES.
Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot and Emmanuel Dupoux. 2022. DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics 10, pages 1051–1065. The MIT Press.
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Tu Anh Nguyen, Benoit Sagot and Emmanuel Dupoux. 2022. Are Discrete Units Necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing 16, pages 1415–1423.
Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1-Speech Only).
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl and Alexandra Birch. 2022. Survey of Low-Resource Machine Translation. Computational Linguistics 48, pages 673–732. The MIT Press.
We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal and Mofetoluwa Adeyemi. 2022. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10, pages 50–72. The MIT Press.
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Jack Bowers, Axel Herold, Laurent Romary and Toma Tasovac. 2022. TEI Lex-0 Etym – Towards Terse Recommendations for the Encoding of Etymological Information. Journal of the Text Encoding Initiative. TEI Consortium.
The present paper describes the etymological component of the TEI Lex-0 initiative, which aims at defining a terser subset of the TEI guidelines for the representation of etymological features in dictionary entries. Going beyond the basic provision of etymological mechanisms in the TEI guidelines, TEI Lex-0 Etym proposes a systematic representation of etymological and cognate descriptions by means of embedded constructs based on the <etym> (for etymologies) and <cit> (for etymons and cognates) elements. In particular, given that all the potential contents of etymons are highly analogous to those of dictionary entries in general, the contents presented herein heavily re-use many of the corresponding features and constraints introduced in other components of TEI Lex-0 for the encoding of etymologies and etymons. The TEI Lex-0 Etym model is also closely aligned to ISO 24613-3 on modelling etymological data and the corresponding TEI serialisation available in ISO 24613-4.
Conference proceedings
Anna Chepaikina, Robert Bossy, Catherine Roussey and Stephan Bernard. 2022. Thesaurus Enrichment via Coordination Extraction. In 16th International Conference on Metadata and Semantics Research (MTSR 2022). London, United Kingdom.
We present a method of thesaurus enrichment based on the extraction of coordinations in a domain-related corpus. Our hypothesis is that there is semantic homogeneity between the conjuncts located in a coordination. We conducted an experiment that allowed us to evaluate the effectiveness of our method. This experiment aims to enrich the concept hierarchy of a French agricultural thesaurus named French Crop Usage (FCU), thanks to the texts of the Plant Health Bulletins (PHB). The FCU thesaurus is published on the Web using the SKOS model.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, Maja Popović and Mariya Shmatova. 2022. Findings of the 2022 Conference on Machine Translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45. Abu Dhabi, United Arab Emirates.
This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment (DA) and a combination of DA and scalar quality metrics (DA+SQM).
Mariana Neves, Antonio Jimeno Yepes, Amy Siu, Roland Roller, Philippe Thomas, Maika Vicente Navarro, Lana Yeganova, Dina Wiemann, Giorgio Maria Di Nunzio, Federica Vezzani, Christel Gérardin, Rachel Bawden, Darryl Johan Estrada, Salvador Lima-López, Eulàlia Farré-Maduell, Martin Krallinger, Cristian Grozea and Aurélie Névéol. 2022. Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 694–723. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.
In the seventh edition of the WMT Biomedical Task, we addressed a total of seven language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian and English/Italian. This year's test sets covered three types of biomedical text genre. In addition to the scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical case translations was given special attention by involving clinicians in the preparation of reference translations and in the manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, five teams participated.
Omer Goldman, Francesco Tinner, Hila Gonen, Benjamin Muller, Victoria Basmov, Shadrack Kirimi, Lydia Nishimwe, Benoît Sagot, Djamé Seddah, Reut Tsarfaty and Duygu Ataman. 2022. The MRL 2022 Shared Task on Multilingual Clause-level Morphology. In 1st Shared Task on Multilingual Clause-level Morphology. Abu Dhabi, United Arab Emirates.
The 2022 Multilingual Representation Learning (MRL) Shared Task was dedicated to clause-level morphology. As the first ever benchmark that defines and evaluates morphology outside its traditional lexical boundaries, the shared task on multilingual clause-level morphology sets the scene for competition across different approaches to morphological modeling, with 3 clause-level sub-tasks: morphological inflection, reinflection and analysis, where systems are required to generate, manipulate or analyze simple sentences centered around a single content lexeme and a set of morphological features characterizing its syntactic clause. This year's tasks covered eight typologically distinct languages: English, French, German, Hebrew, Russian, Spanish, Swahili and Turkish. The task received submissions of four systems from three teams, which were compared to two baselines implementing prominent multilingual learning methods. The results show that modern NLP models are effective in solving morphological tasks even at the clause level. However, there is still room for improvement, especially in the task of morphological analysis.
Nathan Godey, Roman Castagné, Eric Villemonte de La Clergerie and Benoît Sagot. 2022. MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates.
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pretrained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
Syrielle Montariol, Arij Riabi and Djamé Seddah. 2022. Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 347–363. Association for Computational Linguistics. Online.
Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks' positive impact on bridging the hate speech linguistic and cultural gap between languages.
Syrielle Montariol, Étienne Simon, Arij Riabi and Djamé Seddah. 2022. Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance. In Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations, pages 55–65. Association for Computational Linguistics. Dublin, Ireland.
We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT'22 workshop. The task aims at classifying entities in memes into classes such as "hero" and "villain". We use several pre-trained multi-modal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
Jesujoba O. Alabi, Lydia Nishimwe, Benjamin Muller, Camille Rey, Benoît Sagot and Rachel Bawden. 2022. Inria-ALMAnaCH at the WMT 2022 Shared Task: Does Transcription Help Cross-Script Machine Translation? In Proceedings of the Seventh Conference on Machine Translation (WMT). Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.
This paper describes the Inria ALMAnaCH team submission to the WMT 2022 general translation shared task. Participating in the language directions {cs,ru,uk}→en and cs↔uk, we experiment with the use of a dedicated Latin-script transcription convention aimed at representing all Slavic languages involved in a way that maximises character- and word-level correspondences between them as well as with the English language. Our hypothesis was that bringing the source and target language closer could have a positive impact on machine translation results. We provide multiple comparisons, including bilingual and multilingual baselines, with and without transcription. Initial results indicate that the transcription strategy was not successful, resulting in lower results than baselines. We nevertheless submitted our multilingual, transcribed models as our primary systems, and in this paper provide some indications as to why we got these negative results.
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot and Holger Schwenk. 2022. T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need of cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we outperform the state of the art for zero-shot speech translation on MuST-C. We also introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
Louis Martin, Angela Fan, Éric Villemonte de la Clergerie, Antoine Bordes and Benoît Sagot. 2022. MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1651–1664. European Language Resources Association. Marseille, France.
Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We show that this paraphrase data can be mined in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.
Robin Algayres, Adel Nabli, Benoît Sagot and Emmanuel Dupoux. 2022. Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, pages 2123–2127. Incheon, South Korea.
We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5763–5776. Association for Computational Linguistics. Seattle, United States.
We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.
Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary and Benoit Crabbé. 2022. BERTrade: Using Contextual Embeddings to Parse Old French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1104–1113. European Language Resources Association. Marseille, France.
The successes of contextual word embeddings learned by training large-scale language models, while remarkable, have mostly occurred for languages where significant amounts of raw texts are available and where annotated data in downstream tasks have a relatively regular spelling. Conversely, it is not yet completely clear if these models are also well suited for lesser-resourced and more irregular languages. We study the case of Old French, which is in the interesting position of having a relatively limited amount of available raw text, but enough annotated resources to assess the relevance of contextual word embedding models for downstream NLP tasks. In particular, we use POS-tagging and dependency parsing to evaluate the quality of such models in a large array of configurations, including models trained from scratch from small amounts of raw text and models pre-trained on other languages but fine-tuned on Medieval French data.
Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette and Benoît Sagot. 2022. Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 154–165. ATALA. Avignon, France.
Despite their undoubted quality, the resources and tools available for the analysis of Ancien Régime French are no longer able to meet the challenges of research in linguistics and literature for this period. After having precisely defined the chronological framework, we present the corpora made available and the results obtained with them for several NLP tasks, fundamental to the study of language and literature.
Arij Riabi, Syrielle Montariol and Djamé Seddah. 2022. Tâches Auxiliaires Multilingues pour le Transfert de Modèles de Détection de Discours Haineux (Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 413–423. ATALA. Avignon, France.
Hate speech detection is a difficult task, as it requires extensive cultural and contextual knowledge; the knowledge needed varies, among other factors, with the speaker's language or the target of the content. However, annotated data for specific domains and languages are often absent or limited. This is where data in other languages can be exploited; but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of learning multilingual auxiliary tasks (sentiment analysis, named entity recognition, and tasks relying on morpho-syntactic information) on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2022. Quand être absent de mBERT n'est que le commencement : Gérer de nouvelles langues à l'aide de modèles de langues multilingues (When Being Unseen from mBERT is just the Beginning : Handling New Languages With Multilingual Language Models). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 450–451. ATALA. Avignon, France.
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Simon Gabay, Rachel Bawden, Philippe Gambette, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2022. Le changement linguistique au XVIIe s. : nouvelles approches scriptométriques. In CMLF 2022 - 8e Congrès Mondial de Linguistique Française 138, pages 02006.1–14. EDP Sciences. Orléans, France.
Linguistic change in 17th c. France: new scriptometric approaches. The end of the 17th c. remains a blind spot of research on the spelling system, despite the importance of this period for French, during which a strict norm, still (more or less) in place, was created and imposed. Focusing on a practical rather than a theoretical approach, we propose to lay the foundations for a computational scriptometric study of early modern French and analyse the evolution of the spelling system over the 17th c. To do so, we measure and evaluate the distance between the early modern and the contemporary versions of the language, thanks to two automatic normalisers: one rule-based and one neural.
Thibault Charmet, Inès Cherichi, Matthieu Allain, Urszula Czerwinska, Amaury Fouret, Benoît Sagot and Rachel Bawden. 2022. Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France's Court of Cassation Rulings. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4754–4766. European Language Resources Association. Marseille, France.
Detecting divergences in the applications of the law (where the same legal text is applied differently by two rulings) is an important task. It is the mission of the French Cour de Cassation. The first step in the detection of divergences is to detect similar cases, which is currently done manually by experts. They rely on summarised versions of the rulings (syntheses and keyword sequences), which are currently produced manually and are not available for all rulings. There is also a high degree of variability in the keyword choices and the level of granularity used. In this article, we therefore aim to provide automatic tools to facilitate the search for similar rulings. We do this by (i) providing automatic keyword sequence generation models, which can be used to improve the coverage of the analysis, and (ii) providing measures of similarity based on the available texts and augmented with predicted keyword sequences. Our experiments show that the predictions improve correlations of automatically obtained similarities against our specially collected human judgments of similarity.
Francesco De Toni, Christopher Akiki, Javier De La Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter and Daniel Van Strien. 2022. Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 75–83. Association for Computational Linguistics. Virtual and Dublin, Ireland.
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
Clémentine Fourrier and Syrielle Montariol. 2022. Caveats of Measuring Semantic Change of Cognates and Borrowings using Multilingual Word Embeddings. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pages 97–112. Association for Computational Linguistics. Dublin, Ireland.
Cognates and borrowings carry different aspects of etymological evolution. In this work, we study semantic change of such items using multilingual word embeddings, both static and contextualised. We underline caveats identified while building and evaluating these embeddings. We release both said embeddings and a newly-built historical words lexicon, containing typed relations between words of varied Romance languages.
Clémentine Fourrier and Benoît Sagot. 2022. Probing Multilingual Cognate Prediction Models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3786–3801. Association for Computational Linguistics. Dublin, Ireland.
Character-based neural machine translation models have become the reference models for cognate prediction, a historical linguistics task. So far, all linguistic interpretations about latent information captured by such models have been based on external analysis (accuracy, raw results, errors). In this paper, we investigate what probing can tell us about both models and previous interpretations, and learn that though our models store linguistic and diachronic information, they do not achieve it in previously assumed ways.
Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette and Benoît Sagot. 2022. From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3367–3374. European Language Resources Association. Marseille, France.
Language models for historical states of language are becoming increasingly important for the optimal digitisation and analysis of old textual sources. Because these historical states are both more complex to process and scarcer in the available corpora, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3354–3366. European Language Resources Association. Marseille, France.
Spelling normalisation is a useful step in the study and analysis of historical language texts, whether it is manual analysis by experts or automatic analysis using downstream natural language processing (NLP) tools. Not only does it help to homogenise the variable spelling that often exists in historical texts, but it also facilitates the use of off-the-shelf contemporary NLP tools, if contemporary spelling conventions are used for normalisation. We present FREEMnorm, a new benchmark for the normalisation of Early Modern French (from the 17th century) into contemporary French, and provide a thorough comparison of three normalisation methods: ABA, an alignment-based approach, and two MT approaches (one statistical, one neural), including extensive parameter searching, which is often missing from the normalisation literature.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael Mckenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf and Alexander M. Rush. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In Proceedings of the Tenth International Conference on Learning Representations. Online.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models’ pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pre-trained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero, and all prompts are available at https://github.com/bigscience-workshop/promptsource.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4344–4355. European Language Resources Association. Marseille, France.
The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable for pre-training large generative language models, as well as, hopefully, other applications in Natural Language Processing and Digital Humanities.
Communications
Rute Costa, Ana Salgado, Margarida Ramos, Fahad Khan, Sara Carvalho, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2022. Integrating Terminological and Ontological Principles into a Lexicographic Resource. In 1st International Conference on «Multilingual digital terminology today. Design, representation formats and management systems», Vol-3161. CEUR-WS.org. Padova, Italy.
In this paper we present the research taking place at NOVA CLUNL, where an international team is working on the funded project MORDigital. MORDigital's goal is to encode the selected editions of the Diccionario de Lingua Portugueza by António de Morais Silva (MOR), first published in 1789.
Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard and Marcin Detyniecki. 2022. On the Granularity of Explanations in Model Agnostic NLP Interpretability. In XKDD 2022 - ECML PKDD 2022 International Workshop on eXplainable Knowledge Discovery in Data Mining. Grenoble, France.
Current methods for Black-Box NLP interpretability, like LIME or SHAP, are based on altering the text to interpret by removing words and modeling the Black-Box response. In this paper, we outline limitations of this approach when using complex BERT-based classifiers: The word-based sampling produces texts that are out-of-distribution for the classifier and further gives rise to a high-dimensional search space, which can't be sufficiently explored when time or computation power is limited. Both of these challenges can be addressed by using segments as elementary building blocks for NLP interpretability. As illustration, we show that the simple choice of sentences greatly improves on both of these challenges. As a consequence, the resulting explainer attains much better fidelity on a benchmark classification task.
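The general idea can be illustrated with a toy occlusion-style explainer at sentence granularity: remove each segment in turn and measure the change in the black-box score. This is a minimal sketch, not the paper's experimental setup; a trivial word-counting "classifier" stands in for a BERT model:

```python
def toy_classifier(text):
    """Stand-in for a black-box classifier: fraction of negative cue words."""
    negative = {"bad", "poor", "terrible"}
    words = text.lower().split()
    return sum(w.strip(".") in negative for w in words) / max(len(words), 1)

def sentence_attributions(text, predict):
    """Occlusion at sentence granularity: score drop when a sentence is removed."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    full = predict(" ".join(sentences))
    attributions = []
    for i in range(len(sentences)):
        reduced = " ".join(s for j, s in enumerate(sentences) if j != i)
        attributions.append((sentences[i], full - predict(reduced)))
    return attributions

doc = "The plot was terrible. The acting felt sincere. A bad ending."
for sent, attr in sentence_attributions(doc, toy_classifier):
    print(f"{attr:+.3f}  {sent}")
```

With sentences as elementary units, the search space is the number of sentences rather than the number of words, which is the efficiency argument the paper makes.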
Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Javier Ortiz Suárez, Kelly Christensen, Simon Gabay, Ariane Pinche and Jean-Baptiste Camps. 2022. Gallic(orpor)a : Extraction, annotation et diffusion de l'information textuelle et visuelle en diachronie longue. In DataLab de la BnF : Restitution des travaux 2022. Paris, France.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep : Un projet de recherche et développement pour la transcription automatique de répertoires de notaires. In La reconnaissance des écritures manuscrites et ses usages dans les archives. Pierrefitte-sur-Seine, France.
Simon Gabay, Rachel Bawden, Benoît Sagot and Philippe Gambette. 2022. Vers l'étude linguistique sur données artificielles. In Variation(s) en français. Nancy, France.
For decades now, several disciplines have been working with so-called "synthetic" rather than "real" data, that is, data generated by a computational simulation reflecting the real world. Our presentation experiments with this method in diachronic linguistics through the generation of pseudo-historical corpora. We revisit this approach from both a methodological and a technical point of view, taking as a case study the graphic variation of French and its evolution during the Ancien Régime.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep (2018-2021) : Projet de lecture automatique de répertoires de notaires. In Segmenter et annoter les images : déconstruire pour reconstruire. Paris, France.
You Zuo, Houda Mouzoun, Samir Ghamri Doudane, Kim Gerdes and Benoît Sagot. 2022. Patent Classification using Extreme Multi-label Learning: A Case Study of French Patents. In SIGIR 2022 - PatentSemTech workshop - 3rd Workshop on Patent Text Mining and Semantic Technologies. Madrid, Spain.
Most previous patent classification methods have treated the task as a general text classification task, and others have tried to implement XML (extreme multi-label learning) methods designed to handle vast numbers of classes. However, they focus only on the IPC subclass level, which has fewer than 700 labels and is far from "extreme." This paper presents a French patent corpus, INPI-CLS, extracted from the INPI internal database. It contains all parts of patent texts (title, abstract, claims, description) published from 2002 to 2021, with IPC labels at all levels. We test different XML methods and other classification models at the subclass and group levels of the INPI-CLS dataset, with about 600 and 7k labels respectively, demonstrating the validity of the XML approach for patent classification.
You Zuo, Yixuan Li, Alma Parias García and Kim Gerdes. 2022. Technological taxonomies for hypernym and hyponym retrieval in patent texts. In ToTh 2022 - Terminology & Ontology: Theories and applications. Chambéry, France.
This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological terms for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
Laurent Romary and Hugo Scheithauer. 2022. DataCatalogue : enjeux et réalisations. In Un outil numérique pour interroger les catalogues de vente : le projet DataCatalogue. Paris, France.
Aurélia Rostaing and Hugo Scheithauer. 2022. Enrichir le patrimoine écrit archivistique grâce aux technologies numériques : Ingénierie du projet LectAuRep (Lecture automatique de répertoires). In DHNord 2022 - Travailler en Humanités Numériques : collaborations, complémentarités et tensions. Online, France.
Floriane Chiffoleau and Hugo Scheithauer. 2022. From a collection of documents to a published edition: how to use an end-to-end publication pipeline. In TEI 2022 - Text Encoding Initiative 2022 Conference. Newcastle, United Kingdom.
The goal of the workshop is to demonstrate how a corpus can be processed for publication with TEI Publisher. The workshop participants will learn to experiment with a ready-to-use solution that provides an easy and quick publication of a corpus. They will also get tips and shortcuts to help speed up the creation of a digital edition. Moreover, by the end of the session, participants will have a visualization of their respective corpora, with the transformed text and the original image side by side, showing what can be achieved when working with TEI in the context of an end-to-end publication pipeline.
Ariane Pinche, Kelly Christensen and Simon Gabay. 2022. Between automatic and manual encoding. In TEI 2022 conference : Text as data. Newcastle, United Kingdom.
Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source's information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images' content on a large scale. The TEI seems to provide the perfect format to capture both an image's formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project attempts to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
Alix Chagué, Hugo Scheithauer, Lucas Terriel, Floriane Chiffoleau and Yves Tadjo-Takianpi. 2022. Take a sip of TEI and relax: a proposition for an end-to-end workflow to enrich and publish data created with automatic text recognition. In Digital Humanities 2022 : Responding to Asian Diversity. Tokyo, Japan.
Alix Chagué and Thibault Clérice. 2022. Sharing HTR datasets with standardized metadata: the HTR-United initiative. In Documents anciens et reconnaissance automatique des écritures manuscrites. Paris, France.
Hugo Scheithauer. 2022. LectAuRep : Données d'archives en français des XIXe et XXe siècles. In Transkribus / eScriptorium : Transcrire, annoter et éditer numériquement des documents d'archives. Paris, France.
Alix Chagué. 2022. Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains. In 89e Congrès de l'Acfas, Section 310 - Le numérique dans les sciences humaines : édition et visualisation. Montréal, Canada.
A five-minute summary of the doctoral research project entitled "Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains", begun in November 2021 and awarded the GREN 2022 Excellence Grant. The talk placed the project in the context of the current availability of mainstream software for the automatic transcription of handwritten documents, and the lack of conceptual and methodological resources for taking full advantage of it. One of the main difficulties discussed was the convergence of practices towards interoperable models and data.
Florence Clavaud, Laurent Romary, Pauline Charbonnier, Lucas Terriel, Gaetano Piraino and Vincent Verdese. 2022. NER4Archives (named entity recognition for archives) : Conception et réalisation d'un outil de détection, de classification et de résolution des entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Atelier Culture-Inria. Pierrefitte-sur-Seine, France.
Hugo Scheithauer, Laurent Romary, Frédérique Duyrat and Federico Nurra. 2022. DataCatalogue : présentation du projet. In Atelier Culture-Inria. Pierrefitte-sur-Seine, France.
Presentation on the DataCatalogue project, jointly led by Inria, the National Library of France (BnF) and the National Institute for Art History (INHA), at the "journée Atelier culture-Inria," held at the Archives nationales on 03/22/2022.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. In CtrlGen: Controllable Generative Modeling in Language and Vision. virtual, France.
Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects, ...) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles, and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on disturbing individual latent variables for the decoder. Our experiments on raw English text from the SNLI dataset show that i) disentanglement of syntactic roles can be induced without supervision, ii) ADVAE separates more syntactic roles than classical sequence VAEs, iii) realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.
Book chapters
Alix Chagué, Victoria Le Fourner, Manuela Martini and Eric Villemonte de La Clergerie. 2022. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? In La fabrique numérique des corpus en sciences humaines et sociales. Presses Universitaires du Septentrion.
Victoria Le Fourner, Alix Chagué, Manuela Martini and Anaïs Albert. 2022. Structurer automatiquement un corpus homogène issu de la reconnaissance d'écriture manuscrite : les jugements du Conseil des prud'hommes des tissus parisiens. In La fabrique numérique des corpus en sciences humaines et sociales. Presses Universitaires du Septentrion.
Jack Bowers. 2022. Pathways and patterns of metaphor and metonymy in Mixtepec-Mixtec body-part terms. In The Grammar of Body-Part Expressions: A view from the Americas, pages 91–135. Roberto Zariquiey.
Other
Anas Fahad Khan, Ana Salgado, Rute Costa, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2022. Interlinking lexicographic data in the MORDigital project.
Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Ortiz Suarez, Kelly Christensen, Simon Gabay, Ariane Pinche and Jean-Baptiste Camps. 2022. Gallic(orpor)a: Extraction, annotation et diffusion de l'information textuelle et visuelle en diachronie longue.
Alix Chagué. 2022. Intelligence Artificielle et intelligence collective : des nouveaux eldorados pour rendre les textes patrimoniaux plus accessibles ?
Alix Chagué. 2022. Conditions de la mutualisation : les principes FAIR et HTR-United.
Preprints
Alix Chagué, Thibault Clérice and Laurent Romary. 2022. HTR-United : un écosystème pour une approche mutualisée de la transcription automatique des écritures manuscrites. Preprint.
Handwritten Text Recognition (HTR) is a computer process that aims to obtain digital text equivalent to the content of the image of a physical handwritten document. Based on GitHub, HTR-United invites the community of users to decompartmentalize data sourced from different HTR platforms in order to reduce the costs of producing such data. This solution proposes an operational model that could offer a framework for the construction of data papers for HTR, and even the beginnings of a standardization for this type of publication.
Hugo Scheithauer, Alix Chagué and Laurent Romary. 2022. Which TEI representation for the output of automatic transcriptions and their metadata? An illustrated proposition. Preprint.
The recent and fast development of automatic transcription software is accompanied by a growing heterogeneity of formats to save the output of such a task. TEI P5 can be helpful to simplify workflows and bring in more coherence in digitization pipelines. We present a twofold modelization in TEI which brings together essential information resulting from the transcription phase with the editorial layers. The usefulness of this modelization is illustrated with several examples showing how such an approach can be leveraged at different stages of a digitization pipeline.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina Mcmillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco de Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Hajihosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael Mckenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel de Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-Aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada and Thomas Wolf. 2022. 
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Preprint.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot and Jackie Chi Kit Cheung. 2022. MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification. Preprint.
In text summarization and simplification, system outputs must be evaluated along multiple dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide range of possible outputs could be of high quality. These properties make the development of an adaptable, reference-less evaluation metric both necessary and challenging. We introduce MaskEval, a reference-less metric for text summarization and simplification that operates by performing masked language modeling (MLM) on the concatenation of the candidate and the source texts. It features an attention-like weighting mechanism to modulate the relative importance of each MLM step, which crucially allows it to be adapted to evaluate different quality dimensions. We demonstrate its effectiveness on English summarization and simplification in terms of correlations with human judgments, and explore transfer scenarios between the two tasks.
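The scoring loop described in the abstract (mask each candidate token, score it with a masked language model over the concatenation of candidate and source, then combine per-token scores with weights) can be caricatured in a few lines. The "MLM" below is a toy unigram model, purely for illustration; it is not the actual MaskEval implementation:

```python
from collections import Counter

def toy_mlm_prob(token, context_tokens):
    """Toy stand-in for an MLM: add-one-smoothed unigram probability of the
    masked token, estimated from the visible context."""
    counts = Counter(context_tokens)
    return (counts[token] + 1) / (len(context_tokens) + len(set(context_tokens)) + 1)

def maskeval_like_score(source, candidate, weights=None):
    """Mask each candidate token in turn, score it against source + remaining
    candidate tokens, and combine per-token probabilities with weights."""
    src, cand = source.lower().split(), candidate.lower().split()
    weights = weights or [1.0] * len(cand)  # uniform in place of learned weights
    probs = []
    for i, tok in enumerate(cand):
        context = src + cand[:i] + cand[i + 1:]
        probs.append(toy_mlm_prob(tok, context))
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

src = "the cat sat on the mat"
print(maskeval_like_score(src, "the cat sat on the mat"))   # faithful candidate
print(maskeval_like_score(src, "quantum turbines roared loudly"))  # unrelated one
```

A candidate consistent with the source scores higher than an unrelated one; in the paper, the uniform weights are replaced by an attention-like mechanism that modulates each MLM step's importance per quality dimension.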
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2022. Generative Spoken Dialogue Language Modeling: preprint version. Preprint.
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking. Generation samples can be found at: https://speechbot.github.io/dgslm.
Floriane Chiffoleau and Anne Baillot. 2022. Le projet DAHN : une pipeline pour l'édition numérique de documents d'archives. Preprint.
Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco de Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien and Yacine Jernite. 2022. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. Preprint.
In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot and Samson Tan. 2022. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Preprint.
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras, showing how hybrid approaches combining words and characters, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications, and that thinking seriously about tokenization remains important for many applications.
Louis Martin. 2021. Automatic sentence simplification using controllable and unsupervised methods. PhD thesis. Sorbonne Université.
In this thesis we study the task of automatic sentence simplification. We first study the different methods used to evaluate simplification models, highlight several shortcomings of current approaches, and propose new contributions. We then propose to train sentence simplification models that can be adapted to the target user, allowing for greater simplification flexibility. Finally, we extend the scope of sentence simplification to several languages, by proposing methods that do not require annotated training data, but that nevertheless achieve very strong performance.
Journal articles
Frank Uiterwaal, Franco Niccolucci, Sheena Bassett, Steven Krauwer, Hella Hollander, Femmy Admiraal, Laurent Romary, George Bruseker, Carlo Meghini, Jennifer Edmond and Mark Hedges. 2021. From disparate disciplines to unity in diversity: How the PARTHENOS project has brought European humanities Research Infrastructures together. International Journal of Humanities and Arts Computing 15, pages 101–116. Edinburgh University Press.
Since the first ESFRI roadmap in 2006, multiple humanities Research Infrastructures (RIs) have been set up all over the European continent, supporting archaeologists (ARIADNE), linguists (CLARIN-ERIC), Holocaust researchers (EHRI), cultural heritage specialists (IPERION-CH) and others. These examples only scratch the surface of the breadth of research communities that have benefited from close cooperation in the European Research Area. While each field developed discipline-specific services over the years, common themes can also be distinguished. All humanities RIs address, to varying degrees, questions around research data management, the use of standards and the desired interoperability of data across disciplinary boundaries. This article sheds light on how the cluster project PARTHENOS developed pooled services and shared solutions for its audience of humanities researchers, RI managers and policymakers. At a time when the convergence of existing infrastructures is becoming ever more important – with the construction of a European Open Science Cloud as an audacious, ultimate goal – we hope that our experiences inform future work and provide inspiration on how to exploit synergies in interdisciplinary, transnational, scientific cooperation.
Rachel Bawden. 2021. [Book Review] Understanding Dialogue: Language Use and Social Interaction. Computational Linguistics. Massachusetts Institute of Technology Press (MIT Press).
Luca Foppiano, Sae Dieb, Akira Suzuki, Pedro Baptista de Castro, Suguru Iwasaki, Azusa Uzuki, Miren Garbine Esparza Echevarria, Yan Meng, Kensei Terashima, Laurent Romary, Yoshihiko Takano and Masashi Ishii. 2021. SuperMat: Construction of a linked annotated dataset from superconductors-related publications. Science and Technology of Advanced Materials: Methods 1. Taylor & Francis.
A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.
Naomi Truan and Laurent Romary. 2021. Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A Cross-Linguistic Account. Journal of the Text Encoding Initiative. TEI Consortium.
This data paper introduces an integrative and comprehensive method for the linguistic annotation of parliamentary discourse. Initially conceived as a documentation for a specific and rather small-scale research project, the annotation scheme takes into account national specificities and is geared to proposing an annotation scheme that is both highly standardised and adaptable to other research contexts. The paper reads as a specific application of the Text Encoding Initiative (TEI) framework applied to a subset of parliamentary debates. This strategy has two main applications: first, to develop a model for the encoding of parliamentary corpora by providing a systematic way of annotating both elements within the text (e.g. turns, incidents, interruptions) and the metadata associated with it (e.g. variables pertaining to the speaker or the speech event); second, to provide a cross-linguistic empirical basis for further annotation projects.
Conference proceedings
José Carlos Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2021. Understanding the Impact of UGC Specificities on Translation Quality. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 189–198. Association for Computational Linguistics. Online.
This work takes a critical look at the evaluation of user-generated content automatic translation, the well-known specificities of which raise many challenges for MT. Our analyses show that measuring the average-case performance using a standard metric on a UGC test set falls far short of giving a reliable image of the UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.
José Carlos Rosales Núñez, Guillaume Wisniewski and Djamé Seddah. 2021. Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 199–211. Association for Computational Linguistics. Online.
This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC), with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2021. Challenging the Semi-Supervised VAE Framework for Text Classification. In Second Workshop on Insights from Negative Results in NLP (colocated with EMNLP). Association for Computational Linguistics. Punta Cana, Dominican Republic.
Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data-efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification as we exhibit two sources of overcomplexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup where the result of training is a text classifier. These simplifications are the removal of (i) the Kullback-Leibler divergence from its objective and (ii) the fully unobserved latent variable from its probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplifications, experiments show a speed-up of 26%, while keeping equivalent classification scores. The code to reproduce our experiments is public.
Arij Riabi, Benoît Sagot and Djamé Seddah. 2021. Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios? In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 423–436. Association for Computational Linguistics. Online.
Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high-language-variability settings.
Lana Yeganova, Dina Wiemann, Mariana Neves, Federica Vezzani, Amy Siu, Inigo Jauregi Unanue, Maite Oronoz, Nancy Mah, Aurélie Névéol, David Martinez, Rachel Bawden, Giorgio Maria Di Nunzio, Roland Roller, Philippe Thomas, Cristian Grozea, Olatz Perez-de-Viñaspre, Maika Vicente Navarro and Antonio Jimeno Yepes. 2021. Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set. In Proceedings of the Sixth Conference on Machine Translation, pages 664–683. Association for Computational Linguistics. Online.
In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Lionel Tadonfouet Tadjou, Fabrice Bourge, Tiphaine Marie, Laurent Romary and Eric Villemonte de La Clergerie. 2021. Building A Corporate Corpus For Threads Constitution. In Student Research Workshop associated with the International Conference on Recent Advances in Natural Language Processing (RANLP'2021). Online, Bulgaria.
In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of thread reconstruction is to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion or reviewing upcoming commitments or deadlines. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora, including the different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and compliance with the secrecy of correspondence.
Simon Gabay, Barbara Topalov, Caroline Corbières, Lucie Rondeau Du Noyer, Béatrice Joyeux-Prunel and Laurent Romary. 2021. Automating Artl@s – extracting data from exhibition catalogues. In EADH 2021 - Second International Conference of the European Association for Digital Humanities. Krasnoyarsk, Russia.
Catalogues, which have been published for centuries, are an extremely precious resource for scholars. Using the Artl@s database as an example, where exhibition catalogues are transformed into a georeferenced database, we question the possibility of an (almost) automatic transformation of pdfs into semantically annotated data. To do so, we present and analyse the graphic organisation of exhibition catalogues, before exploring a possible modeling into TEI (involving possible enhancement of the guidelines).
Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2021. Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus. In CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora. Limerick / Virtual, Ireland.
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
Syrielle Montariol and Alexandre Allauzen. 2021. Transport Optimal pour le Changement Sémantique à partir de Plongements Contextualisés. In TALN 2021 - Traitement Automatique des Langues Naturelles, pages 235–244. ATALA. Lille / Virtuel, France.
Several methods for detecting semantic change using contextualised word embeddings have appeared recently. They allow a fine-grained analysis of changes in word usage by aggregating contextualised embeddings into clusters that reflect the different usages of a word. We propose a new method based on optimal transport. We evaluate it on several annotated corpora, showing a gain in precision compared with other methods based on contextualised embeddings, and illustrate it on a corpus of newspaper articles.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2021. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462. Association for Computational Linguistics. Online.
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Clémentine Fourrier, Rachel Bawden and Benoît Sagot. 2021. Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 847–861. Association for Computational Linguistics. Online.
Cognate prediction is the task of generating, in a given language, the likely cognates of words in a related language, where cognates are words in related languages that have evolved from a common ancestor word. It is a task for which little data exists and which can aid linguists in the discovery of previously undiscovered relations. Previous work has applied machine translation (MT) techniques to this task, based on the tasks' similarities, without, however, studying their numerous differences or optimising architectural choices and hyper-parameters. In this paper, we investigate whether cognate prediction can benefit from insights from low-resource MT. We first compare statistical MT (SMT) and neural MT (NMT) architectures in a bilingual setup. We then study the impact of employing data augmentation techniques commonly seen to give gains in low-resource MT: monolingual pretraining, backtranslation and multilinguality. Our experiments on several Romance languages show that cognate prediction behaves only to a certain extent like a standard low-resource MT task. In particular, MT architectures, both statistical and neural, can be successfully used for the task, but using supplementary monolingual data is not always as beneficial as using additional language data, contrary to what is observed for MT.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2214–2231. Association for Computational Linguistics. Online.
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Rute Costa, Ana Salgado, Anas Fahad Khan, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2021. MORDigital: The Advent of a New Lexicographical Portuguese Project. In eLex 2021 - Seventh biennial conference on electronic lexicography. Brno, Czech Republic.
MORDigital is a newly funded Portuguese lexicographical project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the various investigation domains of the project description. The output of the project will be made available via a dedicated platform.
Antoine Gérard, Benoît Sagot and Emilie Pons. 2021. Le Traitement Automatique des Langues au service du vin. In Dataquitaine 2021 - IA, Recherche Opérationnelle & Data Science. Bordeaux / Virtual, France.
In this presentation, we describe a fruitful collaboration between the Inria research institute and a Bordeaux-based startup, Winespace. We focus on the semantic analysis of wine-tasting notes with the aim of recommending wines with similar characteristics.
Farid Arthaud, Rachel Bawden and Alexandra Birch. 2021. Few-shot learning through contextual data augmentation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1049–1062. Association for Computational Linguistics. Online.
Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in human-submitted translations, and (ii) corresponding evaluation metrics to compare our approaches. We extend a data augmentation approach using a pre-trained language model to create training examples with similar contexts for novel words. We compare different fine-tuning and data augmentation approaches and show that adaptation on the scale of one to five examples is possible. Combining data augmentation with randomly selected training sentences leads to the highest BLEU score and accuracy improvements. Impressively, with only 1 to 5 examples, our model reports better accuracy scores than a reference system trained on an average of 313 parallel examples.
Communications
Alix Chagué. 2021. CREMMA : Une infrastructure mutualisée pour la reconnaissance d'écritures manuscrites et la patrimonialisation numérique. In Sciences du patrimoine - sciences du texte. Confrontation des méthodes. Paris, France.
Hugo Scheithauer, Alix Chagué, Aurélia Rostaing, Lucas Terriel, Laurent Romary, Marie-Françoise Limon-Bonnet, Benjamin Davy, Gaetano Piraino, Franck Beltrami, Danis Habib, Nathalie Denis and Marc Durand. 2021. Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées, AI4LAM. Paris, France.
For this workshop, participants will take part in the fine-tuning of a handwritten text recognition (HTR) model with eScriptorium. Fine-tuning a model means retraining an initial generic model with a new dataset in order to specialize it in a particular domain.
Hugo Scheithauer, Alix Chagué and Laurent Romary. 2021. From eScriptorium to TEI Publisher. In Brace your digital scholarly edition!. Berlin, Germany.
Lucas Terriel. 2021. Atelier : Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. Évaluer son modèle HTR/OCR avec KaMI (Kraken as Model Inspector). In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.
Pauline Charbonnier, Lucas Terriel, Florence Clavaud, Laurent Romary, Gaetano Piraino and Vincent Verdese. 2021. NER4Archives (named entity recognition for archives) : méthodes et outils semi-automatiques pour reconnaître les entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.
Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP : Lecture Automatique des Répertoires de Notaires Parisiens. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.
Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP: Paris Notary Record Books Automated Reading. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.
Floriane Chiffoleau, Anne Baillot and Manon Ovide. 2021. A TEI-based publication pipeline for historical egodocuments - the DAHN project. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.
Alix Chagué, Thibault Clérice and Laurent Romary. 2021. HTR-United : Mutualisons la vérité de terrain ! In DHNord2021 - Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux. Lille, France.
Hugo Scheithauer, Alix Chagué, Simon Gabay, Laurent Romary, Juliette Janes and Claire Jahan. 2021. From page to content – which TEI representation for HTR output? In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Weaton (virtual), United States.
Alexandre Bartz, Juliette Janes, Laurent Romary, Philippe Gambette, Rachel Bawden, Pedro Ortiz Suarez, Benoît Sagot and Simon Gabay. 2021. Expanding the content model of annotationBlock. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.
Simon Gabay, Philippe Gambette, Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2021. Variation graphique dans les documents d'Ancien Régime : Nouvelles approches scriptométriques. In Journée d'étude : « Pour une histoire de la langue ‘par en bas' : textes privés et variation des langues dans le passé ». Paris, France.
Jean-Damien Généro, Alix Chagué, Victoria Le Fourner and Marie Puren. 2021. Transcribing and editing digitized sources on work in the textile industry. In Rémunérations et usages du temps des hommes et des femmes dans le textile en France de la fin du XVIIe au début du XXe siècle. Lyon, France.
Historians have been using digital tools for several decades. The Time-Us project is part of this long tradition, developing experimental methods for the automatic transcription (OCR) and structuring (XML) of handwritten archival documents and book collections. The sets chosen to illustrate this work are the minutes of the Conseil des prud'hommes de Paris (1847-1848, 1858, 1878) and the monographs of the Ouvriers des deux mondes (1857-1913, 1930). Two stages will be presented. The first is the process of analysing and reproducing logical structures (minutes of the labour court hearings and sections of the monographs), conducted on a ridge between the machine (automation of tasks) and the human hand (manual verifications and corrections). The second is the extraction of textile-related information from the monographs and its availability to researchers. Finally, proposals will be made regarding the possible uses of digital technology in research programmes.
Simon Gabay and Pedro Javier Ortiz Suárez. 2021. A dataset for automatic detection of places in (early) modern French texts. In Proceedings of the 50th Annual North American Society for Seventeenth-Century French Literature Conference. Online.
Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. In WPIP21 - What's Past is Prologue: The NewsEye International Conference. Virtual, Austria.
The automated processing of documents for online publication and exploration by the humanities speeds up tasks such as transcription, but it should also be an opportunity to make the experiments and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) is a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. Taking egodocuments as an example, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for processing non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notion of a black box or of scattered tools that tend to be hard to maintain in the long run.
Alix Chagué and Aurélia Rostaing. 2021. Présentation du projet Lectaurep (Lecture automatique de répertoires). In Atelier sur la transcription des écritures manuscrites - BnF DataLab. Paris, France.
Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah and Jacopo Staiano. 2021. Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic.
Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performances of state-of-the-art multilingual models are significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
Tech reports
Julien Launay, Elena Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli and Djamé Seddah. 2021. PAGnol: An Extra-Large French Generative Model. Technical report.
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstractive summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large size are made publicly available.
Toma Tasovac, Laurent Romary, Erzsébet Tóth-Czifra and Irena Marinski. 2021. Lexicographic Data Seal of Compliance. Technical report.
Other
Alix Chagué. 2021. Comment faire lire des gribouillis à mon ordinateur ?
Preprints
Laurent Romary. 2021. Normes et patrimoine numérique. Preprint.
Thomas Scialom, Louis Martin, Jacopo Staiano, Eric Villemonte de La Clergerie and Benoît Sagot. 2021. Rethinking Automatic Evaluation in Sentence Simplification. Preprint.
Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, this is particularly challenging: the task requires by nature to replace complex words with simpler ones that share the same meaning. This limits the effectiveness of n-gram based metrics like BLEU. Going hand in hand with the recent advances in NLG, new metrics have been proposed, such as BERTScore for Machine Translation. In summarization, the QuestEval metric proposes to automatically compare two texts by questioning them. In this paper, we first propose a simple modification of QuestEval allowing it to tackle Sentence Simplification. We then extensively evaluate the correlations with human judgement for several metrics, including the recent BERTScore and QuestEval, and show that the latter obtains state-of-the-art correlations, outperforming standard metrics like BLEU and SARI. More importantly, we also show that a large part of the correlations are actually spurious for all the metrics. To investigate this phenomenon further, we release a new corpus of evaluated simplifications, this time not generated by systems but instead written by humans. This allows us to remove the spurious correlations and draw very different conclusions from the original ones, resulting in a better understanding of these metrics. In particular, we raise concerns about the very low correlations of most traditional metrics. Our results show that the only significant measure of Meaning Preservation is our adaptation of QuestEval.
Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. Preprint.
The automated processing of documents for online publication and exploration by the humanities speeds up tasks such as transcription, but it should also be an opportunity to make the experiments and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) is a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. Taking egodocuments as an example, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for processing non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notion of a black box or of scattered tools that tend to be hard to maintain in the long run.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. Preprint.
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language not seen during fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance for the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Benjamin Muller, Benoît Sagot and Djamé Seddah. 2021. Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi. Preprint.
Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.
Louis Martin, Angela Fan, Eric Villemonte de La Clergerie, Antoine Bordes and Benoît Sagot. 2021. Multilingual Unsupervised Sentence Simplification. Preprint.
Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
Mohamed Khemakhem. 2020. Standard-based Lexical Models for Automatically Structured Dictionaries. PhD thesis. Université Paris Cité.
Dictionaries could be considered the most comprehensive reservoir of human knowledge, carrying not only the lexical description of words in one or more languages, but also the common awareness of a certain community about every known piece of knowledge in a time frame. Print dictionaries are the principal resources which enable the documentation and transfer of such knowledge. They already exist in abundant numbers, while new ones are continuously compiled, even with the recent strong move to digital resources. However, a majority of these dictionaries, even when available digitally, is still not fully structured due to the absence of scalable methods and techniques that can cover the variety of the corresponding material. Moreover, the relatively few existing structured resources present limited exchange and query alternatives, given the discrepancy of their data models and formats. In this thesis we address the task of parsing lexical information in print dictionaries through the design of computer models that enable their automatic structuring. Solving this task goes hand in hand with finding a standardised output for these models to guarantee maximum interoperability among resources and usability for downstream tasks. First, we present different classifications of dictionary resources to delimit the category of print dictionaries we aim to process. Second, we introduce the parsing task by providing an overview of the processing challenges and a study of the state of the art. Then, we present a novel approach based on a top-down parsing of the lexical information. We also outline the architecture of the resulting system, called GROBID-Dictionaries, and the methodology we followed to close the gap between the conception of the system and its applicability to real-world scenarios. After that, we draw the landscape of the leading standards for structured lexical resources.
In addition, we provide an analysis of two ongoing initiatives, TEI-Lex-0 and LMF, that aim at unifying the modelling of lexical information in print and electronic dictionaries. Based on that, we present a serialisation format that is in line with the schemes of the two standardisation initiatives and fits the approach implemented in our parsing system. After presenting the parsing and standardised serialisation facets of our lexical models, we provide an empirical study of their performance and behaviour. The investigation is based on a specific machine learning setup and series of experiments carried out with a selected pool of varied dictionaries. In this study we present different approaches to feature engineering and exhibit the strengths and limits of the best resulting models. We also dedicate two series of experiments to exploring the scalability of our models with regard to the processed documents and the employed machine learning technique. Finally, we sum up this thesis by presenting the major conclusions and opening new perspectives for extending our investigations in a number of research directions for parsing entry-based documents.
Jack Bowers. 2020. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. PhD thesis. École Pratique des Hautes Études.
This dissertation concerns a language documentation project covering the Mixtepec-Mixtec variety of Mixtec (ISO 639-3: mix). Mixtepec-Mixtec is an Oto-Manguean language spoken by roughly 9,000-10,000 people in San Juan Mixtepec Municipality in the Juxtlahuaca district of Oaxaca, Mexico, and by several thousand speakers living in Baja California, Tlaxiaco, and Santiago Juxtlahuaca. There are also significant populations in the United States, most notably in California, around Santa Maria and Oxnard, as well as in Oregon, Florida, and Arkansas. The core facets of the work are: the creation of a body of linguistic resources for the MIX language and community; the evaluation of the current tools, standards and practices used in language documentation; and an account of how the TEI and related XML technologies can be used as the primary encoding, metadata, and annotation format for multi-dimensional linguistic projects, including under-resourced languages. The concrete resources produced are: a multilingual TEI dictionary; a collection of audio recordings published and archived on Harvard Dataverse; a corpus of texts derived from a combination of spoken language transcriptions and texts encoded and annotated in TEI; as well as linguistic and lexicographic descriptions and analyses of the Mixtepec-Mixtec language. Due to the array of different data and resources produced, this project has components that fall equally within the fields of digital humanities, language documentation, language description and corpus linguistics. Because of this overlapping relevance, and in attempting to carry out this work in line with best practices in each sub-field, this work addresses the need to further bring together the intersecting interests, technologies, practices and standards relevant to, and used in, each of these related fields.
Loïc Grobol. 2020. Coreference resolution for spoken French. PhD thesis. Université Sorbonne Nouvelle - Paris 3.
A coreference chain is the set of linguistic expressions — or mentions — that refer to the same entity or discourse object in a given document. Coreference resolution consists in detecting all the mentions in a document and partitioning their set into coreference chains. Coreference chains play a central role in the consistency of documents and interactions, and their identification has applications in many other fields of natural language processing that rely on an understanding of language, such as information extraction, question answering or machine translation. Natural language processing systems that perform this task exist for many languages, but none for French — which until recently suffered from a lack of suitable annotated resources — and none for spoken language. In this thesis, we aim to fill this gap by designing a coreference resolution system for spoken French. To this end, we propose a knowledge-poor system based on an end-to-end neural network architecture, which obviates the need for the preprocessing pipelines common in existing systems, while maintaining performance comparable to the state of the art. We then propose extensions to that baseline, augmenting our system with external knowledge obtained from resources and preprocessing tools designed for written French. Finally, we propose a new standard representation for coreference annotation in corpora of written and spoken languages, and demonstrate its use in a new version of ANCOR, the first coreference corpus of spoken French.
Journal articles
Xinying Chen and Kim Gerdes. 2020. Dependency Distances and Their Frequencies in Indo-European Language. Journal of Quantitative Linguistics, pages 1–20. Taylor & Francis (Routledge).
The present study investigates the relationship between two features of dependencies, namely, dependency distances and dependency frequencies. The study is based on the analysis of a parallel dependency treebank that includes 10 Indo-European languages. Two corresponding random dependency treebanks are generated as baselines for comparison. After computing the values of dependency distances and their frequencies in these treebanks, for each language, we fit four functions, namely quadratic, exponent, logarithm, and power-law functions, to its original and random datasets. The preliminary result shows that there is a relation between the two dependency features for all 10 Indo-European languages. The relation can be further formalized as a power-law function which can distinguish the observed data from randomly generated datasets.
Laurent Romary. 2020. Découpler gestion des manuscrits de publication et évaluation par les pairs : la plateforme de gestion de revues Épisciences. I2D -- Information, données & documents. A.D.B.S.
Based on an original model, the Épisciences platform, which currently hosts 15 journals, offers a complete tool for managing a journal, hosting it, and disseminating its contents. It hosts open-access journals (epi-journals) and handles the process of submitting articles to these journals via deposit in an open archive such as HAL. Documentation professionals play a decisive supporting role here.
Andrea Bertino, Luca Foppiano, Laurent Romary and Pierre Mounier. 2020. Leveraging Concepts in Open Access Publications. Journal of Data Mining and Digital Humanities, 2019. Episciences.org.
This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI and provides automatic entity recognition and disambiguation using the Wikipedia and Wikidata data sets. The application is distributed with an open-source licence, and it has been deployed as a web service in DARIAH's infrastructure hosted by the French HumaNum. In the paper, we focus on the specific issues related to its integration on five OA platforms specialized in the publication of scholarly monographs in the social sciences and humanities (SSH), as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). In the first section, we give a brief overview of the current status and evolution of OA publications, considering specifically the challenges that OA monographs are encountering. In the second part, we show how the HIRMEOS project aims to face these challenges by optimizing five OA digital platforms for the publication of monographs from the SSH and ensuring their interoperability. In sections three and four we give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases together with some further possible ideas on how to exploit the annotations generated. We show that entity-fishing annotations can improve both research and publishing process. In the last chapter, we briefly present further possible application scenarios that could be made available through infrastructural projects.
Luca Foppiano and Laurent Romary. 2020. Entity-fishing: a DARIAH entity recognition and disambiguation service. Journal of the Japanese Association for Digital Humanities, 5, pages 22–60. Japanese Association for Digital Humanities.
This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing easy deployment in a variety of potential digital humanities contexts. Initially developed in the context of the FP7 EU project CENDARI, the software was well received by the user community and continued to be further developed within the H2020 HIRMEOS project, where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access. entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API which allows easier and seamless integration, a language-independent and stable convention, and a widely used service-oriented architecture (SOA) design. Input and output data are carried over a query data model with a defined structure, providing flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, like language recognition, sentence segmentation and modules for accessing and looking up concepts in the knowledge base. The API itself integrates more advanced contextual parametrisation or ranked outputs, allowing for resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off proposal, which has been submitted for integration into the TEI guidelines.
The representation is also compliant with the Web Annotation Data Model (WADM). In this paper we aim at describing the functionalities of the service as a reference contribution to the subject of web-based NERD services. We detail the workflow from input to output and unpack each building block in the processing flow. In addition, with a more academic approach, we provide a transversal schema of the different components, taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. We also describe the underlying knowledge base, which is set up on the basis of Wikipedia and Wikidata content. We conclude the paper by presenting our solution for the service deployment: how and which resources were allocated. The service has been in production since Q3 of 2017, and extensively used by the H2020 HIRMEOS partners during the integration with the publishing platforms.
Conference proceedings
Hila Gonen, Ganesh Jawahar, Djamé Seddah and Yoav Goldberg. 2020. Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 538–555. Association for Computational Linguistics. Online.
The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and, as we show in this work, result in unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
Gaël Guibon, Marine Courtin, Kim Gerdes and Bruno Guillaume. 2020. When Collaborative Treebank Curation Meets Graph Grammars. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5291–5300. European Language Resources Association. Marseille, France.
In this paper we present Arborator-Grew, a collaborative annotation tool for treebank development. Arborator-Grew combines the features of two preexisting tools: Arborator and Grew. Arborator is a widely used collaborative graphical online dependency treebank annotation tool. Grew is a tool for graph querying and rewriting specialized in structures needed in NLP, i.e. syntactic and semantic dependency trees and graphs. Grew also has an online version, Grew-match, where all Universal Dependencies treebanks in their classical, deep and surface-syntactic flavors can be queried. Arborator-Grew is a complete redevelopment and modernization of Arborator, replacing its own internal database storage by a new Grew API, which adds a powerful query tool to Arborator's existing treebank creation and correction features. This includes complex access control for parallel expert and crowd-sourced annotation, tree comparison visualization, and various exercise modes for teaching and training of annotators. Arborator-Grew opens up new paths of collectively creating, updating, maintaining, and curating syntactic treebanks and semantic graph banks.
Pedro Javier Ortiz Suárez, Yoann Dupont, Gaël Lejeune and Tian Tian. 2020. SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German. In CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. Thessaloniki / Virtual, Greece.
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing in old newspapers. The challenge proposed various tasks for three languages; among them we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that segmentation into sentences has an important impact on the results.
Robin Algayres, Mohamed Salah Zaiem, Benoît Sagot and Emmanuel Dupoux. 2020. Evaluating the Reliability of Acoustic Speech Embeddings. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH 2020), pages 4621–4625.
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimize the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese networks). Then we use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver-bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
Tanti Kristanti and Laurent Romary. 2020. DeLFT and entity-fishing: Tools for CLEF HIPE 2020 Shared Task. In CLEF 2020 - Conference and Labs of the Evaluation Forum 2696. CEUR. Thessaloniki / Virtual, Greece.
This article presents an overview of the approaches and results from our participation in the CLEF HIPE 2020 NERC-COARSE-LIT and EL-ONLY tasks for English and French. For these two tasks, we use two systems: 1) DeLFT, a Deep Learning framework for text processing; 2) entity-fishing, a generic named entity recognition and disambiguation service deployed in the technical framework of INRIA.
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot and Lucia Specia. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679. Association for Computational Linguistics. Online.
In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replace complex words or phrases with simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings. To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219. Online.
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot and Abhishek Srivastava. 2020. Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150. Association for Computational Linguistics. Online.
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent use of code-switching. Comprising 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for most recent NLP approaches.
Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714. Association for Computational Linguistics. Online.
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Clémentine Fourrier. 2020. Évolution phonologique des langues et réseaux de neurones : travaux préliminaires (Sound change and neural networks: preliminary experiments). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 3 : Rencontre des Étudiants Chercheurs en Informatique pour le TAL, pages 110–122. ATALA et AFCP. Nancy, France.
Cognate prediction is a key task in historical linguistics that presents a number of similarities with machine translation. However, although neural methods are now widespread in machine translation, they are still largely unused in historical linguistics. In this paper, we study the performance of neural methods (more specifically encoder-decoder networks) for the task of cognate prediction. We focus in particular on the types of data that can be used for this task, and compare the performance of statistical and neural methods. We show that sound correspondences can only be learned using cognate datasets, and that statistical and neural methods seem to have complementary strengths and weaknesses regarding what they learn about the data.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Benoît Sagot and Djamé Seddah. 2020. Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles, pages 54–65. ATALA et AFCP. Nancy, France.
Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages. This made the practical use of models in all languages except English very limited. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state of the art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance compared to models trained on more uniform data. Moreover, we show that a relatively small amount of web-crawled data (4GB) leads to downstream performance as good as that of a model pretrained on a corpus two orders of magnitude larger (138GB).
Murielle Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot and Éric Villemonte de La Clergerie. 2020. French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus. In CMLC-8 - 8th Workshop on the Challenges in the Management of Large Corpora. Marseille, France.
This paper describes and compares the impact of different types and sizes of training corpora on language models like ELMo. By asking the fundamental question of quality versus quantity, we evaluate four French training corpora on parsing, POS tagging and named-entity recognition downstream tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language and therefore yield better evaluation scores on different evaluation sets and tasks.
Louis Martin, Éric Villemonte de La Clergerie, Benoît Sagot and Antoine Bordes. 2020. Controllable Sentence Simplification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4689–4698. European Language Resources Association. Marseille, France.
Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however, multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control over simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.
Clémentine Fourrier and Benoît Sagot. 2020. Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3207–3216. European Language Resources Association. Marseille, France.
Diachronic lexical information was, until recently, mostly used in its natural field, historical linguistics; promising but not yet conclusive applications to machine translation for low-resource languages have started extending its usage to NLP. There is therefore a new need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines for generating such resources, covering each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low-resource machine translation and the study of medieval languages.
Gaël Guibon and Benoît Sagot. 2020. OFrLex: A Computational Morphological and Syntactic Lexicon for Old French. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3217–3225. European Language Resources Association. Marseille, France.
In this paper we describe our work on the development and enrichment of OFrLex, a freely available, large-coverage morphological and syntactic Old French lexicon. We rely on several heterogeneous language resources to extract structured and exploitable information. The extraction follows a semi-automatic procedure with substantial manual steps to respond to difficulties encountered while aligning lexical entries from distinct language resources. OFrLex aims at improving natural language processing tasks on Old French such as part-of-speech tagging and dependency parsing. We provide quantitative information on OFrLex and discuss its reliability. We also describe and evaluate a semi-automatic, word-embedding-based lexical enrichment process aimed at increasing the accuracy of the resource. Results of this extension technique will be manually validated in the near future, a step that will take advantage of OFrLex's viewing, searching and editing interface, which is already accessible online.
Fahad Khan, Laurent Romary, Ana Salgado, Jack Bowers, Mohamed Khemakhem and Toma Tasovac. 2020. Modelling Etymology in LMF/TEI: The Grande Dicionário Houaiss da Língua Portuguesa Dictionary as a Use Case. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3172–3180. European Language Resources Association. Marseille, France.
In this article, we introduce two parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We demonstrate the use of both parts by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and, in a couple of cases, also in TEI.
Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot. 2020. Establishing a New State-of-the-Art for French Named Entity Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4631–4638. European Language Resources Association. Marseille, France.
The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful pieces of information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations.
Clémentine Fourrier and Benoît Sagot. 2020. Comparing Statistical and Neural Models for Learning Sound Correspondences. In LT4HALA 2020 - First Workshop on Language Technologies for Historical and Ancient Languages. Marseille, France.
Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best performing model type depends on several parameters, such as the size of the training data, the ambiguity, and the prediction direction.
Simon Gabay, Lucie Rondeau Du Noyer and Mohamed Khemakhem. 2020. Selling autograph manuscripts in 19th c. Paris: digitising the Revue des Autographes. In IX Convegno AIUCD. Milan, Italy.
In Paris, the manuscript market appeared in the early 1820s. Fixed-price catalogues and auction catalogues were regularly published, describing each document in detail. Since such descriptions are highly formalised, it is possible to extract and structure them (almost) automatically, and thus create a database of manuscripts sold in 19th-century Paris.
Communications
Gabriela Elgarrista, Frédérique Mélanie-Becquet, Carmen Brando, Mohamed Khemakhem, Laurent Romary and Jean-Luc Pinol. 2020. Pipeline to process and analyze Paris's old property address directories (XIXe-XXe). In CLARIN Annual Conference. Paris (online), France.
Mohamed Khemakhem, Simon Gabay, Béatrice Joyeux-Prunel, Laurent Romary, Léa Saint-Raymond and Lucie Rondeau Du Noyer. 2020. Information Extraction Workflow for Digitised Entry-based Documents. In DARIAH Annual event 2020. Zagreb / Virtual, Croatia.
Book chapters
Benoît Sagot. 2020. A new PIE root *h1er ‘(to be/become) dark red’. In Loanwords and Substrata 164.
Romain Garnier and Benoît Sagot. 2020. New results on a centum substratum in Greek: the Lydian connection. In Loanwords and Substrata 164.
Jennifer Edmond, Frank Fischer, Laurent Romary and Toma Tasovac. 2020. 9. Springing the Floor for a Different Kind of Dance. In Digital Technology and the Practices of Humanities Research, pages 207–234. Open Book Publishers.
Jennifer Edmond and Laurent Romary. 2020. 3. Academic Publishing. In Digital Technology and the Practices of Humanities Research, pages 49–80. Open Book Publishers.
Anne Baillot. 2020. Zahlenwahn oder Textliebe? Digitale Philologie als Disziplin und als Weltanschauung. In Machines/Maschinen. Les machines dans l'espace germanique: de l'automate de Kempelen à Kraftwerk. Presses Universitaires de Rennes.
Tech reports
Floriane Chiffoleau. 2020. Rapport d'avancement sur le projet DAHN (avec le soutien du MESRI). Technical report.
Other
Laurent Romary. 2020. Eléments de sciences ouvertes.
Lucas Terriel. 2020. Le saviez-vous ? Les répertoires de notaires ne sont pas seulement des images numérisées !
This post provides an overview of the data associated with the documents of LectAuRep (Automatic reading of directories), a project coordinated by Inria (ALMAnaCH project-team) and the French National Archives, which consists in applying handwritten text recognition techniques to notarial registers. It is part of a broader reflection on the creation of a TEI pivot format to centralize the metadata associated with the documents, together with the metadata generated during image processing with the eScriptorium transcription platform.
Jean-Damien Généro. 2020. Le corpus des Ouvriers des deux mondes : des images et des URLs.
While archival documents play a predominant role in the Time Us project, they do not account for all of its documentation. Printed materials are also present, in the form of three substantial dossiers: the collection of old Lyon newspapers, various printed works on the textile industry in 19th-century France, and the corpus of the Ouvriers des deux mondes. The Ouvriers des deux mondes are sociological surveys organized into 3 series and 126 monographs. Initiated by the sociologist Frédéric Le Play (1806-1882), the publication was carried out by the Société internationale des études pratiques d’économie sociale from 1857 to 1928 and comprises a total of 13 volumes, all of which can now be consulted on the Internet Archive website. In this post, we look at the transcription files of these volumes and at the link between them and the original digitized images. The lse od2m script, written by Alix Chagué, automatically segmented and transcribed the images, then encoded and structured the resulting raw text in XML-TEI, producing 13 XML files. These "source" files were then split into 222 XML files corresponding to as many logical divisions of the volumes: the monographs of course, but also the introductions, tables of contents and other paratextual elements. Verification operations then reduced the number of files to 192.
Alix Chagué, Lucas Terriel and Laurent Romary. 2020. Des images au texte : LECTAUREP, un projet de reconnaissance automatique d'écriture.
Laurent Romary. 2020. Les données de la recherche.
As part of Open Access Week, a presentation of current developments in research data management, in particular in the context of the French Ministry's open science plan.
Laurent Romary. 2020. Multilingual content management and standards with a view on AI developments.
Laurent Romary. 2020. An editorial and technical journey into Post Publication Peer Review (PPPR).
Laurent Romary. 2020. TEI guidelines: born to be open.
Open science has never been so high on research agendas, and this is true in all fields, ranging from the so-called hard sciences to the humanities. In this respect, those who have been dealing with the TEI guidelines for years, whether as users or designers of the standard, have experienced an environment which has always been open by construction and which fosters openness for projects based upon its principles. We outline the main issues related to open science in the current scholarly landscape, whether political or technical, and show the various aspects where the TEI environment has been seminal in setting up an open agenda that may enlighten the humanities at large in terms of good practices for, e.g., managing, documenting or disseminating scholarly sources and methods.
Preprints
Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot and Djamé Seddah. 2020. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Preprint.
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied to unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages significantly improves the ability of large-scale multilingual language models on downstream tasks.
Erzsébet Tóth-Czifra and Laurent Romary. 2020. The Heritage Data Reuse Charter: from principles to research workflows. Preprint.
There is a growing need to establish domain- or discipline-specific approaches to research data sharing workflows. A defining feature of data and data workflows in the arts and humanities domain is their dependence on cultural heritage sources hosted and curated in museums, libraries, galleries and archives. A major difficulty when scholars interact with heritage data is that the nature of the cooperation between researchers and Cultural Heritage Institutions (henceforth CHIs) is often constrained by structural and legal challenges, but even more by uncertainties as to the expectations of both parties. The Heritage Data Reuse Charter aims to address these by designing a common environment that will enable all the relevant actors to work together to connect and improve access to heritage data and make transactions related to the scholarly use of cultural heritage data more visible and transparent. As a first step, a wide range of stakeholders in the cultural heritage and research sectors agreed upon a set of generic principles, summarized in the Mission Statement of the Charter, that can serve as a baseline governing the interactions between CHIs, researchers and data centres. This was followed by a long and thorough validation process of these principles through surveys and workshops. As a second step, we now put forward a questionnaire template tool that helps researchers and CHIs translate the 6 core principles into specific research project settings. It contains questions about access to data, provenance information, preferred citation standards, hosting responsibilities, etc., on the basis of which the parties can arrive at mutual reuse agreements that could serve as a starting point for FAIR-by-construction data management, right from the project planning/application phase. The questionnaire template and the resulting mutual agreements can be flexibly applied to projects of different scales and in platform-independent ways.
Institutions can embed them into their own exchange protocols, while researchers can add them to their Data Management Plans. As such, they can show evidence of responsible and fair handling of cultural heritage data, and of fair (but also FAIR) research data management practices based on partnership with the holding institution.
Romain Garnier and Benoît Sagot. 2019. Metathesis of Proto-Indo-European Sonorants. Münchener Studien zur Sprachwissenschaft 73, pages 29–53. Verlag J.H. Röll GmbH.
Detlef Reineke and Laurent Romary. 2019. Bridging the gap between SKOS and TBX. edition - Die Fachzeitschrift für Terminologie 19. Deutscher Terminologie-Tag e.V. (DTT).
This article provides an in-depth comparison of, and a proposal for mapping between, Simple Knowledge Organization System (SKOS) and TermBase eXchange (TBX), two important exchange standards within the knowledge and terminology landscape. The attempt to develop an interface or conversion routine between SKOS and TBX is rooted in a strong demand in the language and knowledge industries for resource leverage, and is based on the premise that the two formalisms are governed by similar data models, namely the description of concepts (rather than words).
Laurent Romary and Charles Riondet. 2019. Towards multiscale archival digital data. Umanistica digitale. AIUCD - Associazione per l'Informatica Umanistica e la Cultura Digitale.
In this paper, we present some ideas on the use of archival standards in various contexts that exemplify the complexity of such standards and provide users with innovative ways to handle EAD content. Our main idea is that researchers, cultural heritage institutions, archival portals and standards maintenance bodies could greatly benefit from multiscale modelling of archival data, as well as from multiscale representations and documentation. A first step is on the way to being cleared in the domain of the management of heterogeneous archival sources in a single environment, namely a federated portal, as in EHRI. We built a methodology based on a specification and customisation method inspired by the long-standing experience of the Text Encoding Initiative (TEI) community. In the TEI framework, one has the possibility of defining project-specific subsets or extensions of the TEI guidelines while maintaining both the technical (XML schemas) and editorial (documentation) specification within a single framework. Using the same framework for EAD data allows us to express precise content-oriented rules, combined with some interesting possibilities for integrating the human-readable documentation in the validation process.
Conference proceedings
Laurent Romary. 2019. The place of lexicography in (computer) science. In The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI. Leiden, Netherlands.
Luca Foppiano, Laurent Romary, Masashi Ishii and Mikiko Tanifuji. 2019. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. In DocEng '19 - ACM Symposium on Document Engineering 2019, pages 1–4. ACM Press. Berlin, Germany.
We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the CRF (Conditional Random Field) algorithm [12] to extract quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
Benjamin Muller, Benoit Sagot and Djamé Seddah. 2019. Enhancing BERT for Lexical Normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 297–306. Association for Computational Linguistics. Hong Kong, China.
Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work adapting and analysing the ability of this model to handle noisy UGC data.
Hervé Bohbot, Francesca Frontini, Fahad Khan, Mohamed Khemakhem and Laurent Romary. 2019. Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource. In ELEX 2019: smart lexicography. Sintra, Portugal.
The Petit Larousse Illustré (PLI) is a monolingual French dictionary which has been published every year since the 1906 edition and which is therefore a fundamental testimony of the evolution of the French language. As a consequence of the pre-1948 editions of the PLI entering the public domain in 2018, the Nénufar (“Nouvelle édition numérique de fac-similés de référence”) project was launched at the Praxiling laboratory in Montpellier with the aim of digitising these editions and making them available electronically. The project is still ongoing; various selected editions per decade are going to be fully digitised (so far the 1906, 1924 and 1925 editions have been completed), and changes backtracked and dated per specific year.
Lucie Rondeau Du Noyer, Simon Gabay, Mohamed Khemakhem and Laurent Romary. 2019. Scaling up Automatic Structuring of Manuscript Sales Catalogues. In TEI 2019: What is text, really? TEI and beyond. Graz, Austria.
Manuscript Sales Catalogues (MSC) are highly important for authenticating documents and studying the reception of authors. Their regular publication throughout Europe since the beginning of the 19th c. has consequently raised interest in scaling up the means for automatically structuring their contents. Following successful first encoding tests with GROBID-Dictionaries [1,2] on a single MSC collection [3], we aim in this paper to present the results of more advanced tests of the system’s capacity to handle a larger corpus with MSCs from different dealers, and therefore multiple layouts.
Fernando Alva-Manchego, Louis Martin, Carolina Scarton and Lucia Specia. 2019. EASSE: Easier Automatic Sentence Simplification Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 49–54. Association for Computational Linguistics. Hong Kong, China.
We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus). Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.
Mathilde Regnault, Sophie Prévost and Éric Villemonte de La Clergerie. 2019. Challenges of language change and variation: towards an extended treebank of Medieval French. In TLT 2019 - 18th International Workshop on Treebanks and Linguistic Theories. Paris, France.
In order to automatically extend a treebank of Old French (9th–13th c.) with new texts in Old and Middle French (14th–15th c.), we need to adapt tools for syntactic annotation. However, these stages of French are subject to great variation, and parsing historical texts remains an issue. We chose to adapt a symbolic system, the French Metagrammar (FRMG), and develop a lexicon comparable to the Lefff lexicon for Old and Middle French. The final goal of our project is to model the evolution of language through the whole period of Medieval French (9th–15th c.).
Benoit Crabbé, Murielle Fabre and Christophe Pallier. 2019. Variable beam search for generative neural parsing and its relevance for the analysis of neuro-imaging signal. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1150–1160. Association for Computational Linguistics. Hong Kong, China.
This paper describes a method of variable beam size inference for Recurrent Neural Network Grammar (rnng) by drawing inspiration from sequential Monte-Carlo methods such as particle filtering. The paper studies the relevance of such methods for speeding up the computations of direct generative parsing for rnng. But it also studies the potential cognitive interpretation of the underlying representations built by the search method (beam activity) through analysis of neuro-imaging signal.
Kim Gerdes, Sylvain Kahane and Xinying Chen. 2019. Rediscovering Greenberg's Word Order Universals in UD. In UDW, Universal Dependencies Workshop 2019, Syntaxfest, pages 124–131. Association for Computational Linguistics. Paris, France.
This paper discusses an empirical refoundation of selected Greenbergian word order universals based on a data analysis of the Universal Dependencies project. The nature of the data we work on allows us to extract rich details for testing well-known typological universals and therefore constitutes a valuable basis for validating Greenberg's universals. Our results show that we can refine some Greenbergian universals in a more empirical and accurate way by means of a data-driven typological analysis.
Géraldine Walther and Benoît Sagot. 2019. Morphological complexities. In 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Florence, Italy.
Jack Bowers, Mohamed Khemakhem and Laurent Romary. 2019. TEI Encoding of a Classical Mixtec Dictionary Using GROBID- Dictionaries. In ELEX 2019: Smart Lexicography. Sintra, Portugal.
This paper presents the application of GROBID-Dictionaries (Khemakhem et al. 2017, Khemakhem et al. 2018a, Khemakhem et al. 2018b, Khemakhem et al. 2018c), an open-source machine learning system for automatically structuring print dictionaries in digital format into TEI (Text Encoding Initiative), to a historical lexical resource of Colonial Mixtec, 'Voces del Dzaha Dzahui', published by the Dominican fray Francisco Alvarado in 1593. GROBID-Dictionaries was applied to a reorganized and modernized version of the historical resource published by Jansen and Perez Jiménez (2009). The resulting TEI dictionary will be integrated into a language documentation project dealing with Mixtepec-Mixtec (ISO 639-3: mix) (Bowers & Romary, 2017, 2018a, 2018b), an under-resourced indigenous language native to the Juxtlahuaca district of Oaxaca, Mexico.
Marco Dinarelli and Loïc Grobol. 2019. Modèles neuronaux hybrides pour la modélisation de séquences : le meilleur de trois mondes. In TALN-RECITAL 2019 - 26ème Conférence sur le Traitement Automatique des Langues Naturelles. Toulouse, France.
We propose a neural architecture with the main characteristics of the most successful neural models of recent years: bidirectional RNNs, encoder-decoder, and the Transformer model. Evaluation on three sequence labelling tasks yields results that are close to the state of the art for all tasks and better for some of them, showing the pertinence of this hybrid architecture for this kind of task.
Loïc Grobol. 2019. Neural Coreference Resolution with Limited Lexical Context and Explicit Mention Detection for Oral French. In Second Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC19). Minneapolis, United States.
We propose an end-to-end coreference resolution system obtained by adapting neural models that have recently improved the state of the art on the OntoNotes benchmark to make them applicable to other paradigms for this task. We report the performance of our system on ANCOR, a corpus of transcribed oral French, for which it constitutes a new baseline with proper evaluation.
Benoît Sagot. 2019. Développement d'un lexique morphologique et syntaxique de l'ancien français. In 26ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN). Toulouse, France.
In this paper we describe our work on the development of a large-scale morphological and syntactic lexicon of Old French for natural language processing. We rely on dictionary and lexical resources, from which the extraction of structured and exploitable information required specific developments. In addition, matching information from these different sources posed difficulties. We provide quantitative information on the resulting lexicon, and discuss its reliability in its current version and the prospects for improvement allowed by the existence of a first version, in particular through the automatic analysis of textual data.
Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache. Cardiff, United Kingdom.
Common Crawl is a very large, heterogeneous multilingual corpus comprising documents crawled from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this metadata lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
Mathilde Regnault. 2019. Adaptation d'une métagrammaire du français contemporain au français médiéval. In TALN-RECITAL 2019 - 26e édition de la conférence TALN (Traitement Automatique des Langues Naturelles) et 21e édition de la conférence jeunes chercheur·euse·s RECITAL. Toulouse, France.
Medieval French is characterized by strong language variation. Our purpose is to extend a corpus of Old French annotated with dependency syntax with new texts of this period and to add texts in Middle French. In order to achieve this, we want to adapt existing tools instead of training a parser with annotated data. In this article, we present the state of the art for this project and our solution: adapting the French Metagrammar (FRMG) to earlier states of the language.
Ganesh Jawahar, Benoît Sagot and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy.
BERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. We first show that BERT's phrasal representation captures phrase-level information in the lower layers. We also show that BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. BERT turns out to require deeper layers when long-distance dependency information is required, e.g. to track subject-verb agreement. Finally, we show that BERT representations capture linguistic information in a compositional way that mimics classical, tree-like structures.
Laurent Romary, Mohamed Khemakhem, Fahad Khan, Jack Bowers, Nicoletta Calzolari, Monte George, Mandy Pet and Piotr Bański. 2019. LMF Reloaded. In AsiaLex 2019: Past, Present and Future. Istanbul, Turkey.
Lexical Markup Framework (LMF) or ISO 24613 [1] is a de jure standard that provides a framework for modelling and encoding lexical information in retrodigitised print dictionaries and NLP lexical databases. An in-depth review is currently underway within the standardisation subcommittee, ISO-TC37/SC4/WG4, to find a more modular, flexible and durable follow-up to the original LMF standard published in 2008. In this paper we will present some of the major improvements which have so far been implemented in the new version of LMF.
Anas Fahad Khan, Hervé Bohbot, Francesca Frontini, Mohamed Khemakhem and Laurent Romary. 2019. Historical Dictionaries as Digital Editions and Connected Graphs: the Example of Le Petit Larousse Illustré. In Digital Humanities 2019. Utrecht, Netherlands.
Marco Dinarelli and Loïc Grobol. 2019. Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling. In CICLing 2019 - 20th International Conference on Computational Linguistics and Intelligent Text Processing. La Rochelle, France.
During the last couple of years, Recurrent Neural Networks (RNN) have reached state-of-the-art performances on most of the sequence modelling problems. In particular, the sequence to sequence model and the neural CRF have proved to be very effective in this domain. In this article, we propose a new RNN architecture for sequence labelling, leveraging gated recurrent layers to take arbitrarily long contexts into account, and using two decoders operating forward and backward. We compare several variants of the proposed solution and their performances to the state-of-the-art. Most of our results are better than the state-of-the-art or very close to it and thanks to the use of recent technologies, our architecture can scale on corpora larger than those used in this work.
Jack Bowers and Laurent Romary. 2019. TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language. In 6th International Conference on Language Documentation and Conservation (ICLDC). Honolulu, United States.
Communications
Alix Chagué, Victoria Le Fourner, Manuela Martini and Éric Villemonte de La Clergerie. 2019. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? In Colloque DHNord 2019 « Corpus et archives numériques ». Lille, France.
Murielle Fabre, Yoann Dupont and Éric Villemonte de La Clergerie. 2019. Syntactic Parsing versus MWEs: What can fMRI signal tell us. In PARSEME-FR 2019 consortium meeting. Blois, France.
Yixuan Li, Kim Gerdes and Chuanming Dong. 2019. Character-level Annotation for Chinese Surface-Syntactic Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019), pages 216–226. Association for Computational Linguistics. Paris, France.
This paper presents a new schema to annotate Chinese treebanks at the character level. The original Universal Dependencies (UD) and Surface-Syntactic Universal Dependencies (SUD) projects provide token-level resources with rich morphosyntactic language details. However, without any commonly accepted word definition for Chinese, dependency parsing always faces the dilemma of word segmentation. We therefore present a character-level annotation schema integrated into the existing Universal Dependencies schema as an extension.
Xinying Chen and Kim Gerdes. 2019. The relation between dependency distance and frequency. In Quasy 2019, Quantitative Syntax 2019, Syntaxfest. Paris, France.
This pilot study investigates the relationship between dependency distance and frequency based on the analysis of an English dependency treebank. Preliminary results show that there is a non-linear relation between dependency distance and frequency. This relation can be formalized as a power-law function, which can be used to predict the distribution of dependency distance in a treebank.
José Carlos Rosales Nunez, Djamé Seddah and Guillaume Wisniewski. 2019. A Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content. In The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa'19). Turku, Finland.
This work compares the performance achieved by Phrase-Based Statistical Machine Translation systems (PBSMT) and attention-based Neural Machine Translation systems (NMT) when translating User Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.
Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary and Pedro Javier Ortiz Suárez. 2019. How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures. In 19th annual Conference and Members' Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond. Graz, Austria.
Ganesh Jawahar and Djamé Seddah. 2019. Contextualized Diachronic Word Representations. In 1st International Workshop on Computational Approaches to Historical Language Change 2019 (colocated with ACL 2019). Florence, Italy.
Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning across several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than what is usually the norm in diachronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, conditioned on contextual extra-linguistic (social) features such as network, spatial and socioeconomic variables, which are associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and therefore is a promising tool to study diachronic semantic changes over small time periods.
Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2019. Preparing the Dictionnaire Universel for Automatic Enrichment. In 10th International Conference on Historical Lexicography and Lexicology (ICHLL). Leeuwarden, Netherlands.
The Dictionnaire Universel (DU) is an encyclopaedic dictionary originally written by Antoine Furetière around 1676-78, later revised and improved by the Protestant jurist Henri Basnage de Beauval, who expanded and corrected it and included terms of arts, crafts and sciences. The aim of the BASNUM project is to digitize the DU in its second edition rewritten by Basnage de Beauval, to analyse it with computational methods in order to better assess the importance of this work for the evolution of sciences and mentalities in the 18th century, and to contribute to the contemporary movement for creating innovative and data-driven computational methods for text digitization, encoding and analysis. Based on the experience acquired within the research group, an enrichment workflow based upon a series of Natural Language Processing processes is being set up to be applied to Basnage's work. This includes, among others, automatic identification of the dictionary structure (macro-, meso- and microstructure), named-entity recognition (in particular persons and locations), classification of dictionary entries, detection and study of polysemy markers, tracking and classification of quotation use (bibliographic references), and scoring semantic similarity between the DU and other dictionaries. The main challenges are the lack of annotated data available for training machine learning models, decreased accuracy when using modern pre-trained models due to the differences between present-day and 18th-century French, and unreliable or low-quality OCRisation. The paper describes methods that are useful to tackle these issues in order to prepare the DU for automatic enrichment going beyond what currently available tools like Grobid-dictionaries can do, thanks to the advent of deep learning NLP models. The paper also describes how these methods could be applied to other dictionaries or even other types of ancient texts.
Sheena Bassett, Leon Wessels, Steven Krauwer, Bente Maegaard, Hella Hollander, Femmy Admiraal, Laurent Romary and Frank Uiterwaal. 2019. Connecting the Humanities through Research Infrastructures. In 4th Digital Humanities in the Nordic Countries (DHN 2019). Copenhagen, Denmark.
Several Research Infrastructures (RIs) exist in the Humanities and Social Sciences, some of which – such as CLARIN, DARIAH and CESSDA – address specific areas of interest, i.e. linguistic studies, digital humanities and social science data archives. RIs are also unique in their scope and application, largely tailored to their specific community needs. However, commonalities do exist and it is recognised that benefits are to be gained from these, such as efficient use of resources, enabling multi-disciplinary research and sharing good practices. As such, a bridging project, PARTHENOS, has worked closely with CLARIN and DARIAH as well as ARIADNE (archaeology), CENDARI (history), EHRI (holocaust studies) and E-RIHS (heritage science) to identify, develop and promote these commonalities. In this paper, we present some specific examples of cross-discipline and trans-border applications arising from joint RI collaboration, allowing for entirely new avenues of research.
Books
Kim Gerdes and Sylvain Kahane. 2019. Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).
Book chapters
Kim Gerdes, Sylvain Kahane, Rachel Bawden, Julie Beliao, Éric Villemonte de La Clergerie and Ilaine Wang. 2019. Annotation tools for syntax. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.
This chapter is devoted to the presentation of the tools and methods used for the different steps of the semi-automatic syntactic annotation: automatic preprocessing, microsyntactic parsing with the FRMG tool, correction of the parsing with the Arborator tool, agreement analysis, post-validation correction, and development of the final format of the Rhapsodie syntactic treebank. As FRMG is a parser for written French that was not configured to analyze disfluencies and reformulation, we used our manual pile marking to unfold the piles and produce a series of simplified “sentences” with only government relations. Despite having two annotators plus a validator for the corrections, we found a substantial number of errors in the post-validation procedure by using a set of rules to determine the well-formedness of the trees.
Sylvain Kahane, Paola Pietrandrea and Kim Gerdes. 2019. The annotation of list structures. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.
This chapter presents phenomena we call “piles” or “lists”, which are characterized by the fact that a list of elements piles up in the same syntactic position. We therefore group the analysis of coordination together with the analysis of other phenomena such as reformulation, disfluency, partial answer, or negotiation. The elements of a pile are linked to one another by a relation that is both syntagmatic (they follow one another) and paradigmatic (they fill the same syntactic slot with respect to their common governor). The syntactic analysis of the other elements – junctors, paradigmatic adverbs, and list completers – is discussed. We also propose a typology of the different cases of pile structure and introduce the seven subcases of paradigmatic links taken into account in the annotation.
Sylvain Kahane, Kim Gerdes and Rachel Bawden. 2019. The microsyntactic annotation. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.
This chapter describes the microsyntactic analysis of the Rhapsodie corpus in terms of dependency syntax. Microsyntax studies the relations between words that are characterized by a strong syntactic cohesion, traditionally called “government”. The different steps in the annotation are presented: segmentation into words and labeling in parts of speech, dependency structure, and basic syntactic functions. We justify our decision to use a small set of tags without redundancies, but to introduce a predicative relation for elements that form a complex predicate including copula, auxiliaries, and some modal verbs. Complex cases of annotations such as extraction (relative and interrogative clauses, cleft sentences) and negation are also presented. In addition, we show how a constituent structure can be computed from the dependency structure.
Laurent Romary and Jennifer Edmond. 2019. A Tangential View on Impact for the Arts and Humanities through the Lens of the DARIAH-ERIC. In Stay Tuned To The Future - Impact of the Research Infrastructures for Social Sciences and Humanities. Leo S. Olschki Editore.