Publications

Explore our publications on the HAL archive

2024

PhD theses and Habilitations

Tu Anh Nguyen. 2024. Spoken Language Modeling from Raw Audio. PhD thesis. Sorbonne Université.

Speech has always been a dominant mode of social connection and communication. However, speech processing and modeling have been challenging due to its variability. Classic speech technologies rely on cascade modeling, i.e. transcribing speech to text with an Automatic Speech Recognition (ASR) system, processing transcribed text using Natural Language Processing (NLP) methods, and converting text back to speech with a Speech Synthesis model. This method eliminates speech variability but requires large textual datasets, which are not available for all languages. In addition, it removes all the expressivity contained in the speech itself. Recent advancements in self-supervised speech learning (SpeechSSL) have enabled the learning of good discrete speech representations from raw audio, bridging the gap between speech and text technologies. This makes it possible to train language models on discrete representations (discrete units, or pseudo-text) obtained from speech, and has given rise to a new domain called TextlessNLP, where the task is to learn the language directly from audio signals, bypassing the need for ASR systems. The so-called Spoken Language Models (Speech Language Models, or SpeechLMs) have been shown to work and offer new possibilities for speech processing compared to cascade systems. The objective of this thesis is thus to explore and improve this newly formed domain. We analyze why these discrete representations work, discover new applications of SpeechLMs to spoken dialogues, extend TextlessNLP to more expressive speech, and improve the performance of SpeechLMs to reduce the gap between SpeechLMs and TextLMs.
Paul-Ambroise Duquenne. 2024. Sentence Embeddings for Massively Multilingual Speech and Text Processing. PhD thesis. Sorbonne Université.

Representation learning of sentences has been widely studied in NLP. While many works have explored different pre-training objectives to create contextual representations from sentences, several others have focused on learning sentence embeddings for multiple languages with the aim of closely encoding paraphrases and translations in the sentence embedding space. In this thesis, we first study how to extend text sentence embedding spaces to the speech modality in order to build a multilingual speech/text sentence embedding space. Next, we explore how to use this multilingual and multimodal sentence embedding space for large-scale speech mining. This allows us to automatically create alignments between written and spoken sentences in different languages. For high similarity thresholds in the latent space, aligned sentences can be considered as translations. If the alignments involve written sentences on one side and spoken sentences on the other, then these are potential speech-to-text translations. If the alignments involve spoken sentences on both sides, then these are potential speech-to-speech translations. To validate the quality of the mined data, we train speech-to-text translation models and speech-to-speech translation models. We show that adding the automatically mined data significantly improves the quality of the learned translation models, demonstrating the quality of the alignments and the usefulness of the mined data. Then, we study how to decode these sentence embeddings into text or speech in different languages. We explore several methods for training decoders and analyze their robustness to modalities/languages not seen during training, to evaluate cross-lingual and cross-modal transfers. We demonstrate that we can perform zero-shot cross-modal translation in this framework, achieving translation results close to systems learned in a supervised manner with a cross-attention mechanism. The compatibility between speech/text representations from different languages enables these very good performances, despite an intermediate fixed-size representation. Finally, we develop a new state-of-the-art massively multilingual speech/text sentence embedding space, named SONAR, based on conclusions drawn from the first two projects. We study different objective functions to learn such a space and we analyze their impact on the organization of the space as well as on the capabilities to decode these representations. We show that this sentence embedding space outperforms previous state-of-the-art methods for both cross-lingual and cross-modal similarity search as well as decoding capabilities. This new space covers 200 written languages and 37 spoken languages. It also offers text translation results close to the NLLB system on which it is based, and speech translation results competitive with the Whisper supervised system. We also present SONAR EXPRESSIVE, which introduces an additional representation encoding non-semantic speech properties, such as vocal style or expressivity of speech.
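As a rough illustration of threshold-based mining in a shared embedding space, the sketch below pairs each source sentence with its most similar target and keeps the pair only above a similarity threshold. Everything here is an assumption made for illustration: the function names, the toy 3-dimensional vectors, the threshold value, and the use of plain cosine similarity (the thesis may rely on a different scoring function and on real encoders).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(src, tgt, threshold):
    """Return (src_id, tgt_id, score) for pairs whose similarity meets the threshold."""
    pairs = []
    for s_id, s_vec in src.items():
        # Keep only the best-scoring target candidate for each source sentence.
        best_id, best_score = max(
            ((t_id, cosine(s_vec, t_vec)) for t_id, t_vec in tgt.items()),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((s_id, best_id, best_score))
    return pairs

# Hypothetical embeddings: written French on one side, spoken English on the other.
text_fr = {"fr_0": [0.9, 0.1, 0.0], "fr_1": [0.0, 0.2, 0.9]}
speech_en = {"en_a": [0.88, 0.12, 0.05], "en_b": [0.5, 0.5, 0.5]}

# A high threshold keeps only pairs that are likely translations.
mined = mine_pairs(text_fr, speech_en, threshold=0.95)
for src_id, tgt_id, score in mined:
    print(src_id, tgt_id, round(score, 3))
```

With text on one side and speech on the other, the retained pairs would be candidate speech-to-text translations; with speech on both sides, candidate speech-to-speech translations.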

Journal articles

Lucie Chenain, Rachid Riad, Nicolas Fraisse, Cécilia Jubin, Graça Morgado, Katia Youssov, Marine Lunven and Anne-Catherine Bachoud-Levi. 2024. Graph methods to infer spatial disturbances: Application to Huntington's Disease's speech. Cortex 176 pages 144 – 160. Elsevier.

Objective: Huntington's Disease (HD) is an inherited neurodegenerative disease caused by the mutation of the Htt gene, impacting all aspects of living and functioning. Among cognitive disabilities, spatial capacities are impaired, but their monitoring remains scarce as it is limited by lengthy expert assessments. Language offers an alternative medium to evaluate patients' performance in HD. Yet, its capacity to assess spatial abilities in HD is unknown. Here, we aimed to provide proof-of-concept that HD's spatial deficits can be assessed through speech. Methods: We developed the Spatial Description Model to graphically represent spatial relations described during the Cookie Theft Picture (CTP) task. We increased the sensitivity of our model by using only sentences with spatial terms, unlike previous studies in Alzheimer's disease. 78 carriers of the mutant Htt, including 56 manifest and 22 premanifest individuals, as well as 25 healthy controls were included from the BIOHD (NCT01412125) and Repair-HD (NCT03119246) cohorts. The convergence and divergence of the model were validated using the SelfCog battery. Results: Our Spatial Description Model was the only one among the four assessed approaches to reveal that individuals with manifest HD expressed fewer spatial relations and engaged in less spatial exploration compared to healthy controls. Their graphs correlated with both visuospatial and language SelfCog performances, but not with motor, executive or memory functions. Conclusions: We provide the proof-of-concept using our Spatial Description Model that language can capture HD patients' spatial disturbances. By adding spatial capabilities to the panel of functions tested through language, it paves the way for eventual remote clinical application.
Cyril Chhun, Fabian M. Suchanek and Chloé Clavel. 2024. Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation. Transactions of the Association for Computational Linguistics 12 pages 1122–1142. MIT Press. Cambridge, MA.

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.
Aina Garí Soler, Matthieu Labeau and Chloé Clavel. 2024. The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations. Transactions of the Association for Computational Linguistics 12 pages 299–320. MIT Press. Cambridge, MA.

When deriving contextualized word representations from language models, a decision needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other interesting findings, that the quality of representations of words that are split is often, but not always, worse than that of the embeddings of known words. Their similarity values, however, must be interpreted with caution.
Ana Salgado, Laurent Romary, Rute Costa, Toma Tasovac, Anas Fahad Khan, Margarida Ramos, Bruno Almeida, Sara Carvalho, Mohamed Khemakhem, Raquel Silva and Boris Lehečka. 2024. The Morais Dictionary: Following Best Practices in a Retro-digitized Dictionary Project. International Journal of Humanities and Arts Computing 18 pages 125 – 147. Edinburgh University Press.

This article outlines essential best practices for retro-digitized dictionary projects, using the ongoing MORDigital project (DOI 10.54499/PTDC/LLT-LIN/6841/2020) as a case study. The MORDigital project focuses on digitally transforming the historically significant Portuguese Morais dictionary’s first three editions (1789, 1813, 1823). While the primary objective is to create faithful digital versions of these renowned dictionaries, MORDigital stands out by going beyond the mere adoption of established best practices. Instead, it reflects on the choices made throughout the process, providing insights into the decision-making process. The key topics emphasized include (1) the establishment of a robust data model; (2) the refinement of metadata; (3) the implementation of consistent identifiers; and (4) the enhancement of encoding techniques; additionally exploring the issue of structuring domain labelling. The article aims to contribute to the ongoing discourse on best practices in retro-digitized dictionary projects and their implications for data preservation and knowledge organization.

Conference proceedings

Armel Zebaze, Benoît Sagot and Rachel Bawden. 2024. Tree of Problems: Improving structured problem solving with compositionality. In EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing pages 18028–18047. Miami, FL, United States.

Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems (ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks. All code for this paper is publicly available here: https://github.com/ArmelRandy/tree-of-problems.
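The idea of dividing a task into identical subtasks and recombining the answers can be sketched on a toy problem (concatenating the last letters of a list of words). This is only an illustrative assumption of the general divide, solve, merge pattern, not the authors' implementation: `solve_leaf` and `merge` are hypothetical stand-ins for what would be LLM calls, and the splitting rule is invented here.

```python
def solve_leaf(words):
    # Hypothetical stand-in for an LLM call on a small instance
    # that is assumed to be solvable directly.
    return "".join(word[-1] for word in words)

def merge(left, right):
    # Hypothetical stand-in for the step that combines two sub-answers.
    return left + right

def solve_tree(words, leaf_size=2):
    """Recursively split the instance into identical subproblems, then merge."""
    if len(words) <= leaf_size:
        return solve_leaf(words)
    mid = len(words) // 2
    left = solve_tree(words[:mid], leaf_size)
    right = solve_tree(words[mid:], leaf_size)
    return merge(left, right)

answer = solve_tree(["tree", "of", "problems", "works", "well"])
print(answer)
```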
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, Mariya Shmatova, Steinþór Steingrímsson and Vilém Zouhar. 2024. Findings of the WMT24 General Machine Translation Shared Task: The LLM Era is Here but MT is Not Solved Yet. In WMT 2024 - Ninth Conference on Machine Translation. pages 1–46. Miami, Florida, United States.

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
Sarah Bénière, Hugo Scheithauer, Juliette Janes and Laurent Romary. 2024. An ODD Schema for a Sustainable Encoding of Catalog Objects. In TEI 2024–Texts, Languages and Communities. Buenos Aires, Argentina.

Sales catalogs are a valuable resource for art historians as they bear witness to the circulation of works of art. The organization of information within the catalogs is consistent and structured, which makes them interesting material for automatic processing tasks. This long paper proposal presents our reflection on how to structure the content of sales catalogs in TEI-XML. This work is part of a wider reflection within the framework of the DataCatalogue research project (Inria, BnF, INHA) on an automated workflow for processing sales catalogs from digitization to publication.
Mariana Neves, Cristian Grozea, Philippe Thomas, Roland Roller, Rachel Bawden, Aurélie Névéol, Steffen Castle, Vanessa Bonato, Giorgio Maria Di Nunzio, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova and Antonio Jimeno Yepes. 2024. Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level. In Proceedings of the Ninth Conference on Machine Translation. pages 124–138. Association for Computational Linguistics. Miami, Florida, USA.

We present the results of the ninth edition of the Biomedical Translation Task at WMT'24. We released test sets for six language pairs, namely, French, German, Italian, Portuguese, Russian, and Spanish, from and into English. Each test set consists of 50 abstracts from PubMed. Unlike in previous years, we did not split abstracts into sentences. We received submissions from five teams, covering almost all language directions. We used a baseline/comparison system based on Llama 3.1 and share the source code at https://github.com/cgrozea/wmt24biomed-ref.
Karim El Haff, Wissam Antoun, Agnès Braud, Florence Le Ber and Véronique Pitchon. 2024. Building and Assessing a Named Entity Recognition Resource for Ancient Pharmacopeias. In ECAI 2024. 392 pages 2354–2361. IOS Press. Santiago de Compostela, Spain.

This research revolves around utilising Named Entity Recognition (NER) to analyse and categorise data from English translations of pharmacopeias from the Abbasid era, noted for its valuable contributions to science and medicine. The main goal of this work, along with publishing this resource freely, is to assess cross-manuscript NER performance by evaluating the NER model's performance on unseen corpora and translation styles, as well as demonstrating the transferability of the NER task on such corpora. Two distinct experiments were conducted, focusing on F1-score differences from mixing source translators and varying training dataset sizes. In experiments involving mixing translator styles, training on a mix of all available styles while accounting for dataset size yielded better F1-scores than even training on the same style as the testing data, while experiments with dataset sizes show diminishing returns from scaling training datasets compared to varying translation styles. This work attempts to enhance the exploration of the medical knowledge embodied in these texts to facilitate their analysis for knowledge extraction relevant to modern medical practices. Furthermore, this research demonstrates strategies to optimise NER results in this context, forming a juncture between digitising historical information and enabling further explorations in pharmacopeia-related Natural Language Processing research.
Anh Ngo, Dirk Heylen, Nicolas Rollet, Catherine Pelachaud and Chloé Clavel. 2024. Exploration of Human Repair Initiation in Task-oriented Dialogue : A Linguistic Feature-based Approach. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue. pages 603–609. Kyoto, Japan.

In daily conversations, people often encounter problems prompting conversational repair to enhance mutual understanding. By employing an automatic coreference solver, alongside examining repetition, we identify various linguistic features that distinguish turns when the addressee initiates repair from those when they do not. Our findings reveal distinct patterns that characterize the repair sequence and each type of other-repair initiation.
Hugo Scheithauer and Laurent Romary. 2024. Experimenting With Generic Recognition Systems for Kuzushiji Documents: Furigana Extraction as a Use-Case. In JADH2024 - 13th Conference of Japanese Association for Digital Humanities «Leveraging AI and Digital Humanities for Sustainable Infrastructure». Tokyo, Japan.

Simon Gabay and Thibault Clérice. 2024. The birth of French orthography. A computational analysis of French spelling systems in diachrony. In CHR2024–Computational Humanities Research Conference. Aarhus, Denmark.

The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in linguistics for two reasons. On the one hand, spelling is made up of micro-changes, which require a quantitative approach, and on the other hand, no corpus is available due to the interventions of editors in almost all the texts already available. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and analyse their frequency to trace (ortho)graphic change during the 17th century.
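The comparison of an OCR transcription with its modernised counterpart can be approximated with a token-level diff. The sketch below is an assumption for illustration, not the authors' tooling: it uses Python's standard `difflib`, whitespace tokenisation, and an invented example sentence, and it returns the spans where the two versions disagree.

```python
import difflib

def variant_zones(original, normalized):
    """Return (original_span, normalized_span) for every zone where the texts differ."""
    orig_tokens = original.split()
    norm_tokens = normalized.split()
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=norm_tokens, autojunk=False)
    zones = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' blocks are identical in both versions; everything else is a variant.
        if tag != "equal":
            zones.append((" ".join(orig_tokens[i1:i2]), " ".join(norm_tokens[j1:j2])))
    return zones

# Invented example: historical spelling vs. contemporary spelling.
old = "il estoit vn roy fort sçavant"
new = "il était un roi fort savant"

zones = variant_zones(old, new)
for old_span, new_span in zones:
    print(f"{old_span!r} -> {new_span!r}")
```

Counting how often each kind of zone occurs per decade would then give a simple frequency profile; a finer analysis would likely need character-level alignment inside each zone.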
Benjamin Kiessling and Thibault Clérice. 2024. Does Context Matter ? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts. In CHR2024–Computational Humanities Research Conference. Aarhus, Denmark.

The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections. In 2024, the CATMuS Medieval dataset was released, featuring extensive diachronic coverage and a variety of languages and script types. Previous research indicated that model performance degraded on the best manuscripts over time as more data was incorporated, likely due to over-generalization. This paper investigates the impact of incorporating contextual metadata in training HTR models using the CATMuS Medieval dataset to mitigate this effect. Our experiments compare the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models utilizing semantic contextual tokens (Century, Script, Language) outperform baseline models, particularly on challenging manuscripts. The study underscores the importance of metadata in enhancing model accuracy and robustness across diverse historical texts.
Rachel Bawden, Hatim Bourfoune, Bertrand Cabot, Nathan Cassereau, Pierre Cornette, Marco Naguib and François Yvon. 2024. Evaluer BLOOM en français. In Proceedings of the 2024 Atelier sur l'évaluation des modèles génératifs (LLM) et challenge d'extraction d'information few-shot. Toulouse, France.

The development of very large language models, capable of performing multiple tasks, requires developing the infrastructure needed to evaluate these models, ideally covering as many facets as possible. Numerous benchmarks have already been compiled for English, making it possible to precisely gauge their ability to process this language. In this paper, we present our own efforts to assemble a multi-task evaluation set for French, which is then used to evaluate models from the Bloom family. Our results complement the main evaluation results for Bloom in English; they suggest that the performance obtained in French and English is very similar, and even better when the prompts used for in-context inference are in the same language as the texts to analyze.
Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel and Fabian Suchanek. 2024. MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 4810–4845. Association for Computational Linguistics. Mexico City, Mexico.

We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Suppa, Hila Gonen, Joseph Marvin Imperial, Börje Karlsson, Peiqin Lin, Nikola Ljubešić, Lester James Miranda, Barbara Plank, Arij Riabi and Yuval Pinter. 2024. Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 4322–4337. Association for Computational Linguistics. Mexico City, Mexico.

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 19 datasets annotated with named entities in a cross-lingually consistent schema across 13 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.
Anh Ngo, Chloé Clavel, Catherine Pelachaud and Nicolas Rollet. 2024. Multimodal models of repair in social human-agent interactions. In Proceedings of the 2024 Workshop Affect, Compagnons Artificiels et Interactions (WACAI 2024). Bordeaux, France.

People often encounter troubles in everyday conversations, prompting them to initiate repairs, which are various approaches employed to recognize and resolve those problems, fostering mutual understanding across conversational turns. However, maintaining a smooth interaction remains challenging for Conversational Agents (CAs), which are dialogue systems designed to simulate conversation with humans (including chatbots, social robots, and virtual assistants). To foster seamless human-agent interaction, the CA should be able to recognize repairs initiated by humans, utilize multimodal cues, and participate in the repair process. This article, which is an overview of our thesis research project, outlines our ongoing efforts to accomplish this objective. The initial phase involves analyzing repair phenomena in human-human interactions.
Nathaniel Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Stutzman, Bismarck Odoom, Sanjeev Khudanpur, Stephen Richardson and Kenton Murray. 2024. Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 3083–3110. Association for Computational Linguistics. Mexico City, Mexico.

A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.
Lauriane Aufrant and Lucie Chasseur. 2024. UkraiNER: A New Corpus and Annotation Scheme towards Comprehensive Entity Recognition. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 16941–16952. ELRA and ICCL. Torino, Italia.

Named entity recognition as it is traditionally envisioned excludes in practice a significant part of the entities of potential interest for real-world applications: nested, discontinuous, non-named entities. Despite various attempts to broaden their coverage, subsequent annotation schemes have achieved little adoption in the literature and the most restrictive variant of NER remains the default. This is partly due to the complexity of those annotations and their format. In this paper, we introduce a new annotation scheme that offers higher comprehensiveness while preserving simplicity, together with an annotation tool to implement that scheme. We also release the corpus UkraiNER, comprising 10,000 French sentences in the geopolitical news domain, manually annotated with comprehensive entity recognition. Our baseline experiments on UkraiNER provide a first point of comparison to facilitate future research (82 F1 for comprehensive entity recognition, 87 F1 when focusing on traditional nested NER), as well as various insights on the composition and challenges that this corpus presents for state-of-the-art named entity recognition models.
Simon Meoni, Éric De la Clergerie and Théo Ryffel. 2024. Generating Synthetic Documents with Clinical Keywords: A Privacy-Sensitive Methodology. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024. pages 115–123. ELRA and ICCL. Torino, Italia.

Electronic Health Records (EHR) store valuable patient-staff interaction data. These notes, often unstructured to save healthcare personnel time, can be challenging to analyze manually. Proprietary online LLMs have demonstrated impressive results in analyzing EHR notes. However, Clinical NLP faces unique challenges due to the sensitive and specialized nature of the data. Sending patient information via external APIs poses privacy risks, and hospitals require customized NLP systems to align with their practices. Developing customized LLMs using specific training datasets is crucial to address these challenges. We propose generating synthetic training data using keywords extracted without confidential information. Furthermore, we introduce a reward mechanism that iteratively refines the quality of synthetic documents. This involves scoring synthetic candidates against real clinical reports using a semantic textual similarity score and performing an alignment step to align the model with its best-scored utterances.
Arij Riabi, Menel Mahamdi, Virginie Mouilleron and Djamé Seddah. 2024. Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks. In Proceedings of the Fifth Workshop on Privacy in Natural Language Processing. pages 123–136. Association for Computational Linguistics. Bangkok, Thailand.

Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.
Léo Labat and Lauriane Aufrant. 2024. Évaluation de l'apport des chaînes de coréférences pour le liage d'entités. In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position. pages 397–409. ATALA and AFPC. Toulouse, France.

This work revisits entity linking approaches in light of the closely related task of coreference resolution. We observe several configurations (supported by examples) in which the rest of the coreference chain can provide useful clues to improve disambiguation. Guided by these theoretical motivations, we conduct an error analysis accompanied by oracle experiments, which confirm the potential of strategies that combine predictions within the coreference chain (up to 4.3 F1 on coreferent mentions in English). We then sketch a first proof of concept of combination by voting, exploring different weighting heuristics, which brings modest but interpretable gains.
Ziqian Peng, Rachel Bawden and François Yvon. 2024. À propos des difficultés à traduire automatiquement de longs documents. In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position. pages 2–21. ATALA and AFPC. Toulouse, France.

New machine translation architectures are able to process long segments and to outperform the translation of isolated sentences, suggesting that translating complete documents may be within reach. To achieve this, a number of difficulties related to the length of the documents to be translated must be overcome. In this study, we discuss document translation from the perspective of evaluation, attempting to answer a simple question: how can we measure whether translation performance degrades with document length? Our analyses, which evaluate encoder-decoder systems and a large language model using several metrics on a scientific document translation task, suggest that translating long documents in one piece remains a difficult problem.
You Zuo, Kim Gerdes, Éric Clergerie and Benoît Sagot. 2024. PatentEval: Understanding Errors in Patent Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 2687–2710. Association for Computational Linguistics. Mexico City, Mexico.

In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis and Chloé Clavel. 2024. The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text. In Findings of the Association for Computational Linguistics: NAACL 2024. pages 3589–3604. Association for Computational Linguistics. Mexico City, Mexico.

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
Nathan Godey, Éric Clergerie and Benoît Sagot. 2024. Anisotropy Is Inherent to Self-Attention in Transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). pages 35–48. Association for Computational Linguistics. St. Julian's, Malta.

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.
Jesujoba O. Alabi and Rachel Bawden. 2024. Exploring Inline Lexicon Injection for Cross-Domain Transfer in Neural Machine Translation. In Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation. pages 7–20. European Association for Machine Translation. Sheffield, United Kingdom.

Domain transfer remains a challenge in machine translation (MT), particularly concerning rare or unseen words. Amongst the strategies proposed to address the issue, one of the simplest and most promising in terms of generalisation capacity is coupling the MT system with external resources such as bilingual lexicons and appending inline annotations within source sentences. This method has been shown to work well for controlled language settings, but its usability for general language (and ambiguous) MT is less certain. In this article we explore this question further, testing the strategy in a multi-domain transfer setting for German-to-English MT, using the mT5 language model fine-tuned on parallel data. We analyse the MT outputs and design evaluation strategies to understand the behaviour of such models. Our analysis using distractor annotations suggests that although improvements are not systematic according to automatic metrics, the model does learn to select appropriate translation candidates and ignore irrelevant ones, thereby exhibiting more than a systematic copying behaviour. However, we also find that the method is less successful in a higher-resource setting with a larger lexicon, suggesting that it is not a magic solution, especially when the baseline model is already exposed to a wide range of vocabulary.
Rachel Bawden, Ziqian Peng, Maud Bénard, Éric Clergerie, Raphaël Esamotunu, Mathilde Huguin, Natalie Kübler, Alexandra Mestivier, Mona Michelot, Laurent Romary, Lichao Zhu and François Yvon. 2024. Translate your Own: a Post-Editing Experiment in the NLP domain. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1). pages 431–443. European Association for Machine Translation (EAMT). Sheffield, UK.

The improvements in neural machine translation make translation and post-editing pipelines ever more effective for a wider range of applications. In this paper, we evaluate the effectiveness of such a pipeline for the translation of scientific documents (limited here to article abstracts). Using a dedicated interface, we collect, then analyse the post-edits of approximately 350 abstracts (English→French) in the Natural Language Processing domain for two groups of post-editors: domain experts (academics encouraged to post-edit their own articles) on the one hand and trained translators on the other. Our results confirm that such pipelines can be effective, at least for high-resource language pairs. They also highlight the difference in the post-editing strategy of the two subgroups. Finally, they suggest that working on term translation is the most pressing issue to improve fully automatic translations, but that in a post-editing setup, other error types can be equally annoying for post-editors.
Maxime Guénette, Mathilde Verstraete, Marcello Vitali-Rosati and Alix Chagué. 2024. Transcrire un manuscrit en grec ancien. In Humanistica 2024. Meknès, Morocco.

This contribution presents the results of our experiments in training an automatic transcription (HTR) model for Ancient Greek, using a training corpus built from the Heidelbergensis Palatinus graecus 23 and the eScriptorium/Kraken software environment. This Byzantine manuscript, dating from the end of the 10th century, is a key witness for Greek epigrammatic poetry, as it is the main source transmitting the Palatine Anthology. Its clear structure and careful handwriting make it an ideal candidate for training a model for Ancient Greek.
Simon Gabay, Thibault Clérice, Pauline Jacsont, Elina Leblanc, Marie Jeannot-Tirole, Sonia Solfrini, Sophie Dolto, Floriane Goy, Carmen Carrasco Luján, Maddalena Zaglio, Myriam Perregaux, Juliette Janes, Benoît Sagot, Rachel Bawden, Rasul Dent, Oriane Nédey and Alix Chagué. 2024. Reconnaissance des écritures dans les imprimés. In Humanistica 2024. Meknès, Morocco.

Optical character recognition (OCR) has achieved significant success on handwritten documents and early printed books in recent years, but this type of document remains marginal in the textual production available today. In order to offer researchers high-performing models covering a wider range of cases, we have designed a new generalist model, able to handle printed documents as well as possible, early and contemporary alike, written in a variety of languages. Several architectures are evaluated in order to compare their respective efficiency in terms of character error rate, but also of inference time.
Nathan Godey, Éric de la Clergerie and Benoît Sagot. 2024. On the Scaling Laws of Geographical Representation in Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 12416–12422. ELRA and ICCL. Torino, Italia.

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
Maria Dermentzi and Hugo Scheithauer. 2024. Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. In Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024. pages 18–28. ELRA and ICCL. Torino, Italia.

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.
Sarah Bénière, Floriane Chiffoleau and Laurent Romary. 2024. TEI Specifications for a Sustainable Management of Digitized Holocaust Testimonies. In Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024. pages 10–17. ELRA and ICCL. Torino, Italia.

Data modeling and standardization are central issues in the field of Digital Humanities, and all the more so when dealing with Holocaust testimonies, where stable preservation and long-term accessibility are key. The EHRI Online Editions are composed of documents of diverse nature (testimonies, letters, diplomatic reports, etc.), held by EHRI’s partnering institutions, and selected, gathered thematically and encoded according to the TEI Guidelines by the editors within the EHRI Consortium. Standardization is essential in order to make sure that the editions are consistent with one another. The issue of consistency also encourages a broader reflection on the usage of standards when processing data, and on the standardization of digital scholarly editions of textual documents in general. In this paper, we present the normalization work we carried out on the EHRI Online Editions. It includes a customization of the TEI adapted to Holocaust-related documents, and a focus on the implementation of controlled vocabulary. We recommend the use of these encoding specifications as a tool for researchers and/or non-TEI experts to ensure their encoding is valid and consistent across editions, but also as a mechanism for integrating the edition work smoothly within a wider workflow leading from image digitization to publication.
Fahad Khan, Maxim Ionov, Christian Chiarcos, Laurent Romary, Gilles Sérasset and Besim Kabashi. 2024. On Modelling Corpus Citations in Computational Lexical Resources. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 12385–12394. ELRA and ICCL. Torino, Italia.

In this article we look at how two different standards for lexical resources, TEI and OntoLex, deal with corpus citations in lexicons. We will focus on how corpus citations in retrodigitised dictionaries can be modelled using each of the two standards since this provides us with a suitably challenging use case. After looking at the structure of an example entry from a legacy dictionary, we examine the two approaches offered by the two different standards by outlining an encoding for the example entry using both of them (note that this article features the first extended discussion of how the Frequency Attestation and Corpus (FrAC) module of OntoLex deals with citations). After comparing the two approaches and looking at the advantages and disadvantages of both, we argue for a combination of both. In the last part of the article we discuss different ways of doing this, giving our preference for a strategy which makes use of RDFa.
Rian Touchent and Éric de la Clergerie. 2024. CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 2692–2701. ELRA and ICCL. Torino, Italia.

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However these documents are unstructured and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.
Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot and Rachel Bawden. 2024. When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 17544–17556. ELRA and ICCL. Torino, Italia.

Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare approach performances by resource range, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.
Lydia Nishimwe, Benoît Sagot and Rachel Bawden. 2024. Making Sentence Embeddings Robust to User-Generated Content. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 10984–10998. ELRA and ICCL. Torino, Italia.

NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data by achieving up to 2x and 11x better scores. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
Biswesh Mohapatra, Seemab Hassan, Laurent Romary and Justine Cassell. 2024. Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 3967–3977. ELRA and ICCL. Torino, Italia.

Successful conversations often rest on common understanding, where all parties are on the same page about the information being shared. This process, known as conversational grounding, is crucial for building trustworthy dialog systems that can accurately keep track of and recall the shared information. The proficiencies of an agent in grounding the conveyed information significantly contribute to building a reliable dialog system. Despite recent advancements in dialog systems, there exists a noticeable deficit in their grounding capabilities. Traum provided a framework for conversational grounding introducing Grounding Acts and Grounding Units, but substantial progress, especially in the realm of Large Language Models, remains lacking. To bridge this gap, we present the annotation of two dialog corpora employing Grounding Acts, Grounding Units, and a measure of their degree of grounding. We discuss our key findings during the annotation and also provide a baseline model to test the performance of current Language Models in categorizing the grounding acts of the dialogs. Our work aims to provide a useful resource for further research in making conversations with machines better understood and more reliable in natural day-to-day collaborative dialogs.
Seth Aycock and Rachel Bawden. 2024. Topic-guided Example Selection for Domain Adaptation in LLM-based Machine Translation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. pages 175–195. Association for Computational Linguistics. St. Julian's, Malta.

Current machine translation (MT) systems perform well in the domains on which they were trained, but adaptation to unseen domains remains a challenge. Rather than fine-tuning on domain data or modifying the architecture for training, an alternative approach exploits large language models (LLMs), which are performant across NLP tasks especially when presented with in-context examples. We focus on adapting a pre-trained LLM to a domain at inference through in-context example selection. For MT, examples are usually randomly selected from a development set. Some more recent methods though select using the more intuitive basis of test source similarity. We employ topic models to select examples based on abstract semantic relationships below the level of a domain. We test the relevance of these statistical models and use them to select informative examples even for out-of-domain inputs, experimenting on 7 diverse domains and 11 language pairs of differing resourcedness. Our method outperforms baselines on challenging multilingual out-of-domain tests, though it does not match performance with strong baselines for the in-language setting. We find that adding few-shot examples and related keywords consistently improves translation quality, that example diversity must be balanced with source similarity, and that our pipeline is overly restrictive for example selection when a targeted development set is available.
Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, Matthias Gille-Levenson, Olivier Brisville-Fertin, Franz Fischer, Michaels Gervers, Agnès Boutreux, Avery Manton, Simon Gabay, Patricia O'Connor, Wouter Haverals, Mike Kestemont, Caroline Vandyck and Benjamin Kiessling. 2024. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. In Proceedings of the 2024 International Conference on Document Analysis and Recognition (ICDAR). Athens, Greece.

The surge in digitisation initiatives by Cultural Heritage institutions has facilitated online accessibility to numerous historical manuscripts. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging. We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for medieval manuscripts, which offers (1) a uniform framework for annotation practices for medieval manuscripts, a benchmarking environment (2) for evaluating automatic text recognition models across multiple dimensions thanks to rich metadata (century of production, language, genre, script, etc.), (3) for other tasks (such as script classification or dating approaches), (4) and finally for exploratory work pertaining to computer vision and digital paleography around line-based tasks, such as generative approaches. Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible dataset spanning more than 200 manuscripts and incunabula in 10 different languages, comprising over 160,000 lines of text and 5 million characters spanning from the 8th century to the 16th. The dataset's consistency in transcription approaches aims to mitigate challenges arising from the diversity in standards for medieval manuscript transcriptions, providing a comprehensive benchmark for evaluating HTR models on historical sources.
Wissam Antoun, Benoît Sagot and Djamé Seddah. 2024. From Text to Source: Results in Detecting Large Language Model-Generated Content. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 7531–7543. ELRA and ICCL. Torino, Italia.

The widespread use of Large Language Models (LLMs), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. Addressing these concerns necessitates the development of robust methods to detect and attribute text generated by LLMs. This paper investigates "Cross-Model Detection," by evaluating whether a classifier trained to distinguish between source LLM-generated and human-written text can also detect text from a target LLM without further training. The study comprehensively explores various LLM sizes and families, and assesses the impact of conversational fine-tuning techniques, quantization, and watermarking on classifier generalization. The research also explores Model Attribution, encompassing source model identification, model family, and model size classification, in addition to quantization and watermarking detection. Our results reveal several key findings: a clear inverse relationship between classifier effectiveness and model size, with larger LLMs being more challenging to detect, especially when the classifier is trained on data from smaller models. Training on data from similarly sized LLMs can improve detection performance from larger models but may lead to decreased performance when dealing with smaller models. Additionally, model attribution experiments show promising results in identifying source models and model families, highlighting detectable signatures in LLM-generated text, with particularly remarkable outcomes in watermarking detection, while no detectable signatures of quantization were observed. Overall, our study contributes valuable insights into the interplay of model size, family, and training data in LLM detection and attribution.
Thibault Clérice. 2024. Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 4772–4783. ELRA and ICCL. Torino, Italia.

In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of the dataset size on the model performances (420 instead of 2013), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given the result, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.

Communications

Lorraine Vanel, Ariel R. Ramos Vela, Alya Yacoubi and Chloé Clavel. 2024. Socio-Emotional Response Generation: A Human Evaluation Protocol for LLM-Based Conversational Systems. In AHRI 2024 : The 3rd Workshop on Affective Human-Robot Interaction at ACII 2024. Glasgow, United Kingdom.

Conversational systems are now capable of producing impressive and generally relevant responses. However, we have no visibility nor control of the socio-emotional strategies behind state-of-the-art Large Language Models (LLMs), which poses a problem in terms of their transparency and thus their trustworthiness for critical applications. Another issue is that current automated metrics are not able to properly evaluate the quality of generated responses beyond the dataset's ground truth. In this paper, we propose a neural architecture that includes an intermediate step in planning socio-emotional strategies before response generation. We compare the performance of open-source baseline LLMs to the outputs of these same models augmented with our planning module. We also contrast the outputs obtained from automated metrics and evaluation results provided by human annotators. We describe a novel evaluation protocol that includes a coarse-grained consistency evaluation, as well as a finer-grained annotation of the responses on various social and emotional criteria. Our study shows that predicting a sequence of expected strategy labels and using this sequence to generate a response yields better results than a direct end-to-end generation scheme. It also highlights the divergences and the limits of current evaluation metrics for generated content. The code for the annotation platform and the annotated data are made publicly available for the evaluation of future models.
Simon Gabay, Ariane Pinche, Peter Nahon, Alix Chagué, Pauline Jacsont, Élodie Paupe, Jean-Claude Rebetez, Maxime Humeau, Christine Payot, Thibault Maillard, Yvan Jauregui, Elina Leblanc and Loraine Chappuis. 2024. Vers un modèle diachronique pour les mains modernes françaises. In Humanistica 2024 - Colloque annuel de l'Association francophone des humanités numériques. Meknès, Morocco.

In the French-speaking domain, manuscripts written after the Middle Ages remain the last type of document that is not properly handled by optical character recognition engines. Although models have already been published, their effectiveness and documentation remain unsatisfactory, largely because of the problems posed by the considerable graphic evolution (in both the palaeographic and the linguistic sense) that the language has undergone over the centuries, and hence the diversity of forms to be processed. After a brief description of the philological problem, we therefore offer some initial reflections on the transcription of modern documents, as well as a new model to improve researchers' working conditions until a truly satisfactory solution can be designed.
Alix Chagué. 2024. McCATMuS : retours sur la production d'un méta-dataset multilingue et multiséculaire. In Le patrimoine archivistique face au virage numérique. Rimouski, Canada.

Alix Chagué. 2024. FAIRer transcriptions: HTR-United and the possibility of a common for training data. In Horizons of digital philology. Naples, Italy.

Alix Chagué. 2024. Initiation to Handwritten Text Recognition with eScriptorium. In Horizons of digital philology. Naples, Italy.

Alix Chagué and Hugo Scheithauer. 2024. Do (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts. In CSDH/SCHN 2024: Sustaining Shared Futures. Montréal, Canada.

We present an experiment conducted on the augmentation of older grayscale datasets designed for automatic text recognition on contemporary handwriting (IAM-Database). The augmentation method relies on the addition of colored backgrounds taken from real-world historical blank pages and allows us to create an enhanced version of IAM-Database. We train various transcription models, varying the composition of the training and validation sets using the original and enhanced IAM-Database. We test the resulting models against the original and enhanced test sets, as well as a test set composed of real-world historical documents. We find that, although the transcription engine proves robust to color changes, this technique could be used to bring older grayscale datasets up to speed in order to create transcription models that are efficient on historical handwriting. Additionally, we consider the environmental costs of using enhanced data as opposed to the original dataset, and find that the impact is minor.
Sarah Bénière, Floriane Chiffoleau and Hugo Scheithauer. 2024. Streamlining the Creation of Holocaust-related Digital Editions with Automatic Tools. In EHRI Academic Conference - Researching the Holocaust in the Digital Age. Warsaw, Poland.

Alix Chagué, Floriane Chiffoleau and Hugo Scheithauer. 2024. Collaboration and Transparency: A User-Generated Documentation for eScriptorium. In DH2024 Reinvention & Responsibility. Washington D. C., United States.

Floriane Chiffoleau and Hugo Scheithauer. 2024. Leveraging EHRI Online Editions for training automated edition tools. In EHRI Workshop Natural Language Processing Meets Holocaust Archives. Prague, Czech Republic.

Thibault Clérice and Malamatenia Vlachou-Efstathiou. 2024. The CATMuS initiative: building large and diverse corpora for handwritten text recognition. In DH AI Seminar 2024 - Digital Humanities / Artificial Intelligence. Paris, France.

Thibault Clérice. 2024. Distributed Texts Services: Présentation. In Journées Biblissima+: Partager, décloisonner, réutiliser : outiller la recherche et développer de nouveaux usages. Aubervilliers, France.

Hugo Scheithauer, Sarah Bénière and Laurent Romary. 2024. Automatic retro-structuration of auction sales catalogs layout and content. In DH2024 - Reinvention and Responsibility. Washington DC, United States.

This paper showcases a pipeline for automatically retro-structuring auction sales catalogs, based on document layout analysis and information extraction technologies. Structured layout and textual data are then transformed into TEI XML for publication. It also advocates for a generalized use of layout segmentation in digitization pipelines.
Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Laurent Romary and Benoît Sagot. 2024. Layout Analysis Dataset with SegmOnto. In DH2024 - Annual conference of the Alliance of Digital Humanities Organizations. Washington DC, United States.

Ariane Pinche, Thibault Clérice, Alix Chagué, Jean-Baptiste Camps, Malamatenia Vlachou-Efstathiou, Matthias Gille Levenson, Olivier Brisville-Fertin, Federico Boschetti, Franz Fischer, Michael Gervers, Agnès Boutreux, Avery Manton, Simon Gabay, Wouter Haverals, Mike Kestemont, Caroline Vandyck and Patricia O'Connor. 2024. CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts. In Digital Humanities - DH2024. Washington DC, United States.

Books

Benoît Sagot. 2024. Apprendre les langues aux machines. 325 Éditions du Collège de France.

In autumn 2022, the launch of ChatGPT placed artificial intelligence at the centre of the news. Everyone was able to try out this conversational agent and gauge its power, but how it works has remained a mystery to many. This inaugural lecture lifts the veil on the field of research to which it owes its existence: natural language processing, or NLP (traitement automatique des langues, TAL). Step by step, the author takes us through the history of NLP in order to bring out the current challenges of this discipline, which is as old as computer science and strives to teach languages to machines. How did we arrive at machine learning, neural networks and generative models? Which ethical aspects call for our vigilance in the face of accelerating research and innovation? In the end, is ChatGPT really a revolution?

Tech reports

Ziqian Peng, Rachel Bawden and François Yvon. 2024. Model Cards for the MaTOS Project. Technical report.

Nicolas Dahan, Rachel Bawden and François Yvon. 2024. Survey of Automatic Metrics for Evaluating Machine Translation at the Document Level. Technical report.

This report presents a survey of document-level automatic metrics for machine translation (MT), addressing the need for sophisticated evaluation methods that extend beyond sentence-level assessments. Traditional metrics, which evaluate translations on a sentence-by-sentence basis, often fail to capture the complexities of discourse phenomena, leading to gaps in assessing coherence, cohesion, and cross-sentence dependencies. The report starts by introducing the terminology and notation relevant to document-level MT evaluation. It then describes the linguistic phenomena that are crucial at the document level, related for example to lexical and grammatical cohesion, and overall text coherence, which pose significant challenges for MT systems. Following this, we explore human evaluation protocols targeting document-level translation, discussing the methodologies used to judge translation quality in a more holistic manner. Studying human judgments is necessary, as automatic metrics often aim at reproducing them. We also examine the various test sets that have been developed to support the evaluation of document-level MT. The core of the survey focuses on automatic evaluation metrics designed for document-level translation. These metrics aim to provide a more accurate representation of translation quality by considering the broader context and long-range dependencies within a text, offering a more comprehensive assessment than sentence-level metrics. The report concludes with an overview of the current trends in document-level MT evaluation, summarizing key challenges and identifying areas for future research. It emphasizes the need for the development of context-aware metrics and the importance of creating standardized, document-level test sets to advance MT evaluation.
Alix Chagué, Floriane Chiffoleau, Matthias Gille Levenson, Hugo Scheithauer and Ariane Pinche. 2024. Chaînes d'acquisition, de traitement et de publication du texte. Technical report.

Born in the context of the Ariane-HN consortium, and in response to the growing integration of artificial intelligence into the production of texts in the humanities, this deliverable aims to present the different stages of a text acquisition pipeline, from transcription to online publication (acquisition, data modelling, enrichment and online publication). The proposed protocols are not constrained by strict editorial principles; they are flexible, adaptable and independent of any particular tool. Indeed, confining this reflection to a software pipeline would carry risks, notably because of software obsolescence, the diversity of needs and the level of complexity of the tasks tied to the particularities of each corpus. This is why we have preferred to keep to a theoretical pipeline that can be adapted to the technical solutions available. Thus, depending on the resources at hand and the objectives of each project, we propose two paths here: a simple path that requires few engineering skills, and a more complex path that adds a number of automation tasks to text acquisition and enrichment, requiring greater mastery of the technical tools as well as a deeper understanding of their scientific implications.
Ziqian Peng, Rachel Bawden and François Yvon. 2024. Handling Very Long Contexts in Neural Machine Translation: a Survey. Technical report.

This report examines methods for integrating an extended discourse context in machine translation, focusing on neural translation methods. Machine translation systems generally translate each sentence independently of its neighbors, which yields systematic errors resulting from a limited discourse context. Therefore, various approaches have been proposed to incorporate cross-sentential context, mostly based on the predominant Transformer architecture. Recently, the introduction of large language models (LLMs) also created novel opportunities to process long-range dependencies, inspiring several context-aware machine translation approaches. We present the challenges of translating long inputs, then investigate encoder-decoder architectures and LLM-based approaches, with a brief overview of efficient transformer implementations as a common background. Furthermore, we discuss strategies to extend other NLP tasks to a longer context, and list recently available open-source document-level parallel corpora for future exploration. We conclude with a summary of current work and the main research directions.

Other

Sarah Bénière. 2024. DataCatalogue : Restructurer automatiquement les catalogues de ventes.

Presentation of the DataCatalogue project and its processing pipeline as part of the "Panorama de projets" course given to students of the M2 TNAH at the École nationale des chartes on 24 January 2024.
Sarah Bénière. 2024. TEI Publisher: A Platform for Digital Editions.

Preprints

Wissam Antoun, Francis Kulumba, Rian Touchent, Eric Villemonte de La Clergerie, Benoît Sagot and Djamé Seddah. 2024. CamemBERT 2.0: A Smarter French Language Model Aged to Perfection. Preprint.

French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Hugging Face.
Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Florian Cafiero, Laurent Romary, Simon Gabay and Benoît Sagot. 2024. Diachronic Document Dataset for Semantic Layout Analysis. Preprint.

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.
Matthieu Futeral, Cordelia Schmid, Benoît Sagot and Rachel Bawden. 2024. Towards Zero-Shot Multimodal Machine Translation. Preprint.

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e. models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
Rasul Dent, Juliette Janes, Thibault Clérice, Pedro Ortiz Suarez and Benoît Sagot. 2024. Molyé: A Corpus-based Approach to Language Contact in Colonial France. Preprint.

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Armel Zebaze, Benoît Sagot and Rachel Bawden. 2024. In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation. Preprint.

The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples. However, no systematic studies have been published on how best to select examples, and mixed results have been reported on the usefulness of similarity-based selection over random selection. We provide a study covering multiple LLMs and multiple in-context example retrieval strategies, comparing multilingual sentence embeddings. We cover several language directions, representing different levels of language resourcedness (English into French, German, Swahili and Wolof). Contrary to previously published results, we find that sentence embedding similarity can improve MT, especially for low-resource language directions, and discuss the balance between selection pool diversity and quality. We also highlight potential problems with the evaluation of LLM-based MT and suggest a more appropriate evaluation protocol, adapting the COMET metric to the evaluation of LLMs. Code and outputs are freely available at https://github.com/ArmelRandy/ICL-MT.
Francis Kulumba, Wissam Antoun, Guillaume Vimont and Laurent Romary. 2024. Harvesting Textual and Structured Data from the HAL Publication Repository. Preprint.

HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents, spanning 56 languages across 13 identified domains, suitable for language model training, and yielding approximately 16.5 billion tokens (with 8 billion in French and 7 billion in English, the most represented languages). We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers, and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden and Benoît Sagot. 2024. mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus. Preprint.

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2024. Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck. Preprint.

Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models based on fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
Alix Chagué and Hugo Scheithauer. 2024. Do (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts. Preprint.

We present an experiment conducted on the augmentation of older grayscale datasets designed for automatic text recognition on contemporary handwriting (IAM-Database). The augmentation method relies on the addition of colored backgrounds taken from real-world historical blank pages and allows us to create an enhanced version of IAM-Database. We train various transcription models, varying the composition of the training and validation sets using the original and enhanced IAM-Database. We test the resulting models against the original and enhanced test sets, as well as a test set composed of real-world historical documents. We find that, although the transcription engine proves robust to color changes, this technique could be used to bring older grayscale datasets up to speed in order to create transcription models that are efficient on historical handwriting. Additionally, we consider the environmental costs of using enhanced data as opposed to the original dataset, and find that the impact is minor.
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot and Emmanuel Dupoux. 2024. SpiRit-LM: Interleaved Spoken and Written Language Model. Preprint.

We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
Rachel Bawden, Hatim Bourfoune, Bertrand Cabot, Nathan Cassereau, Pierre Cornette, Marco Naguib, Aurélie Névéol and François Yvon. 2024. Les modèles Bloom pour le traitement automatique de la langue française. Preprint.

The development of very large language models, capable of performing a large range of automatic language processing tasks, also requires developing the infrastructure needed to evaluate these models, ideally covering as many tasks as possible. Numerous benchmarks have already been compiled for the English language, making it possible to evaluate these large models from multiple angles. Several multilingual test sets are also available, with much more limited coverage, which are used to measure the ability of these models to handle multiple languages. In this paper, we present our efforts to assemble a multi-task evaluation set for French, which is then used to evaluate models from the BLOOM family. Our results confirm and complement the main evaluation results for BLOOM in English; they allow us to conclude that the performances obtained in French and English are very similar and even better when the prompts used at inference are written in the same language as the texts to analyze.

2023

PhD theses and Habilitations

Lionel Tadonfouet Tadjou. 2023. Constitution de fils de discussion cohérents à partir de conversations issues d'outils professionnels de communication et de collaboration. PhD thesis. Sorbonne Université.

Constituting coherent threads of conversation from professional communication and collaboration tools is a process of transforming a written, asynchronous conversation into sub-conversations, each dealing with a specific topic while maintaining the order of arrival of the messages sent by interlocutors in the original conversation. These sub-conversations thus result in linear or tree-like conversation structures. This process can be applied to forum discussions but also to e-mail conversations, both examples being more generally representative of Computer Mediated Content (CMC). To build up these sub-threads of e-mail conversations, we need to rely on their metadata and content. In practice, however, these elements do not seem sufficient. An e-mail conversation is, in fact, a dialogue with a discursive structure that is potentially useful for tracking the evolution of the discussion. It should be noted, however, that this dialogue is asynchronous, which emphasizes specificities. In synchronous dialogues, very strong relationships often emerge between consecutive utterances, which in a long discussion can form clusters of sub-conversations. The constitution of conversation sub-threads from main conversations is based on this type of relationship between the sentences of successive e-mails in a conversation; this type of relationship is referred to as transverse. Unlike dialogues, where such relations can easily be identified, this is a very complex task in e-mail conversations and constitutes the main sub-problem, called statement matching, for which we suggest several resolution methods. Conversations generally abound in linguistic and paralinguistic information, among which are dialogue acts. They very often help to better identify the content of a dialogue and could strongly contribute to constituting conversation sub-threads via a better identification of relations between utterances.
This is the hypothesis we state in the context of solving the statement matching problem, based on an initial phase of classification of dialogue statements. This manuscript describes the work related to our core problem, as well as the sub-problems mentioned above. Around this main focus, we address various related but important, necessary or useful aspects. Thus, we take an in-depth look at CMC, discourse analysis and its historicity, as well as the available corpora for approaching such problems. Then we offer different resolution methods for our sub-problems, with well-detailed experiments and evaluations of said methods. Finally, our manuscript concludes with the following proposals: the application of the proposed methods to other types of CMC, such as forums, and other possibilities to be explored to solve the problem of constituting conversational sub-threads.
José Rosales Núñez. 2023. Machine Translation of User-Generated Contents : an Evaluation of Neural Translation Systems under Zero-shot Conditions. PhD thesis. Université Paris-Saclay.

The rapid advancements in telecommunications over the past few decades have revolutionized the way people exchange information. Thanks to these advancements, the average user can now communicate with others across the globe in real-time and with minimal delay. With approximately 60% of the global population having Internet access, billions of individuals interact by sharing user-generated content (UGC) in various forms. This UGC, which often includes reviews and opinions, provides a valuable source of information, offering a comprehensive view of global trends. Machine Translation (MT) plays a vital role in enabling smooth communication and facilitating the automatic processing of UGC for data mining purposes. However, translating UGC presents unique challenges compared to translating traditional text. UGC is highly productive and exhibits various phenomena such as repeated characters, typographical errors, contractions, jargon, and unconventional sentence structures. These specificities lead to a significant number of Out-of-Vocabulary tokens (OOVs) and rare sequences, which pose problems since they are not adequately represented in the standard parallel corpora used to train MT models. Additionally, conventional domain adaptation techniques like fine-tuning have limited success in addressing these challenges. They suffer from performance degradation when applied to in-domain data and are unable to keep up with the ever-evolving nature of UGC. In this study, we focus on the task of automatically translating UGC in the zero-shot scenario, where we refrain from using any UGC-specific training data. Our aim is to develop more generalized MT architectures that can handle the distributional drift inherent in UGC. In the initial phase of our research, we dedicated our efforts to identifying and quantifying the specificities of UGC that hinder translation performance. We have also created evaluation frameworks and data collections to aid in this endeavor.
Using off-the-shelf models, we investigate the challenges faced by MT systems when translating UGC and link the errors to their underlying mechanisms. Subsequently, we delve into the study and proposal of different methods to address the challenges posed by UGC. These methods include exploring normalization pipelines, employing more granular tokenization techniques, and utilizing latent variable models to enhance the robustness of MT systems. For each of these approaches, we systematically evaluate the performance and robustness of the systems, conduct a detailed error analysis, and offer insights into promising avenues for tackling the automatic translation of UGC in the zero-shot setting.

Journal articles

Rute Costa, Ana Salgado, Margarida Ramos, Sara Carvalho, Fahad Khan, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2023. A crossroad between lexicography and terminology work: Knowledge organization and domain labelling. Digital Scholarship in the Humanities 38 pages i17–i29. Oxford University Press.

The MORDigital project aims to encode the selected editions of Diccionario de Lingua Portugueza by António de Morais Silva, first published in 1789. Our ultimate goals are, on the one hand, to promote accessibility to cultural heritage while fostering reusability and, on the other hand, to contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards. The Morais dictionary represents a significant legacy, since it marks the beginning of Portuguese dictionaries, having served as a model for all subsequent lexicographic production. The team follows a new paradigm in lexicography, which results from the convergence between lexicography, terminology, computational linguistics, and ontologies as an integral part of digital humanities and linked (open) data. In the Portuguese context, this research fills a gap concerning searchable online retrodigitized dictionaries, built on current standards and methodologies which promote data sharing and harmonization, namely TEI Lex-0. The team will further ensure the connection to other existing systems and lexical resources, particularly in the Portuguese-speaking world.
Simon Gabay, Philippe Gambette, Rachel Bawden and Benoît Sagot. 2023. Ancien ou moderne ? Pistes computationnelles pour l'analyse graphématique des textes écrits au XVIIe siècle. Linx 85 Presses Universitaires de Paris Nanterre.

The use of contemporary spelling rather than old graphic systems in the vast majority of current editions of 17th century French texts has the unfortunate effect of masking their graphematic richness. Such valuable information has remained concealed and therefore under-exploited, despite the potential it holds in terms of analysis. By favouring a practical corpus-based approach, rather than a theoretical one, and by relying on a recategorisation of the various competing systems at that time in French scriptae, we propose the foundations of a scriptometric study of the classical language, focusing on the analysis of specific documents, both manuscripts and old prints.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2023. Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics 11 pages 250–266. The MIT Press.

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.
Thibault Clérice, Malamatenia Vlachou-Efstathiou and Alix Chagué. 2023. CREMMA Medii Aevi: Literary manuscript text recognition in Latin. Journal of Open Humanities Data 9 pages 1–19. Ubiquity Press.

This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin, from the 11th to the 16th century. It connects with Medieval French datasets as well as earlier Latin datasets by enforcing common guidelines. We provide our own addition to Ariane Pinche's Old French guidelines to deal with Latin-specific cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviating marks, we offer new models that outperform the base Old French model on Latin datasets, reaching readability levels on unknown manuscripts.
Thibault Clérice. 2023. You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine. Journal of Data Mining and Digital Humanities Historical Documents and... INRIA.

Layout Analysis (the identification of zones and their classification) is, alongside line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy outputs. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target for the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from a pixel classification-based polygonization to an object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents as well as a new package, YALTAi, which injects YOLOv5 in the segmentation pipeline of Kraken 4.1.

Conference proceedings

Marc Hulcelle, Giovanna Varni, Nicolas Rollet and Chloé Clavel. 2023. Comparing a Mentalist and an Interactionist Approach for Trust Analysis in Human-Robot Interaction. In Proceedings of the 11th International Conference on Human-Agent Interaction. pages 273–280. ACM. Gothenburg, Sweden.

Yanzhu Guo, Guokan Shang, Virgile Rennard, Michalis Vazirgiannis and Chloé Clavel. 2023. Automatic Analysis of Substantiation in Scientific Peer Reviews. In Findings of the Association for Computational Linguistics: EMNLP 2023. pages 10198–10216. Association for Computational Linguistics. Singapore.

With the increasing amount of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation — one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence — and provide a solution automatizing this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights on peer reviewing quality in NLP conferences over recent years. The dataset is available at https://github.com/YanzhuGuo/SubstanReview.
Robin Algayres, Yossi Adi, Tu Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot and Emmanuel Dupoux. 2023. Generative Spoken Language Model based on continuous word-sized audio tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pages 3008–3028. Association for Computational Linguistics. Singapore.

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
Robin Algayres, Pablo Diego-Simon, Benoît Sagot and Emmanuel Dupoux. 2023. XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words. In Findings of the Association for Computational Linguistics: EMNLP 2023. pages 12103–12112. Association for Computational Linguistics. Singapore.

Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models that have proved to quickly adapt to new tasks through fine-tuning, even in low resource conditions. Taking inspiration from semi-supervised learning, we finetune an XLS-R model to predict word boundaries themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is finetuned, it is used to infer new word boundary labels that are used in turn for another finetuning step. Our method consistently improves the performance of each system and sets a new state-of-the-art that is, on average, 130% higher than the previous one as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
Simon Meoni, Eric De la Clergerie and Theo Ryffel. 2023. Large Language Models as Instructors: A Study on Multilingual Clinical Entity Extraction. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. pages 178–190. Association for Computational Linguistics. Toronto, Canada.

In clinical and other specialized domains, data are scarce due to their confidential nature. This lack of data is a major problem when finetuning language models. Nevertheless, very large language models (LLMs) are promising for the medical domain but cannot be used directly in healthcare facilities due to data confidentiality issues. We explore an approach of annotating training data with LLMs to train smaller models more adapted to our problem. We show that this method yields promising results for information extraction tasks.
José Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2023. Multi-way Variational NMT for UGC: Improving Robustness in Zero-shot Scenarios via Mixture Density Networks. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). pages 447–459. University of Tartu Library. Tórshavn, Faroe Islands.

This work presents a novel Variational Neural Machine Translation (VNMT) architecture with enhanced robustness properties, which we investigate through a detailed case-study addressing noisy French user-generated content (UGC) translation to English. We show that the proposed model, with results comparable or superior to state-of-the-art VNMT, improves performance over UGC translation in a zero-shot evaluation scenario while keeping optimal translation scores on in-domain test sets. We elaborate on such results by visualizing and explaining how neural learning representations behave when processing UGC noise. In addition, we show that VNMT enforces robustness to the learned embeddings, which can be later used for robust transfer learning approaches.
Rachel Bawden and Benoît Sagot. 2023. RoCS-MT: Robustness Challenge Set for Machine Translation. In Proceedings of the Eighth Conference on Machine Translation. pages 198–216. Association for Computational Linguistics. Singapore.

RoCS-MT, a Robust Challenge Set for Machine Translation (MT), is designed to test MT systems' ability to translate user-generated content (UGC) that displays non-standard characteristics, such as spelling errors, devowelling, acronymisation, etc. RoCS-MT is composed of English comments from Reddit, selected for their non-standard nature, which have been manually normalised and professionally translated into five languages: French, German, Czech, Ukrainian and Russian. In the context of the WMT23 test suite shared task, we analyse the models submitted to the general MT task for all from-English language pairs, offering some insights into the types of problems faced by state-of-the-art MT models when dealing with non-standard UGC texts. We compare automatic metrics for MT quality, including quality estimation, to see if the same conclusions can be drawn without references. In terms of robustness, we find that many of the systems struggle with non-standard variants of words (e.g. due to phonetically inspired spellings, contraction, truncations, etc.), but that this depends on the system and the amount of training data, with the best overall systems performing better across all phenomena. GPT4 is the clear frontrunner. However, we caution against drawing conclusions about generalisation capacity as it and other systems could be trained on the source side of RoCS and also on similar data.
Mariana Neves, Antonio Jimeno Yepes, Aurélie Névéol, Rachel Bawden, Giorgio Maria Di Nunzio, Roland Roller, Philippe Thomas, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova, Dina Wiemann and Cristian Grozea. 2023. Findings of the WMT 2023 Biomedical Translation Shared Task: Evaluation of ChatGPT 3.5 as a Comparison System. In Proceedings of the Eighth Conference on Machine Translation. pages 43–54. Association for Computational Linguistics. Singapore.

We present an overview of the Biomedical Translation Task that was part of the Eighth Conference on Machine Translation (WMT23). The aim of the task was the automatic translation of biomedical abstracts from the PubMed database. It included twelve language directions, namely, French, Spanish, Portuguese, Italian, German, and Russian, from and into English. We received submissions from 18 systems and for all the test sets that we released. Our comparison system was based on ChatGPT 3.5 and performed very well in comparison to many of the submissions.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović and Mariya Shmatova. 2023. Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet. In Proceedings of the Eighth Conference on Machine Translation. pages 1–42. Association for Computational Linguistics. Singapore.

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (covering 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).
Valentin Taillandier, Dieuwke Hupkes, Benoît Sagot, Emmanuel Dupoux and Paul Michel. 2023. Neural Agents Struggle to Take Turns in Bidirectional Emergent Communication. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023). Kigali, Rwanda.

The spontaneous exchange of turns is a central aspect of human communication. Although turn-taking conventions come to us naturally, artificial dialogue agents struggle to coordinate, and must rely on hard-coded rules to engage in interactive conversations with human interlocutors. In this paper, we investigate the conditions under which artificial agents may naturally develop turn-taking conventions in a simple language game. We describe a cooperative task where success is contingent on the exchange of information along a shared communication channel where talking over each other hinders communication. Despite these environmental constraints, neural-network based agents trained to solve this task with reinforcement learning do not systematically adopt turn-taking conventions. However, we find that agents that do agree on turn-taking protocols end up performing better. Moreover, agents that are forced to perform turn-taking can learn to solve the task more quickly. This suggests that turn-taking may help to generate conversations that are easier for speakers to interpret.
Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswami, Changhan Wang, Juan Pino, Benoît Sagot and Holger Schwenk. 2023. SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 16251–16269. Association for Computational Linguistics. Toronto, Canada.

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations (S2ST) mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on Europarl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pretraining and sparse scaling using Mixture-of-Experts bring large gains to translation performance. We are open-sourcing the mined data, speech encoders used for mining, multilingual HuBERT models in four language families for target unit generation, language-specific vocoders for speech synthesis from discrete units, and S2S models trained and presented in this work.
Paul-Ambroise Duquenne, Holger Schwenk and Benoît Sagot. 2023. Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer. In Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023). Dublin, Ireland.

Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice and Jade Norindr. 2023. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. In Proceedings of Computational Humanities Research 2023. pages 734–756. Paris, France.

In this paper, we test a famous conjecture in literary history put forward by Seignobos and de Rougemont according to which the French central medieval period (12-13th centuries) is characterized by an important increase in the cultural importance of love. To do that, we focus on the large and culturally important body of manuscripts containing medieval French long narrative fictions, in particular epics (chansons de geste, of the Matter of France) and romances (chiefly romans on the Matters of Britain and of Rome), both in verse and in prose, from the 12th to the 15th century. We introduce the largest available corpus of these texts, the Corpus of Medieval French Epics and Romances, composed of digitised manuscripts drawn from Gallica, and processed through layout analysis and handwritten text recognition. We then use semantic representations based on embeddings to monitor the place given to love and violence in this corpus, through time. We observe that themes (such as the relation between love and death) and emblematic works well identified by literary history do indeed play a central part in the representation of love in the corpus, but our modelling also points to the characteristic nature of more overlooked works. Variation in time seems to show that there is indeed a phase of expansion of love in these fictions, in the 13th and early 14th century, followed by a period of contraction, which seems to correlate with the Crisis of the Late Middle Ages.
Arij Riabi, Menel Mahamdi and Djamé Seddah. 2023. Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). pages 266–278. Association for Computational Linguistics. Toronto, Canada.

In this paper we address the scarcity of annotated data for NArabizi, a Romanized form of North African Arabic used mostly on social media, which poses challenges for Natural Language Processing (NLP). We introduce an enriched version of NArabizi Treebank (Seddah et al., 2020) with three main contributions: the addition of two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensure annotation consistency. Our experimental results, using different tokenization schemes, showcase the value of our contributions and highlight the impact of working with non-gold tokenization for NER and dependency parsing. To facilitate future research, we make these annotations publicly available. Our enhanced NArabizi Treebank paves the way for creating sophisticated language models and NLP tools for this under-represented language.
Galo Castillo-lópez, Arij Riabi and Djamé Seddah. 2023. Analyzing Zero-Shot transfer Scenarios across Spanish variants for Hate Speech Detection. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). pages 1–13. Association for Computational Linguistics. Dubrovnik, Croatia.

Hate speech detection in online platforms has been widely studied in the past. Most of these works were conducted in English and a few rich-resource languages. Recent approaches tailored for low-resource languages have explored the interests of zero-shot cross-lingual transfer learning models in resource-scarce scenarios. However, language variation between geolects, such as American English and British English, or Latin-American Spanish and European Spanish, is still a problem for NLP models, which often rely on (latent) lexical information for their classification tasks. More importantly, the cultural aspect, crucial for hate speech detection, is often overlooked. In this work, we present the results of a thorough analysis of hate speech detection models' performance on different variants of Spanish, including a new Twitter data set of hate speech toward immigrants that we built to cover these variants. Using mBERT and Beto, a monolingual Spanish BERT-based language model, as the basis of our transfer learning architecture, our results indicate that hate speech detection models for a given Spanish variant are affected when different variations of such language are not considered. Hate speech expressions could vary from region to region where the same language is spoken.
Alafate Abulimiti, Chloé Clavel and Justine Cassell. 2023. When to generate hedges in peer-tutoring interactions. In SIGDIAL - 24th Meeting of the Special Interest Group on Discourse and Dialogue. Prague, Czech Republic.

This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviours. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models. Results show that embedding layers, which capture the semantic information of the previous turns, significantly improve the model's performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviours, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.
Thibault Clérice and Anthony Glaise. 2023. Twenty-One* Pseudo-Chrysostoms and more: authorship verification in the patristic world. In Proceedings of the Computational Humanities Research Conference 2023. Paris, France.

As the most prolific of the Church Fathers, John Chrysostom (344-407 CE) has a vast textual mass and theological importance that has led to a significant misattribution of texts, resulting in the existence of a second corpus known as the pseudo-Chrysostomian corpus. Like many Greek-language Church Fathers' works, this corpus comprises anonymous texts, which scholars have attempted to reattribute or group together based on factors such as the person's function, biography, ideology, style, etc. One survey conducted by Voicu in 1981 explored potential groupings of such texts and produced a critical list of 21 Pseudo-Chrysostom works identified by scholars, including Montfaucon (1655-1741), one of the first modern editors of Chrysostom's writings. In this paper, we present a novel approach to addressing pseudonymous work in the context of chrysostomian studies. We propose to employ siamese networks within an authorship verification framework, following the methodology commonly used in recent computational linguistic competitions. Our embedding model is trained using commonly used features in the digital humanities landscape, such as the most frequent words, affixes, and POS trigrams, utilizing a signal-to-noise ratio distance and pair mining. The results of our model show high AUCROC scores (0.855). Furthermore, the article concludes with an analysis of the pseudo-Chrysostoms proposed by Voicu. We validate a significant portion of the hypotheses found in Voicu's survey while also providing counter-arguments for two Pseudo-Chrysostoms. This research contributes to shedding light on the attribution of ancient texts and enriches the field of chrysostomian studies.
Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux and Yossi Adi. 2023. Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). pages 465–477. Association for Computational Linguistics. Toronto, Canada (in-person and online).

Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
Tu Anh Nguyen, Wei-Ning Hsu, Antony d'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi and Emmanuel Dupoux. 2023. Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis. In Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023). pages 4823–4827. ISCA. Dublin, Ireland.

Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce EXPRESSO, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics and baseline models are open sourced.
Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux and Abdelrahman Mohamed. 2023. Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. Ixia-Ialyssos, Greece.

The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of learned units is studied intrinsically using zero speech metrics and on the downstream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
Maud Bénard, Alexandra Mestivier, Natalie Kubler, Lichao Zhu, Rachel Bawden, Eric De La Clergerie, Laurent Romary, Mathilde Huguin, Jean-François Nominé, Ziqian Peng and François Yvon. 2023. MaTOS: Traduction automatique pour la science ouverte. In Actes de CORIA-TALN 2023. Actes de l'atelier «Analyse et Recherche de Textes Scientifiques» (ARTS)@TALN 2023. pages 8–15. ATALA. Paris, France.

This contribution presents the MaTOS (Machine Translation for Open Science) project, which aims to develop new methods for the complete machine translation (MT) of scientific documents between English and French, as well as automatic metrics to evaluate the translation quality. To this end, MaTOS is interested in (a) the collection of open resources for specialised MT; (b) the description of textual coherence markers for scientific articles; (c) the development of new multilingual processing methods for documents; and (d) metrics to measure progress in document-level machine translation.
Simon Meoni, Rian Touchent and Eric De La Clergerie. 2023. Passe ta pharma d'abord ! In Actes de CORIA-TALN 2023. Actes du Défi Fouille de Textes@TALN2023. pages 68–76. ATALA. Paris, France.

We present the three experiments conducted by the ALMAnaCH - Arkhn team and their results for the DÉfi Fouille de Textes (DEFT) 2023 challenge. The scores are encouraging but, above all, point to new factors that must be taken into account to succeed at this challenge. We explored different approaches with models of varying sizes and modelled the task in different ways (multi-label classification, textual entailment, sequence-to-sequence). We did not observe significant performance gains. Our experiments seem to show that external knowledge bases are needed to obtain good results on this type of task.
Lionel Tadonfouet Tadjou, Eric De La Clergerie, Fabrice Bourge and Tiphaine Marie. 2023. Constitution de sous-fils de conversations d'emails. In Actes de CORIA-TALN 2023. Actes de la 18e Conférence en Recherche d'Information et Applications (CORIA). pages 157–171. ATALA. Paris, France.

Email conversations in the workplace are sometimes difficult for collaborators to follow because they can deal with multiple topics and involve many interlocutors. To improve understanding of key messages, it’s helpful to create subthreads within the conversation. In our study, we propose a two-stage pipeline to recognize dialogue acts in email text segments and link them to improve information accessibility. This pipeline creates pairs of text segments across the conversation, making it easier to understand the key messages. To our knowledge, this is the first time this issue of creating conversation threads has been addressed in email conversations. We annotated the BC3 corpus of emails with dialogue acts and linked conversation email text segments.
Lydia Nishimwe. 2023. Normalisation lexicale de contenus générés par les utilisateurs sur les réseaux sociaux. In Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL). pages 160–183. ATALA. Paris, France.

The boom of natural language processing (NLP) is taking place in a world where more and more content is produced online. On social networks especially, textual content published by users is full of “non-standard” phenomena such as spelling mistakes, jargon, marks of expressiveness, etc. Thus, NLP models, which are largely trained on “standard” data, suffer a decline in performance when applied to user-generated content (UGC). One approach to mitigate this degradation is through lexical normalisation, where non-standard words are replaced by their standard forms. In this paper, we review the state of the art of lexical normalisation of UGC and run a preliminary experimental study to show the advantages and difficulties of this task.
Simon Meoni, Théo Ryffel and Eric De La Clergerie. 2023. Annotation d'entités cliniques en utilisant les Larges Modèles de Langue. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 190–203. ATALA. Paris, France.

In the clinical domain and other specialised domains, data is scarce because of its confidential nature. This lack of data is a major problem when fine-tuning language models. Moreover, very large language models (LLMs) show promising performance in the medical domain. Nevertheless, they cannot be used directly within the infrastructure of healthcare institutions for data confidentiality reasons. We explore an approach that annotates training data with LLMs in order to train smaller models better suited to our problem. This method gives promising results for information extraction tasks.
You Zuo, Benoît Sagot, Kim Gerdes, Houda Mouzoun and Samir Ghamri Doudane. 2023. Exploring Data-Centric Strategies for French Patent Classification: A Baseline and Comparisons. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 349–365. ATALA. Paris, France.

This paper proposes a novel approach to French patent classification leveraging data-centric strategies. We compare different approaches for the two deepest levels of the IPC hierarchy: the IPC group and subgroups. Our experiments show that while simple ensemble strategies work for shallower levels, deeper levels require more sophisticated techniques such as data augmentation, clustering, and negative sampling. Our research highlights the importance of language-specific features and data-centric strategies for accurate and reliable French patent classification. It provides valuable insights and solutions for researchers and practitioners in the field of patent classification, advancing research in French patent classification.
Rian Touchent, Laurent Romary and Eric De La Clergerie. 2023. CamemBERT-bio : Un modèle de langue français savoureux et meilleur pour la santé. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 323–334. ATALA. Paris, France.

Clinical data in hospitals is increasingly accessible for research through health data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical reports. The use of transfer learning with BERT-type models such as CamemBERT has enabled major advances, notably for named entity recognition. However, these models are trained on general-domain language and perform less well on biomedical data. This is why we propose a new public French biomedical dataset on which we continued the pre-training of CamemBERT. We thus present a first version of CamemBERT-bio, a public model specialised for the French biomedical domain, which shows an average gain of 2.54 F-measure points across different biomedical named entity recognition evaluation sets.
Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot and Rachel Bawden. 2023. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 28–42. ATALA. Paris, France.

Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
Wissam Antoun, Virginie Mouilleron, Benoît Sagot and Djamé Seddah. 2023. Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect? In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 14–27. ATALA. Paris, France.

Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into French and training a classifier on the translated data. Results show that the detectors can effectively detect ChatGPT-generated text, with a degree of robustness against basic attack techniques in in-domain settings. However, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. The study emphasizes caution when applying in-domain testing results to a wider variety of content. We provide our translated datasets and models as open-source resources.
Francesca Frontini, Laurent Romary and Anas Fahad Khan. 2023. ISO LMF 24613-6: A Revised Syntax Semantics Module for the Lexical Markup Framework. In Proceedings of the 4th Conference on Language, Data and Knowledge. pages 316–321. NOVA CLUNL, Portugal. Vienna, Austria.

The Lexical Markup Framework (LMF) is a meta-model for representing data in monolingual and multilingual lexical databases with a view to its use in computer applications. The "new LMF" replaces the old LMF standard, ISO 24613:2008, and is being published as a multi-part standard. This short paper introduces one of these new parts, ISO 24613-6, namely the Syntax and Semantics (SynSem) module. The SynSem module allows for the description of syntactic and semantic properties of lexemes, as well as the complex interactions between them. While the new standard remains faithful to (and backwards compatible with) the syntax and semantics coverage of the previous model, the new standard clarifies and simplifies it in a few places, which will be illustrated.
Alix Chagué, Thibault Clérice, Jade Norindr, Maxime Humeau, Baudoin Davoury, Elsa Van Kote, Anaïs Mazoue, Margaux Faure and Soline Doat. 2023. Manu McFrench, from zero to hero: impact of using a generic handwriting recognition model for smaller datasets. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.

Long paper presentation for ADHO's annual conference on Digital Humanities (2023), discussing the importance of using generic transcription models for HTR and how to create them. We use the case of the CREMMA datasets and the Manu McFrench models as an example.
Thibault Clérice, Alix Chagué and Hugo Scheithauer. 2023. Workshop HTR-United: metadata, quality control and sharing process for HTR training data. In DH 2023 - Digital Humanities Conference: Collaboration as Opportunity. Graz, Austria.

Workshop for ADHO's 2023 conference on Digital Humanities, introducing HTR-United's main features and demonstrating how to use them, on top of presenting essential Continuous Integration principles.
Alix Chagué and Thibault Clérice. 2023. ''I'm here to fight for ground truth'': HTR-United, a solution towards a common for HTR training data. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.

Short paper presentation for ADHO's annual conference on the Digital Humanities (DH2023), introducing the HTR-United infrastructure and the stakes of sharing training datasets for HTR of historical documents.
Sonal Sannigrahi and Rachel Bawden. 2023. Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation. pages 181–192. European Association for Machine Translation. Tampere, Finland.

Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements and our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences even for relatively low-resource languages. Our code will be made publicly available.
Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation. pages 157–170. European Association for Machine Translation. Tampere, Finland.

The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from overgeneration and generating in the wrong language, but this is greatly improved in the few-shot setting, with very good results for a number of language pairs. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot and Rachel Bawden. 2023. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 5394–5413. Association for Computational Linguistics. Toronto, Canada.

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters, a novel guided self-attention mechanism and which is jointly trained on both visually-conditioned masking and MMT. We also introduce CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation set of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks and outperforms baselines and state-of-the-art MMT systems by a large margin on our contrastive test set. Our code and CoMMuTE are freely available.
Wissam Antoun, Benoît Sagot and Djamé Seddah. 2023. Data-Efficient French Language Modeling with CamemBERTa. In Findings of the Association for Computational Linguistics: ACL 2023. pages 5174–5185. Association for Computational Linguistics. Toronto, Canada.

Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model's performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective. https://gitlab.inria.fr/almanach/CamemBERTa
El Haff Karim, Wissam Antoun, Florence Le Ber and Véronique Pitchon. 2023. Reconnaissance des entités nommées pour l'analyse des pharmacopées médiévales. In Proceedings of EGC 2023 - Extraction et Gestion des Connaissances. pages 329–336. Lyon, France.

Today, many projects focus on the application of linguistic technologies on modern medical corpora, especially in the field of Named Entity Recognition. Besides, ancient pharmacopoeias are being explored with manual data entry by specialists in history and biology in order to extract knowledge. These analyses are carried out without necessarily going through the automatic recognition of named entities which could accelerate the exploration of the manuscripts. Therefore, we propose here a link between the two practices by: (1) creating a named entity recognition dataset for English translations of medieval Arabic pharmacopoeias and (2) training and evaluating language models that are pre-trained on multiple domains.

Communications

Hugo Scheithauer, Sarah Bénière, Jean-Philippe Moreux and Laurent Romary. 2023. DataCatalogue : rétro-structuration automatique des catalogues de vente. In Webinaire Culture-Inria. Paris, France.

Hugo Scheithauer. 2023. DataCatalogue : Un projet pour la restructuration automatique de catalogues de vente. In Traitements automatiques pour les humanités numériques - corpus d'histoire de l'art, d'enseignement, d'urbanisme. Nanterre, France.

Chahan Vidal-Gorène, Jean-Baptiste Camps and Thibault Clérice. 2023. Synthetic lines from historical manuscripts: an experiment using GAN and style transfer. In Visual Processing of Digital Manuscripts: Workflows, Pipelines, Best Practices. ICIAP 2023 Workshops. ICIAP 2023. Udine, Italy.

Given enough data of sufficient quality, HTR systems can achieve high accuracy, regardless of language, script or medium. Despite growing pooling of datasets, the question of the required quantity of training material still remains crucial for the transfer of models to out-of-domain documents, or the recognition of new scripts and under-resourced character classes. We propose a new data augmentation strategy, using generative adversarial networks (GAN). Inspired by synthetic lines generation for printed documents, our objective is to generate handwritten lines in order to massively produce data for a given style or under-resourced character class. Our approach, based on a variant of ScrabbleGAN, demonstrates the feasibility for various scripts, either in the presence of a high number and variety of abbreviations (Latin) and spellings or letter forms (Old French), in a situation of data scarcity (Armenian), or in the instance of a very cursive script (Arabic Maghribi). We then study the impact of synthetic line generation on HTR, by evaluating the gain for out-of-domain documents and under-resourced classes.
Ana Salgado, Rute Costa, Sara Carvalho, Anas Fahad Khan, Bruno Almeida, Margarida Ramos, Raquel Silva, Mohamed Khemakhem, Laurent Romary and Toma Tasovac. 2023. Domain labelling in the Morais dictionary: bringing structure to unstructured lexicographic data. In 24th Biennial Dictionary Society of North America Conference (DSNA). Boulder, United States.

This article provides a detailed analysis on the use of domain labels, i.e., special markers identifying a specialised field of knowledge, in successive editions of the Morais dictionary. Morais is a historical Portuguese language dictionary, commonly known by and disseminated under the name of António de Morais Silva. This monolingual dictionary has relevance for the Portuguese lexicographic tradition as it inaugurates modern Portuguese lexicography and serves as a model for all subsequent lexicographic production throughout the 19th and 20th centuries. The domain labels were retrieved from the abbreviation lists of its various editions. This work is part of an ongoing Portuguese national linguistic project. It has two goals: 1) to encode the first three editions of the Morais dictionary to make them available online (as well as publishing them as lexical resources using two different standards for structured lexicographic datasets) and 2) to provide a description of the lexicographic components of these editions following a rigorous linguistic treatment. This project is not merely of a lexicographic nature, but it also explores the convergence between lexicography and other research domains, such as terminology, ontologies, linked data, and digital humanities. This article analyzes the domain labelling system in Morais from an evolutionary and diachronic perspective, in line with previous works that highlight the theoretical assumptions and methodological aspects of the lexicographical tradition around domain labelling. To organize lexicographic content, it is helpful to establish a hierarchical structure in general language dictionaries to systematize the included terminological information. Each table of abbreviations has two distinct columns: one with the abbreviation and the other with the complete domain designations. Given the importance of domain labels, we conducted a survey of all domain labels found.
We identify and demonstrate the previous and newly added domains. After reviewing the flat domain list, we evaluated whether there was a discernible knowledge organizational approach that identified possible generic domains and subdomains. In the organization of domains, we propose three possible levels: superdomain, domain, and subdomain. The superdomain corresponds to the broadest taxonomic grouping followed by a domain, whereas the subdomain is part of a broader domain. To facilitate the analysis and to focus on interoperability issues, we generated a metalabel, a tag that identifies the English equivalent of the corresponding domain. The lists of domains included in general dictionaries’ outside matter follow alphabetical ordering, without any concern for the relationships that can be established between those types of labels. This article describes both onomasiological and semasiological approaches to treating specialized lexicographic content. Following terminological principles and an onomasiological approach, we organize and conceptualize specialized knowledge using structured data formats, such as Text Encoding Initiative, also considering future alignments between different lexicographic resources. The project will contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards.

Tech reports

Yannick Parmentier, Sylvain Pogodalla, Rachel Bawden, Matthieu Labeau and Iris Eshkol-Taravella. 2023. Procédure de diffusion des publications de l'ATALA sur les archives ouvertes. Technical report.

Other

Alix Chagué and Thibault Clérice. 2023. Deploying eScriptorium online: notes on CREMMA's server specifications.

Alix Chagué and Thibault Clérice. 2023. 017 - Deploying eScriptorium online: notes on CREMMA's server specifications.

Laurent Romary. 2023. Monitoring an APC policy - lessons learned and perspective after 7 years.

As part of its open science policy, articulated around a deposit mandate on the French publication repository HAL, Inria decided several years ago to provide internal supervision and support for article processing charges (APC). These charges, which for publishers provide a way of covering publication costs, are now part of an ethical debate surrounding open access. We introduced a policy for covering APCs based upon a central budget and forbidding the payment of APCs for hybrid venues. Each request for funding for a publication through APCs is analysed, focusing on raising awareness, providing support and making recommendations, targeting so-called 'ethical' journals. We will present the results of this policy over a period of several years and outline some of the further directions we want to follow in the future.

Preprints

Paul-Ambroise Duquenne, Kevin Heffernan, Alexandre Mourachko, Benoît Sagot and Holger Schwenk. 2023. SONAR EXPRESSIVE: Zero-shot Expressive Speech-to-Speech Translation. Preprint.

Massively multilingual and multimodal sentence representations like SONAR are usually trained to capture only the meaning of the encoded text or speech. We complement this semantic embedding by a generic speech characteristic embedding which captures the expressive properties of a speech signal. We describe an iterative training procedure which aims to disentangle the semantics and expressive speech properties, and which does not need labeled data. We show the effectiveness of our method on the FLEURS and mEXPRESSO benchmark test sets using multiple metrics which aim to measure the preservation of the meaning and prosody for zero-shot speech-to-speech translation from five languages into English.
Beatrice Biancardi, Mathieu Chollet and Chloé Clavel. 2023. Introducing the 3MT_French Dataset to Investigate the Timing of Public Speaking Judgements. Preprint.

In most public speaking datasets, judgements are given after watching the entire performance, or on thin slices randomly selected from the presentations, without focusing on the temporal location of these slices. This does not make it possible to investigate how people's judgements develop over time during presentations. This contrasts with primacy and recency theories, which suggest that some moments of the speech could be more salient than others and contribute disproportionately to the perception of the speaker's performance. To provide novel insights on this phenomenon, we present the 3MT_French dataset. It contains a set of public speaking annotations collected on a crowd-sourcing platform through a novel annotation scheme and protocol. Global evaluation, persuasiveness, perceived self-confidence of the speaker and audience engagement were annotated on different time windows (i.e., the beginning, middle or end of the presentation, or the full video). This new resource will be useful to researchers working on public speaking assessment and training. It will make it possible to fine-tune the analysis of presentations under a novel perspective relying on socio-cognitive theories rarely studied before in this context, such as first impressions and primacy and recency theories. An exploratory correlation analysis on the annotations provided in the dataset suggests that the early moments of a presentation have a stronger impact on the judgements.
Alix Chagué and Thibault Clérice. 2023. Données ouvertes, données propres, et autres vies : Testaments de Poilus et CREMMA. Preprint.

Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot and Rachel Bawden. 2023. A Simple Method for Unsupervised Bilingual Lexicon Induction for Data-Imbalanced, Closely Related Language Pairs. Preprint.

Existing approaches for unsupervised bilingual lexicon induction (BLI) often depend on good quality static or contextual embeddings trained on large monolingual corpora for both languages. In reality, however, unsupervised BLI is most likely to be useful for dialects and languages that do not have abundant amounts of monolingual data. We introduce a simple and fast method for unsupervised BLI for low-resource languages with a related mid-to-high resource language, only requiring inference on the higher-resource language monolingual BERT. We work with two low-resource languages ($<5M$ monolingual tokens), Bhojpuri and Magahi, of the severely under-researched Indic dialect continuum, showing that state-of-the-art methods in the literature show near-zero performance in these settings, and that our simpler method gives much better results. We repeat our experiments on Marathi and Nepali, two higher-resource Indic languages, to compare approach performances by resource range. We release automatically created bilingual lexicons for the first time for five languages of the Indic dialect continuum.
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2023. Headless Language Models: Learning without Predicting with Contrastive Weight Tying. Preprint.

Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
Paul-Ambroise Duquenne, Holger Schwenk and Benoît Sagot. 2023. SONAR: Sentence-Level Multimodal and Language-Agnostic Representations. Preprint.

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2023. Is Anisotropy Inherent to Transformers? Preprint.

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.
Alix Chagué and Hippolyte Souvay. 2023. Image Acquisition and Layout Analysis. Preprint.

Presentation of key information and processes to work with images in the context of automatic text recognition pipelines and in particular for the detection of the layout, using the eScriptorium application as example.
Floriane Chiffoleau. 2023. TEI Publisher, a platform for sustainable digital editions. Preprint.

Alix Chagué and Floriane Chiffoleau. 2023. What can you do next? Choice of output and reuse of your transcription. Preprint.

Alix Chagué and Floriane Chiffoleau. 2023. ATR: What can eScriptorium do for you? Preprint.

C. Annemieke Romein, Tobias Hodel, Femke Gordijn, Joris Zundert, Alix Chagué, Milan Van Lange, Helle Strandgaard Jensen, Andy Stauder, Jake Purcell, Melissa Terras, Pauline van Den Heuvel, Carlijn Keijzer, Achim Rabus, Chantal Sitaram, Aakriti Bhatia, Katrien Depuydt, Mary Aderonke Afolabi-Adeolu, Anastasiia Anikina, Elisa Bastianello, Lukas Vincent Benzinger, Arno Bosse, David Brown, Ash Charlton, André Nilsson Dannevig, Klaas Van Gelder, Sabine C.P.J. Go, Marcus J.C. Goh, Silvia Gstrein, Sewa Hasan, Stefan von Der Heide, Maximilian Hindermann, Dorothee Huff, Ineke Huysman, Ali Idris, Liesbeth Keijzer, Simon Kemper, Sanne Koenders, Erika Kuijpers, Lisette Rønsig Larsen, Sven Lepa, Tommy Link, Annelies Van Nispen, Joe Nockels, Laura Noort, Joost Johannes Oosterhuis, Vivien Popken, María Estrella Puertollano, Joosep Puusaag, Ahmed Sheta, Lex Stoop, Ebba Strutzenbladh, Nicoline van Der Sijs, Jan Paul van Der Spek, Barry Benaissa Trouw, Geertrui van Synghel, Vladimir Vučković, Heleen Wilbrink, Sonia Weiss, David Joseph Wrisley and Riet Zweistra. 2023. Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done. Preprint.

This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and, subsequently, inform others through HTR-United. Furthermore, we want to suggest appropriate citation methods for ATR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of machine learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.
Tu Anh Nguyen, Maureen De Seyssel, Robin Algayres, Patricia Rozé, Ewan Dunbar and Emmanuel Dupoux. 2023. Are word boundaries useful for unsupervised language learning? Preprint.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina Mcmillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco de Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Hajihosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael Mckenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel de Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-Aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada and Thomas Wolf. 2023. 
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Preprint.

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

2022

PhD theses and Habiliations

Benjamin Muller. 2022. How Can We Make Language Models Better at Handling the Diversity and Variability of Natural Languages ? PhD thesis. Sorbonne Université.

Deep Learning for NLP has led to impressive empirical progress in recent years. In essence, this progress is based on better contextualized representations that can be easily used for a wide variety of tasks. However, these models usually require substantial computing power and large amounts of raw textual data. This makes language’s inherent diversity and variability a vivid challenge in NLP. We focus on the following: How can we make language models better at handling the variability and diversity of natural languages? First, we explore the generalizability of language models by building and analyzing one of the first large-scale replications of a BERT model for a non-English language. Our results raise the question of using these language models on highly-variable domains such as those found online. Focusing on lexical normalization, we show that this task can be approached with BERT-like models. However, we show that it only partially helps downstream performance. In consequence, we focus on adaptation techniques using what we refer to as representation transfer and explore challenging settings such as the zero-shot setting and low-resource languages. We show that multilingual language models can be adapted and used efficiently with low-resource languages, even with the ones unseen during pretraining, and that the script is a critical component in this adaptation.
Clémentine Fourrier. 2022. Neural Approaches to Historical Word Reconstruction. PhD thesis. École Pratique des Hautes Études (PSL).

In historical linguistics, cognates are words that descend in direct line from a common ancestor, called their proto-form, and therefore are representative of their respective languages evolutions through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manners of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek to methodically study the applicability of machine translation inspired neural networks to historical word prediction, relying on the surface similarity of both tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allow us to vary task complexity and data size in a controlled environment, therefore identifying if and under which conditions neural networks were applicable. We then extend our work to real datasets (after having updated an etymological database to gather a correct amount of data), study the transferability of our conclusions to real data, then the applicability of a number of data augmentation techniques to the task, to try to mitigate low-resource situations. We finally investigate in more detail our best models, multilingual neural networks. We first confirm that, on the surface, they seem to capture language relatedness information and phonetic similarity, confirming prior work. We then discover, by probing them, that the information they store is actually more complex: our multilingual models actually encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in the previous works.
Pedro Ortiz Suarez. 2022. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. PhD thesis. Sorbonne Université.

In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data that is needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of tasks in NLP for contemporary, medieval and early modern French as well as for six other contemporary languages. Furthermore, we are able to determine, not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but we also show that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.

Journal articles

Manuela Sanguinetti, Cristina Bosco, Lauren Cassidy, Özlem Çetinoğlu, Alessandra Teresa Cignarella, Teresa Lynn, Ines Rehbein, Josef Ruppenhofer, Djamé Seddah and Amir Zeldes. 2022. Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations. Language Resources and Evaluation 57 pages 493–544. Springer Verlag.

This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.
Alix Chagué. 2022. eScriptorium : une application libre pour la transcription automatique des manuscrits. Arabesques page 25. Agence bibliographique de l'enseignement supérieur (ABES).

Alix Chagué and Laurent Romary. 2022. L'intelligence artificielle, une ouverture du champ des possibles. Arabesques pages 4–5. Agence bibliographique de l'enseignement supérieur (ABES).

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot and Emmanuel Dupoux. 2022. DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics 10 pages 1051–1065. The MIT Press.

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Tu Anh Nguyen, Benoit Sagot and Emmanuel Dupoux. 2022. Are Discrete Units Necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing 16 pages 1415–1423.

Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1-Speech Only).
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl and Alexandra Birch. 2022. Survey of Low-Resource Machine Translation. Computational Linguistics 48 pages 673–732. The MIT Press.

We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal and Mofetoluwa Adeyemi. 2022. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10 pages 50–72. The MIT Press.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Jack Bowers, Axel Herold, Laurent Romary and Toma Tasovac. 2022. TEI Lex-0 Etym–towards terse recommendations for the encoding of etymological information. Journal of the Text Encoding Initiative. TEI Consortium.

The present paper describes the etymological component of the TEI Lex-0 initiative which aims at defining a terser subset of the TEI guidelines for the representation of etymological features in dictionary entries. Going beyond the basic provision of etymological mechanisms in the TEI guidelines, TEI Lex-0 Etym proposes a systematic representation of etymological and cognate descriptions by means of embedded constructs based on the <etym> (for etymologies) and <cit> (for etymons and cognates) elements. In particular, given that all the potential contents of etymons are highly analogous to those of dictionary entries in general, the contents presented herein heavily re-use many of the corresponding features and constraints introduced in other components of the TEI Lex-0 to the encoding of etymologies and etymons. The TEI Lex-0 Etym model is also closely aligned to ISO 24613-3 on modelling etymological data and the corresponding TEI serialisation available in ISO 24613-4.

Conference proceedings

Anna Chepaikina, Robert Bossy, Catherine Roussey and Stephan Bernard. 2022. Thesaurus Enrichment via Coordination Extraction. In 16th International Conference on Metadata and Semantics Research (MTSR 2022). 1789 pages 191–202. London, United Kingdom.

We advance a method of thesaurus enrichment, based on the extraction of coordinations in a domain-related corpus. Our hypothesis is that there is a semantic homogeneity between the conjuncts located in a coordination. We conducted an experiment that allowed us to evaluate the effectiveness of our method. This experiment aims to enrich the concept hierarchy of a French agricultural thesaurus named French Crop Usage (FCU), thanks to the texts of the Plant Health Bulletins (PHB). The FCU thesaurus is published on the Web using the SKOS model.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, Maja Popović and Mariya Shmatova. 2022. Findings of the 2022 Conference on Machine Translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 1–45. Abu Dhabi, United Arab Emirates.

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment (DA) and a combination of DA and scalar quality metric (DA+SQM).
Mariana Neves, Antonio Jimeno Yepes, Amy Siu, Roland Roller, Philippe Thomas, Maika Vicente Navarro, Lana Yeganova, Dina Wiemann, Giorgio Maria Di Nunzio, Federica Vezzani, Christel Gérardin, Rachel Bawden, Darryl Johan Estrada, Salvador Lima-López, Eulàlia Farré-Maduell, Martin Krallinger, Cristian Grozea and Aurélie Névéol. 2022. Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports. In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 694–723. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

In the seventh edition of the WMT Biomedical Task, we addressed a total of seven language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian. This year’s test sets covered three types of biomedical text genre. In addition to scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical case translations was given special attention by involving clinicians in the preparation of reference translations and manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, we had the participation of five teams.
Omer Goldman, Francesco Tinner, Hila Gonen, Benjamin Muller, Victoria Basmov, Shadrack Kirimi, Lydia Nishimwe, Benoît Sagot, Djamé Seddah, Reut Tsarfaty and Duygu Ataman. 2022. The MRL 2022 Shared Task on Multilingual Clause-level Morphology. In 1st Shared Task on Multilingual Clause-level Morphology. Abu Dhabi, United Arab Emirates.

The 2022 Multilingual Representation Learning (MRL) Shared Task was dedicated to clause-level morphology. As the first ever benchmark that defines and evaluates morphology outside its traditional lexical boundaries, the shared task on multilingual clause-level morphology sets the scene for competition across different approaches to morphological modeling, with 3 clause-level sub-tasks: morphological inflection, reinflection and analysis, where systems are required to generate, manipulate or analyze simple sentences centered around a single content lexeme and a set of morphological features characterizing its syntactic clause. This year's tasks covered eight typologically distinct languages: English, French, German, Hebrew, Russian, Spanish, Swahili and Turkish. The tasks received submissions of four systems from three teams, which were compared to two baselines implementing prominent multilingual learning methods. The results show that modern NLP models are effective in solving morphological tasks even at the clause level. However, there is still room for improvement, especially in the task of morphological analysis.
Nathan Godey, Roman Castagné, Éric de la Clergerie and Benoît Sagot. 2022. MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling. In Findings of the Association for Computational Linguistics: EMNLP 2022. pages 2859–2870. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pretrained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
Syrielle Montariol, Arij Riabi and Djamé Seddah. 2022. Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. pages 347–363. Association for Computational Linguistics. Online.

Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks' positive impact on bridging the hate speech linguistic and cultural gap between languages.
Syrielle Montariol, Étienne Simon, Arij Riabi and Djamé Seddah. 2022. Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance. In Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. pages 55–65. Association for Computational Linguistics. Dublin, Ireland.

We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT’22 workshop. The task aims at classifying entities in memes into classes such as “hero” and “villain”. We use several pre-trained multi-modal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
Jesujoba Alabi, Lydia Nishimwe, Benjamin Muller, Camille Rey, Benoît Sagot and Rachel Bawden. 2022. Inria-ALMAnaCH at WMT 2022: Does Transcription Help Cross-Script Machine Translation? In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 233–243. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates (Hybrid).

This paper describes the Inria ALMAnaCH team submission to the WMT 2022 general translation shared task. Participating in the language directions {cs,ru,uk}→en and cs↔uk, we experiment with the use of a dedicated Latin-script transcription convention aimed at representing all Slavic languages involved in a way that maximises character- and word-level correspondences between them as well as with the English language. Our hypothesis was that bringing the source and target language closer could have a positive impact on machine translation results. We provide multiple comparisons, including bilingual and multilingual baselines, with and without transcription. Initial results indicate that the transcription strategy was not successful, resulting in lower results than baselines. We nevertheless submitted our multilingual, transcribed models as our primary systems, and in this paper provide some indications as to why we got these negative results.
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot and Holger Schwenk. 2022. T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pages 5794–5806. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need of cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we outperform the state of the art for zero-shot speech translation on MuST-C. We also introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
Louis Martin, Angela Fan, Éric Villemonte de la Clergerie, Antoine Bordes and Benoît Sagot. 2022. MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 1651–1664. European Language Resources Association. Marseille, France.

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We show that this paraphrase data can be mined in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.
Robin Algayres, Adel Nabli, Benoît Sagot and Emmanuel Dupoux. 2022. Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association. pages 2123–2127. Incheon, South Korea.

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 5763–5776. Association for Computational Linguistics. Seattle, United States.

We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.
Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary and Benoit Crabbé. 2022. BERTrade: Using Contextual Embeddings to Parse Old French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 1104–1113. European Language Resources Association. Marseille, France.

The successes of contextual word embeddings learned by training large-scale language models, while remarkable, have mostly occurred for languages where significant amounts of raw texts are available and where annotated data in downstream tasks have a relatively regular spelling. Conversely, it is not yet completely clear if these models are also well suited for lesser-resourced and more irregular languages. We study the case of Old French, which is in the interesting position of having a relatively limited amount of available raw text, but enough annotated resources to assess the relevance of contextual word embedding models for downstream NLP tasks. In particular, we use POS-tagging and dependency parsing to evaluate the quality of such models in a large array of configurations, including models trained from scratch from small amounts of raw text and models pre-trained on other languages but fine-tuned on Medieval French data.
Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette and Benoît Sagot. 2022. Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 154–165. ATALA. Avignon, France.

Despite their undoubted quality, the resources and tools available for the analysis of Ancien Régime French are no longer able to meet the challenges of research in linguistics and literature for this period. After having precisely defined the chronological framework, we present the corpora made available and the results obtained with them for several NLP tasks, fundamental to the study of language and literature.
Arij Riabi, Syrielle Montariol and Djamé Seddah. 2022. Tâches Auxiliaires Multilingues pour le Transfert de Modèles de Détection de Discours Haineux (Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 413–423. ATALA. Avignon, France.

Hate speech detection is a difficult task, as it requires in-depth cultural and contextual knowledge; the knowledge required varies, among other things, with the language of the speaker or the target of the content. However, annotated data for specific domains and languages are often missing or limited. This is where data in other languages can be exploited; but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of learning multilingual auxiliary tasks (sentiment analysis, named entity recognition, and tasks relying on morpho-syntactic information) on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2022. Quand être absent de mBERT n'est que le commencement : Gérer de nouvelles langues à l'aide de modèles de langues multilingues (When Being Unseen from mBERT is just the Beginning : Handling New Languages With Multilingual Language Models). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 450–451. ATALA. Avignon, France.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Simon Gabay, Rachel Bawden, Philippe Gambette, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2022. Le changement linguistique au XVIIe s. : nouvelles approches scriptométriques. In Actes du 8e Congrès Mondial de Linguistique Française. Volume 138, pages 02006.1–14. EDP Sciences. Orléans, France.

Linguistic change in 17th c. France: new scriptometric approaches. The end of the 17th c. remains a blind spot of the research on the spelling system, despite its importance for French at this period, during which a strict norm, still (more or less) in place, was created and imposed. Focusing on a practical rather than a theoretical approach, we propose to lay the foundation for a computational scriptometric study of early modern French and analyse the evolution of the spelling system over the 17th c. To do so, we measure and evaluate the distance between the early modern and the contemporary versions of the language, thanks to two automatic normalisers: one rule-based and another one neural-based.
Thibault Charmet, Inès Cherichi, Matthieu Allain, Urszula Czerwinska, Amaury Fouret, Benoît Sagot and Rachel Bawden. 2022. Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France's Court of Cassation Rulings. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 4754–4766. European Language Resources Association. Marseille, France.

Detecting divergences in the applications of the law (where the same legal text is applied differently by two rulings) is an important task. It is the mission of the French Cour de Cassation. The first step in the detection of divergences is to detect similar cases, which is currently done manually by experts. They rely on summarised versions of the rulings (syntheses and keyword sequences), which are currently produced manually and are not available for all rulings. There is also a high degree of variability in the keyword choices and the level of granularity used. In this article, we therefore aim to provide automatic tools to facilitate the search for similar rulings. We do this by (i) providing automatic keyword sequence generation models, which can be used to improve the coverage of the analysis, and (ii) providing measures of similarity based on the available texts and augmented with predicted keyword sequences. Our experiments show that the predictions improve correlations of automatically obtained similarities against our specially collected human judgments of similarity.
Francesco De Toni, Christopher Akiki, Javier De La Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter and Daniel Van Strien. 2022. Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0. In Proceedings of BigScience Episode #5–Workshop on Challenges & Perspectives in Creating Large Language Models. pages 75–83. Association for Computational Linguistics. virtual+Dublin.

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
Clémentine Fourrier and Syrielle Montariol. 2022. Caveats of Measuring Semantic Change of Cognates and Borrowings using Multilingual Word Embeddings. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change. pages 97–112. Association for Computational Linguistics. Dublin, Ireland.

Cognates and borrowings carry different aspects of etymological evolution. In this work, we study semantic change of such items using multilingual word embeddings, both static and contextualised. We underline caveats identified while building and evaluating these embeddings. We release both said embeddings and a newly-built historical words lexicon, containing typed relations between words of varied Romance languages.
Clémentine Fourrier and Benoît Sagot. 2022. Probing Multilingual Cognate Prediction Models. In Findings of the Association for Computational Linguistics: ACL 2022. pages 3786–3801. Association for Computational Linguistics. Dublin, Ireland.

Character-based neural machine translation models have become the reference models for cognate prediction, a historical linguistics task. So far, all linguistic interpretations about latent information captured by such models have been based on external analysis (accuracy, raw results, errors). In this paper, we investigate what probing can tell us about both models and previous interpretations, and learn that though our models store linguistic and diachronic information, they do not achieve it in previously assumed ways.
Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette and Benoît Sagot. 2022. From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 3367–3374. European Language Resources Association. Marseille, France.

Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 3354–3366. European Language Resources Association. Marseille, France.

Spelling normalisation is a useful step in the study and analysis of historical language texts, whether it is manual analysis by experts or automatic analysis using downstream natural language processing (NLP) tools. Not only does it help to homogenise the variable spelling that often exists in historical texts, but it also facilitates the use of off-the-shelf contemporary NLP tools, if contemporary spelling conventions are used for normalisation. We present FREEMnorm, a new benchmark for the normalisation of Early Modern French (from the 17th century) into contemporary French and provide a thorough comparison of three different normalisation methods: ABA, an alignment-based approach, and MT approaches (both statistical and neural), including extensive parameter searching, which is often missing in the normalisation literature.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael Mckenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf and Alexander M. Rush. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In Proceedings of the Tenth International Conference on Learning Representations. Online.

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models’ pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pre-trained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero, and all prompts are available at https://github.com/bigscience-workshop/promptsource.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 4344–4355. European Language Resources Association. Marseille, France.

The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant that extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models as well as hopefully other applications in Natural Language Processing and Digital Humanities.

Communications

Anas Fahad Khan, Ana Salgado, Rute Costa, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2022. Interlinking lexicographic data in the MORDigital project. In LLODREAM2022 - LLOD approaches for language data research and management. Mykolas Romeris University. Vilnius, Lithuania.

Rute Costa, Ana Salgado, Margarida Ramos, Fahad Khan, Sara Carvalho, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2022. Integrating Terminological and Ontological Principles into a Lexicographic Resource. In 1st International Conference on «Multilingual digital terminology today. Design, representation formats and management systems»; Vol-3161 CEUR-WS.org. Padova, Italy.

In this paper we present the research taking place at NOVA CLUNL, where an international team is working on the funded project MORDigital. MORDigital's goal is to encode the selected editions of the Diccionario de Lingua Portugueza by António de Morais Silva (MOR), first published in 1789.
Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard and Marcin Detyniecki. 2022. On the Granularity of Explanations in Model Agnostic NLP Interpretability. In XKDD 2022 - ECML PKDD 2022 International Workshop on eXplainable Knowledge Discovery in Data Mining. Grenoble, France.

Current methods for Black-Box NLP interpretability, like LIME or SHAP, are based on altering the text to interpret by removing words and modeling the Black-Box response. In this paper, we outline limitations of this approach when using complex BERT-based classifiers: the word-based sampling produces texts that are out-of-distribution for the classifier and further gives rise to a high-dimensional search space, which can't be sufficiently explored when time or computation power is limited. Both of these challenges can be addressed by using segments as elementary building blocks for NLP interpretability. As an illustration, we show that the simple choice of sentences greatly improves on both of these challenges. As a consequence, the resulting explainer attains much better fidelity on a benchmark classification task.
Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Javier Ortiz Suárez, Kelly Christensen, Simon Gabay, Ariane Pinche and Jean-Baptiste Camps. 2022. Gallic(orpor)a : Extraction, annotation et diffusion de l'information textuelle et visuelle en diachronie longue. In DataLab de la BnF : Restitution des travaux 2022. Paris, France.

Presentation of the results of the BnF DataLab project Gallic(orpor)a.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep : Un projet de recherche et développement pour la transcription automatique de répertoires de notaires. In La reconnaissance des écritures manuscrites et ses usages dans les archives. Pierrefitte-sur-Seine, France.

Chadi Helwe, Simon Coumes, Chloé Clavel and Fabian Suchanek. 2022. TINA: Textual Inference with Negation Augmentation. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Abu Dhabi, United Arab Emirates.

Transformer-based language models achieve state-of-the-art results on several natural language processing tasks. One of these is textual entailment, i.e., the task of determining whether a premise logically entails a hypothesis. However, the models perform poorly on this task when the examples contain negations. In this paper, we propose a new definition of textual entailment that also captures negation. This allows us to develop TINA (Textual Inference with Negation Augmentation), a principled technique for negated data augmentation that can be combined with the unlikelihood loss function. Our experiments with different transformer-based models show that our method can significantly improve the performance of the models on textual entailment datasets with negation, without sacrificing performance on datasets without negation.
Chadi Helwe, Chloé Clavel and Fabian Suchanek. 2022. LogiTorch: A PyTorch-based library for logical reasoning on natural language. In The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Abu Dhabi, United Arab Emirates.

Logical reasoning on natural language is one of the most challenging tasks for deep learning models. There has been an increasing interest in developing new benchmarks to evaluate the reasoning capabilities of language models such as BERT. In parallel, new models based on transformers have emerged to achieve ever better performance on these datasets. However, there is currently no library for logical reasoning that includes such benchmarks and models. This paper introduces LogiTorch, a PyTorch-based library that includes different logical reasoning benchmarks, different models, as well as utility functions such as co-reference resolution. This makes it easy to directly use the preprocessed datasets, to run the models, or to finetune them with different hyperparameters. LogiTorch is open source and can be found on GitHub.
Simon Gabay, Rachel Bawden, Benoît Sagot and Philippe Gambette. 2022. Vers l'étude linguistique sur données artificielles. In Variation(s) en français. Nancy, France.

For decades now, several disciplines have been accustomed to working on so-called "synthetic" rather than "real" data, that is, data generated by a computational simulation reflecting the real world. Our presentation proposes to experiment with this method in diachronic linguistics by generating pseudo-historical corpora. We will therefore discuss this approach, from both a methodological and a technical point of view, taking as a case study the spelling variation of French and its evolution during the Ancien Régime.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep (2018-2021) : Projet de lecture automatique de répertoires de notaires. In Segmenter et annoter les images : déconstruire pour reconstruire. Paris, France.

You Zuo, Houda Mouzoun, Samir Ghamri Doudane, Kim Gerdes and Benoît Sagot. 2022. Patent Classification using Extreme Multi-label Learning: A Case Study of French Patents. In SIGIR 2022 - PatentSemTech workshop - 3rd Workshop on Patent Text Mining and Semantic Technologies. Madrid, Spain.

Most previous patent classification methods have treated the task as a general text classification task, and others have tried to implement XML (extreme multi-label learning) methods designed to handle vast numbers of classes. However, they focus only on the IPC subclass level, which has fewer than 700 labels and is far from "extreme." This paper presents a French Patents corpus INPI-CLS extracted from the INPI internal database. It contains all parts of patent texts (title, abstract, claims, description) published from 2002 to 2021, with IPC labels at all levels. We test different XML methods and other classification models at the subclass and group levels of the INPI-CLS dataset with about 600 and 7k labels, respectively, demonstrating the XML approach's validity to patent classification.
You Zuo, Yixuan Li, Alma Parias García and Kim Gerdes. 2022. Technological taxonomies for hypernym and hyponym retrieval in patent texts. In ToTh 2022 - Terminology & Ontology: Theories and applications. Chambéry, France.

This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological terms for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
Laurent Romary and Hugo Scheithauer. 2022. DataCatalogue : enjeux et réalisations. In Un outil numérique pour interroger les catalogues de vente : le projet DataCatalogue. Paris, France.

Aurélia Rostaing and Hugo Scheithauer. 2022. Enrichir le patrimoine écrit archivistique grâce aux technologies numériques : Ingénierie du projet LectAuRep (Lecture automatique de répertoires). In DHNord 2022 - Travailler en Humanités Numériques : collaborations, complémentarités et tensions. Online, France.

Floriane Chiffoleau and Hugo Scheithauer. 2022. From a collection of documents to a published edition : how to use an end-to-end publication pipeline. In TEI 2022 - Text Encoding Initiative 2022 Conference. Newcastle, United Kingdom.

The goal of the workshop is to demonstrate how a corpus could be processed for publication with TEI Publisher. The workshop participants will learn to experiment with a ready-to-use solution that provides an easy and quick publication of a corpus. They will also get tips and shortcuts to help speed up the creation of a digital edition. Moreover, by the end of the session, this workshop will provide the participants with a visualization of their respective corpus, with transformed text and original image side by side, showing what can be achieved while working with TEI in the context of an end-to-end publication pipeline.
Ariane Pinche, Kelly Christensen and Simon Gabay. 2022. Between automatic and manual encoding. In TEI 2022 conference : Text as data. Newcastle, United Kingdom.

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images’ content on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
Alix Chagué, Hugo Scheithauer, Lucas Terriel, Floriane Chiffoleau and Yves Tadjo-Takianpi. 2022. Take a sip of TEI and relax: a proposition for an end-to-end workflow to enrich and publish data created with automatic text recognition. In Digital Humanities 2022 : Responding to Asian Diversity. Tokyo, Japan.

Alix Chagué and Thibault Clérice. 2022. Sharing HTR datasets with standardized metadata: the HTR-United initiative. In Documents anciens et reconnaissance automatique des écritures manuscrites. Paris, France.

Hugo Scheithauer. 2022. LectAuRep : Données d'archives en français des XIXe et XXe siècles. In Transkribus / eScriptorium : Transcrire, annoter et éditer numériquement des documents d'archives. Paris, France.

Alix Chagué. 2022. Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains. In 89e Congrès de l'Acfas, Section 310 - Le numérique dans les sciences humaines : édition et visualisation. Montréal, Canada.

Five-minute summary of the doctoral research project entitled "Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains", begun in November 2021 and awarded the GREN's 2022 Bourse d'Excellence. The presentation set the project in the context of the current availability of consumer-grade software for the automatic transcription of handwritten documents and the lack of conceptual and methodological resources needed to take full advantage of it. One of the main difficulties raised was the convergence of practices towards interoperable models and data.
Florence Clavaud, Laurent Romary, Pauline Charbonnier, Lucas Terriel, Gaetano Piraino and Vincent Verdese. 2022. NER4Archives (named entity recognition for archives) : Conception et réalisation d'un outil de détection, de classification et de résolution des entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Atelier Culture-INRIA. Pierrefitte-sur-Seine, France.

Hugo Scheithauer, Laurent Romary, Frédérique Duyrat and Federico Nurra. 2022. DataCatalogue : présentation du projet. In Atelier Culture-Inria. Pierrefitte-sur-Seine, France.

Presentation on the DataCatalogue project, jointly led by Inria, the National Library of France (BnF) and the National Institute for Art History (INHA), at the "journée Atelier culture-Inria," held at the Archives nationales on 22 March 2022.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. In CtrlGen: Controllable Generative Modeling in Language and Vision. virtual, France.

Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects, ...) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles, and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on disturbing individual latent variables for the decoder. Our experiments on raw English text from the SNLI dataset show that i) disentanglement of syntactic roles can be induced without supervision, ii) ADVAE separates more syntactic roles than classical sequence VAEs, iii) realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.

Book chapters

Alix Chagué, Victoria Le Fourner, Manuela Martini and Eric Villemonte de La Clergerie. 2022. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? In La fabrique numérique des corpus en sciences humaines et sociales. Presses Universitaires du Septentrion.

Victoria Le Fourner, Alix Chagué, Manuela Martini and Anaïs Albert. 2022. Structurer automatiquement un corpus homogène issu de la reconnaissance d'écriture manuscrite : les jugements du Conseil des prud'hommes des tissus parisiens. In La fabrique numérique des corpus en sciences humaines et sociales. page https://www.septentrion.com/livre/?GCOI=27574100990460. Presses Universitaires du Septentrion.

Jack Bowers. 2022. Pathways and patterns of metaphor and metonymy in Mixtepec-Mixtec body-part terms. In The Grammar of Body-Part Expressions: A view from the Americas. pages 91–135. Roberto Zariquiey.

Tech reports

Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Ortiz Suarez, Kelly Christensen, Simon Gabay, Ariane Pinche and Jean-Baptiste Camps. 2022. Gallic(orpor)a: Extraction, annotation et diffusion de l'information textuelle et visuelle en diachronie longue. Technical report.

Report on the work of the BnF DataLab project Gallic(orpor)a.

Other

Alix Chagué, Pérez Gilles and Pascal Dubourg Glatigny. 2022. Peraire Ground Truth.

First release of the dataset.
Alix Chagué. 2022. Intelligence Artificielle et intelligence collective : des nouveaux eldorados pour rendre les textes patrimoniaux plus accessibles ?

Alix Chagué. 2022. Conditions de la mutualisation : les principes FAIR et HTR-United.

Preprints

Alix Chagué, Thibault Clérice and Laurent Romary. 2022. HTR-United : un écosystème pour une approche mutualisée de la transcription automatique des écritures manuscrites. Preprint.

Handwritten Text Recognition (HTR) is a computer process that aims to obtain digital text equivalent to the content of the image of a physical handwritten document. Based on GitHub, HTR-United invites the community of users to decompartmentalize data sourced from different HTR platforms in order to reduce the costs of producing such data. This solution proposes an operational model that could offer a framework for the construction of data papers for HTR, and even the beginnings of a standardization for this type of publication.
Hugo Scheithauer, Alix Chagué and Laurent Romary. 2022. Which TEI representation for the output of automatic transcriptions and their metadata? An illustrated proposition. Preprint.

The recent and fast development of automatic transcription software is accompanied by a growing heterogeneity of formats to save the output of such a task. TEI P5 can be helpful to simplify workflows and bring in more coherence in digitization pipelines. We present a twofold modelization in TEI which brings together essential information resulting from the transcription phase with the editorial layers. The usefulness of this modelization is illustrated with several examples showing how such an approach can be leveraged at different stages of a digitization pipeline.
Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot and Jackie Chi Kit Cheung. 2022. MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification. Preprint.

In text summarization and simplification, system outputs must be evaluated along multiple dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide range of possible outputs could be of high quality. These properties make the development of an adaptable, reference-less evaluation metric both necessary and challenging. We introduce MaskEval, a reference-less metric for text summarization and simplification that operates by performing masked language modeling (MLM) on the concatenation of the candidate and the source texts. It features an attention-like weighting mechanism to modulate the relative importance of each MLM step, which crucially allows it to be adapted to evaluate different quality dimensions. We demonstrate its effectiveness on English summarization and simplification in terms of correlations with human judgments, and explore transfer scenarios between the two tasks.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2022. Generative Spoken Dialogue Language Modeling: preprint version. Preprint.

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking. Generation samples can be found at: https://speechbot.github.io/dgslm.
Floriane Chiffoleau and Anne Baillot. 2022. Le projet DAHN : une pipeline pour l'édition numérique de documents d'archives. Preprint.

Angelina Mcmillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco de Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien and Yacine Jernite. 2022. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. Preprint.

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot and Samson Tan. 2022. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Preprint.

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications.

2021

PhD theses and Habilitations

Louis Martin. 2021. Automatic sentence simplification using controllable and unsupervised methods. PhD thesis. Sorbonne Université.

In this thesis we study the task of automatic sentence simplification. We first study the different methods used to evaluate simplification models, highlight several shortcomings of current approaches, and propose new contributions. We then propose to train sentence simplification models that can be adapted to the target user, allowing for greater simplification flexibility. Finally, we extend the scope of sentence simplification to several languages, by proposing methods that do not require annotated training data, but that nevertheless achieve very strong performance.

Journal articles

Frank Uiterwaal, Franco Niccolucci, Sheena Bassett, Steven Krauwer, Hella Hollander, Femmy Admiraal, Laurent Romary, George Bruseker, Carlo Meghini, Jennifer Edmond and Mark Hedges. 2021. From disparate disciplines to unity in diversity: How the PARTHENOS project has brought European humanities Research Infrastructures together. International Journal of Humanities and Arts Computing 15 pages 101–116. Edinburgh University Press.

Since the first ESFRI roadmap in 2006, multiple humanities Research Infrastructures (RIs) have been set up all over the European continent, supporting archaeologists (ARIADNE), linguists (CLARIN-ERIC), Holocaust researchers (EHRI), cultural heritage specialists (IPERION-CH) and others. These examples only scratch the surface of the breadth of research communities that have benefited from close cooperation in the European Research Area. While each field developed discipline-specific services over the years, common themes can also be distinguished. All humanities RIs address, in varying degrees, questions around research data management, the use of standards and the desired interoperability of data across disciplinary boundaries. This article sheds light on how cluster project PARTHENOS developed pooled services and shared solutions for its audience of humanities researchers, RI managers and policymakers. In a time where the convergence of existing infrastructure is becoming ever more important – with the construction of a European Open Science Cloud as an audacious, ultimate goal – we hope that our experiences inform future work and provide inspiration on how to exploit synergies in interdisciplinary, transnational, scientific cooperation.
Rachel Bawden. 2021. [Book Review] Understanding Dialogue: Language Use and Social Interaction. Computational Linguistics Massachusetts Institute of Technology Press (MIT Press).

Luca Foppiano, Sae Dieb, Akira Suzuki, Pedro Baptista de Castro, Suguru Iwasaki, Azusa Uzuki, Miren Garbine Esparza Echevarria, Yan Meng, Kensei Terashima, Laurent Romary, Yoshihiko Takano and Masashi Ishii. 2021. SuperMat: Construction of a linked annotated dataset from superconductors-related publications. Science and Technology of Advanced Materials: Methods 1 Taylor & Francis.

A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.
Naomi Truan and Laurent Romary. 2021. Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A Cross-Linguistic Account. Journal of the Text Encoding Initiative TEI Consortium.

This data paper introduces an integrative and comprehensive method for the linguistic annotation of parliamentary discourse. Initially conceived as a documentation for a specific and rather small-scale research project, the annotation scheme takes into account national specificities and is geared to proposing an annotation scheme that is both highly standardised and adaptable to other research contexts. The paper reads as a specific application of the Text Encoding Initiative (TEI) framework applied to a subset of parliamentary debates. This strategy has two main applications: first, to develop a model for the encoding of parliamentary corpora by providing a systematic way of annotating both elements within the text (e.g. turns, incidents, interruptions) and the metadata associated with it (e.g. variables pertaining to the speaker or the speech event); second, to provide a cross-linguistic empirical basis for further annotation projects.

Conference proceedings

Hugh Cayless, Thibault Clérice and Jonathan Robie. 2021. Introducing Citation Structures. In Balisage: The Markup Conference 2021. 26 Washington, United States.

Text Encoding Initiative documents are notoriously heterogeneous in structure, since the Guidelines are intended to permit the encoding of any type of text, from tax receipts written on papyrus to Shakespeare plays or novels. Citation Structures are a new feature in the TEI Guidelines that provide a way for documents to declare their own internal structure along with a way to resolve citations conforming to that structure. This feature will allow systems like the Distributed Text Services (DTS) API, which process heterogeneous TEI documents, to handle tasks like automated table of contents generation, the extraction of structural metadata, and the resolution of citations without prior knowledge of document structure.
José Carlos Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2021. Understanding the Impact of UGC Specificities on Translation Quality. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). pages 189–198. Association for Computational Linguistics. Online.

This work takes a critical look at the evaluation of user-generated content automatic translation, the well-known specificities of which raise many challenges for MT. Our analyses show that measuring the average-case performance using a standard metric on a UGC test set falls far short of giving a reliable image of the UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.
José Carlos Rosales Núñez, Guillaume Wisniewski and Djamé Seddah. 2021. Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). pages 199–211. Association for Computational Linguistics. Online.

This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2021. Challenging the Semi-Supervised VAE Framework for Text Classification. In Proceedings of the Second Workshop on Insights from Negative Results in NLP. pages 136–143. Association for Computational Linguistics. Online and Punta Cana, Dominican Republic.

Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification as we exhibit two sources of overcomplexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup where the result of training is a text classifier. These simplifications are the removal of (i) the Kullback-Leibler divergence from its objective and (ii) the fully unobserved latent variable from its probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplification, experiments show a speed-up of 26%, while keeping equivalent classification scores. The code to reproduce our experiments is public.
Arij Riabi, Benoît Sagot and Djamé Seddah. 2021. Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios? In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). pages 423–436. Association for Computational Linguistics. Online.

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
Lana Yeganova, Dina Wiemann, Mariana Neves, Federica Vezzani, Amy Siu, Inigo Jauregi Unanue, Maite Oronoz, Nancy Mah, Aurélie Névéol, David Martinez, Rachel Bawden, Giorgio Maria Di Nunzio, Roland Roller, Philippe Thomas, Cristian Grozea, Olatz Perez-de-Viñaspre, Maika Vicente Navarro and Antonio Jimeno Yepes. 2021. Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set. In Proceedings of the Sixth Conference on Machine Translation. pages 664–683. Association for Computational Linguistics. Online.

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Lionel Tadonfouet Tadjou, Fabrice Bourge, Tiphaine Marie, Laurent Romary and Éric de la Clergerie. 2021. Building A Corporate Corpus For Threads Constitution. In Proceedings of the Student Research Workshop Associated with RANLP 2021. pages 193–202. INCOMA Ltd. Online.

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion, reviewing the upcoming commitments or deadlines, etc. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora including different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and the compliance to the secrecy of correspondence.
Simon Gabay, Barbara Topalov, Caroline Corbières, Lucie Rondeau Du Noyer, Béatrice Joyeux-Prunel and Laurent Romary. 2021. Automating Artl@s–extracting data from exhibition catalogues. In EADH 2021 - Second International Conference of the European Association for Digital Humanities. Krasnoyarsk, Russia.

Catalogues, which have been published for centuries, are an extremely precious resource for scholars. Using the Artl@s database as an example, where exhibition catalogues are transformed into a georeferenced database, we question the possibility of an (almost) automatic transformation of pdfs into semantically annotated data. To do so, we present and analyse the graphic organisation of exhibition catalogues, before exploring a possible modeling into TEI (involving possible enhancement of the guidelines).
Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2021. Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus. In CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora. Limerick / Virtual, Ireland.

Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
Syrielle Montariol and Alexandre Allauzen. 2021. Transport Optimal pour le Changement Sémantique à partir de Plongements Contextualisés (Optimal Transport for Semantic Change Detection using Contextualised Embeddings). In Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 81–90. ATALA. Lille, France.

Several methods for detecting semantic change using contextualised word embeddings have appeared recently. They enable a fine-grained analysis of changes in word usage by aggregating contextualised embeddings into clusters that reflect the different usages of a word. We propose a new method based on optimal transport. We evaluate it on several annotated corpora, showing a gain in accuracy over other methods using contextualised embeddings, and illustrate it on a corpus of newspaper articles.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2021. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 448–462. Association for Computational Linguistics. Online.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Clémentine Fourrier, Rachel Bawden and Benoît Sagot. 2021. Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pages 847–861. Association for Computational Linguistics. Online.

Cognate prediction is the task of generating, in a given language, the likely cognates of words in a related language, where cognates are words in related languages that have evolved from a common ancestor word. It is a task for which little data exists and which can aid linguists in the discovery of previously undiscovered relations. Previous work has applied machine translation (MT) techniques to this task, based on the tasks' similarities, without, however, studying their numerous differences or optimising architectural choices and hyper-parameters. In this paper, we investigate whether cognate prediction can benefit from insights from low-resource MT. We first compare statistical MT (SMT) and neural MT (NMT) architectures in a bilingual setup. We then study the impact of employing data augmentation techniques commonly seen to give gains in low-resource MT: monolingual pretraining, backtranslation and multilinguality. Our experiments on several Romance languages show that cognate prediction behaves only to a certain extent like a standard low-resource MT task. In particular, MT architectures, both statistical and neural, can be successfully used for the task, but using supplementary monolingual data is not always as beneficial as using additional language data, contrary to what is observed for MT.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pages 2214–2231. Association for Computational Linguistics. Online.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Rute Costa, Ana Salgado, Anas Fahad Khan, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2021. MORDigital: The Advent of a New Lexicographical Portuguese Project. In eLex 2021 - Seventh biennial conference on electronic lexicography. Brno, Czech Republic.

MORDigital is a newly funded Portuguese lexicographical project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the various investigation domains of the project description. The output of the project will be made available via a dedicated platform.
Antoine Gérard, Benoît Sagot and Emilie Pons. 2021. Le Traitement Automatique des Langues au service du vin. In Dataquitaine 2021 - IA, Recherche Opérationnelle & Data Science. Bordeaux / Virtual, France.

In this presentation, we describe a fruitful collaboration between the research institute Inria and a Bordeaux-based startup, Winespace. We focus on the semantic analysis of wine-tasting notes, with the aim of recommending wines with similar characteristics.
Farid Arthaud, Rachel Bawden and Alexandra Birch. 2021. Few-shot learning through contextual data augmentation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pages 1049–1062. Association for Computational Linguistics. Online.

Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in human-submitted translations, and (ii) corresponding evaluation metrics to compare our approaches. We extend a data augmentation approach using a pre-trained language model to create training examples with similar contexts for novel words. We compare different fine-tuning and data augmentation approaches and show that adaptation on the scale of one to five examples is possible. Combining data augmentation with randomly selected training sentences leads to the highest BLEU score and accuracy improvements. Impressively, with only 1 to 5 examples, our model reports better accuracy scores than a reference system trained with on average 313 parallel examples.
Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah and Jacopo Staiano. 2021. Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pages 7016–7030. Association for Computational Linguistics. Online and Punta Cana, Dominican Republic.

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms the baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

Communications

Alix Chagué. 2021. CREMMA : Une infrastructure mutualisée pour la reconnaissance d'écritures manuscrites et la patrimonialisation numérique. In Sciences du patrimoine - sciences du texte. Confrontation des méthodes. Paris, France.

Hugo Scheithauer, Alix Chagué, Aurélia Rostaing, Lucas Terriel, Laurent Romary, Marie-Françoise Limon-Bonnet, Benjamin Davy, Gaetano Piraino, Franck Beltrami, Danis Habib, Nathalie Denis and Marc Durand. 2021. Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées, AI4LAM. Paris, France.

For this workshop, participants will take part in the fine-tuning of a handwritten text recognition (HTR) model with eScriptorium. Fine-tuning a model means retraining an initial generic model with a new dataset in order to specialize it in a particular domain.
Hugo Scheithauer, Alix Chagué and Laurent Romary. 2021. From eScriptorium to TEI Publisher. In Brace your digital scholarly edition! Berlin, Germany.

Lucas Terriel. 2021. Atelier : Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. Évaluer son modèle HTR/OCR avec KaMI (Kraken as Model Inspector). In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.

Pauline Charbonnier, Lucas Terriel, Florence Clavaud, Laurent Romary, Gaetano Piraino and Vincent Verdese. 2021. NER4Archives (named entity recognition for archives) : méthodes et outils semi-automatiques pour reconnaître les entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.

Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP : Lecture Automatique des Répertoires de Notaires Parisiens. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.

Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP: Paris Notary Record Books Automated Reading. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.

Floriane Chiffoleau, Anne Baillot and Manon Ovide. 2021. A TEI-based publication pipeline for historical egodocuments - the DAHN project. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.

Alix Chagué, Thibault Clérice and Laurent Romary. 2021. HTR-United : Mutualisons la vérité de terrain ! In DHNord2021 - Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux. Lille, France.

Hugo Scheithauer, Alix Chagué, Simon Gabay, Laurent Romary, Juliette Janes and Claire Jahan. 2021. From page to content–which TEI representation for HTR output? In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Weaton (virtual), United States.

Alexandre Bartz, Juliette Janes, Laurent Romary, Philippe Gambette, Rachel Bawden, Pedro Ortiz Suarez, Benoît Sagot and Simon Gabay. 2021. Expanding the content model of annotationBlock. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.

Simon Gabay, Philippe Gambette, Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2021. Variation graphique dans les documents d'Ancien Régime : Nouvelles approches scriptométriques. In Journée d'étude : « Pour une histoire de la langue ‘par en bas': textes privés et variation des langues dans le passé »; Paris, France.

Chadi Helwe, Chloé Clavel and Fabian Suchanek. 2021. Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning. In 2021 International Conference on Automated Knowledge Base Construction (AKBC). Virtual, United States.

Recent years have seen impressive performance of transformer-based models on different natural language processing tasks. However, it is not clear to what degree the transformers can reason on natural language. To shed light on this question, this survey paper discusses the performance of transformers on different reasoning tasks, including mathematical reasoning, commonsense reasoning, and logical reasoning. We point out successes and limitations, of both empirical and theoretical nature.
Jean-Damien Généro, Alix Chagué, Victoria Le Fourner and Marie Puren. 2021. Transcribing and editing digitized sources on work in the textile industry. In Rémunérations et usages du temps des hommes et des femmes dans le textile en France de la fin du XVIIe au début du XXe siècle. Lyon, France.

Historians have been using digital tools for several decades. The Time-Us project has been part of this long tradition by developing experimental methods of automatic transcription (OCR) and structuring (XML) of handwritten archival documents and book collections. The sets chosen to illustrate this work are the minutes of the Conseil des prud'hommes de Paris (1847-1848, 1858, 1878) and the monographs of the Ouvriers des deux mondes (1857-1913, 1930). Two stages will be presented. The first is the process of analysis and reproduction of logical structures (minutes of the labor court hearings and sections of the monographs), conducted on a ridge between the machine (automation of tasks) and the human hand (manual verifications and corrections). The second is the extraction of textile-related information from the monographs and its availability to researchers. Finally, proposals will be made regarding the possible uses of digital technology in research programs.
Simon Gabay and Pedro Javier Ortiz Suárez. 2021. A dataset for automatic detection of places in (early) modern French texts. In Proceedings of the 50th Annual North American Society for Seventeenth-Century French Literature Conference. Online.

Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. In WPIP21 - What's Past is Prologue: The NewsEye International Conference. Virtual, Austria.

Automating the processing of documents for online publication and exploration by the humanities speeds up tasks such as transcription, but it should also be an opportunity to make the experimentation and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) relies on a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. By taking the example of egodocuments, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for the processing of non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notions of a black box or scattered tools which tend to be hard to maintain in the long run.
Alix Chagué and Aurélia Rostaing. 2021. Présentation du projet Lectaurep (Lecture automatique de répertoires). In Atelier sur la transcription des écritures manuscrites - BnF DataLab. Paris, France.

Tech reports

Julien Launay, Elena Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli and Djamé Seddah. 2021. PAGnol: An Extra-Large French Generative Model. Technical report.

Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstractive summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.
Toma Tasovac, Laurent Romary, Erzsébet Tóth-Czifra and Irena Marinski. 2021. Lexicographic Data Seal of Compliance. Technical report.

Other

Alix Chagué. 2021. Comment faire lire des gribouillis à mon ordinateur ?

Preprints

Floriane Chiffoleau. 2021. Keeping it open: a TEI-based publication pipeline for historical documents. Preprint.

Following the emergence of numerous projects to exploit historical archives, books or similar contents, as well as the exponential needs for digital tools tailored for those tasks, the DAHN project (Dispositif de soutien à l'Archivistique et aux Humanités Numériques) developed a complete open-source pipeline made of tools and methods making it possible to present a digital scholarly edition of scanned handwritten material. Composed of six steps (digitization, segmentation, transcription, post-OCR processing, encoding, and publication) and centered on historical documents, and more particularly on ego documents, this pipeline has been built around TEI, which works as a pivot format, to ensure its robustness, sustainability, and reusability. More than just encoding in TEI, we also choose tools compatible with it, such as eScriptorium for segmentation/transcription or TEI Publisher for the publication. To further help the people working with the pipeline, we also heavily documented the development of the pipeline, as well as its steps, to ease its reuse.
Laurent Romary. 2021. Normes et patrimoine numérique. Preprint.

Thomas Scialom, Louis Martin, Jacopo Staiano, Eric Villemonte de La Clergerie and Benoît Sagot. 2021. Rethinking Automatic Evaluation in Sentence Simplification. Preprint.

Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, this is particularly challenging: the task by nature requires replacing complex words with simpler ones that share the same meaning. This limits the effectiveness of n-gram based metrics like BLEU. Going hand in hand with the recent advances in NLG, new metrics have been proposed, such as BERTScore for Machine Translation. In summarization, the QuestEval metric proposes to automatically compare two texts by questioning them. In this paper, we first propose a simple modification of QuestEval allowing it to tackle Sentence Simplification. We then extensively evaluate the correlations w.r.t. human judgement for several metrics including the recent BERTScore and QuestEval, and show that the latter obtains state-of-the-art correlations, outperforming standard metrics like BLEU and SARI. More importantly, we also show that a large part of the correlations are actually spurious for all the metrics. To investigate this phenomenon further, we release a new corpus of evaluated simplifications, this time not generated by systems but instead written by humans. This allows us to remove the spurious correlations and draw very different conclusions from the original ones, resulting in a better understanding of these metrics. In particular, we raise concerns about very low correlations for most traditional metrics. Our results show that the only significant measure of Meaning Preservation is our adaptation of QuestEval.
Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. Preprint.

Automating the processing of documents for online publication and exploration by the humanities speeds up tasks such as transcription, but it should also be an opportunity to make the experimentation and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) relies on a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. By taking the example of egodocuments, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for the processing of non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notions of a black box or scattered tools which tend to be hard to maintain in the long run.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. Preprint.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Benjamin Muller, Benoît Sagot and Djamé Seddah. 2021. Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi. Preprint.

Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.
Louis Martin, Angela Fan, Eric Villemonte de La Clergerie, Antoine Bordes and Benoît Sagot. 2021. Multilingual Unsupervised Sentence Simplification. Preprint.

Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.

2020

PhD theses and Habilitations

Mohamed Khemakhem. 2020. Standard-based lexical models for automatically structured dictionnaries. PhD thesis. Université Paris Cité.

Dictionaries could be considered as the most comprehensive reservoir of human knowledge, which carry not only the lexical description of words in one or more languages, but also the common awareness of a certain community about every known piece of knowledge in a time frame. Print dictionaries are the principal resources which enable the documentation and transfer of such knowledge. They already exist in abundant numbers, while new ones are continuously compiled, even with the recent strong move to digital resources. However, a majority of these dictionaries, even when available digitally, are still not fully structured due to the absence of scalable methods and techniques that can cover the variety of corresponding material. Moreover, the relatively few existing structured resources present limited exchange and query alternatives, given the discrepancy of their data models and formats. In this thesis we address the task of parsing lexical information in print dictionaries through the design of computer models that enable their automatic structuring. Solving this task goes hand in hand with finding a standardised output for these models to guarantee a maximum interoperability among resources and usability for downstream tasks. First, we present different classifications of the dictionaric resources to delimit the category of print dictionaries we aim to process. Second, we introduce the parsing task by providing an overview of the processing challenges and a study of the state of the art. Then, we present a novel approach based on a top-down parsing of the lexical information. We also outline the architecture of the resulting system, called GROBID-Dictionaries, and the methodology we followed to close the gap between the conception of the system and its applicability to real-world scenarios. After that, we draw the landscape of the leading standards for structured lexical resources.
In addition, we provide an analysis of two ongoing initiatives, TEI-Lex-0 and LMF, that aim at the unification of modelling the lexical information in print and electronic dictionaries. Based on that, we present a serialisation format that is in line with the schemes of the two standardisation initiatives and fits the approach implemented in our parsing system. After presenting the parsing and standardised serialisation facets of our lexical models, we provide an empirical study of their performance and behaviour. The investigation is based on a specific machine learning setup and series of experiments carried out with a selected pool of varied dictionaries. We try in this study to present different ways for feature engineering and exhibit the strength and the limits of the best resulting models. We also dedicate two series of experiments to exploring the scalability of our models with regard to the processed documents and the employed machine learning technique. Finally, we sum up this thesis by presenting the major conclusions and opening new perspectives for extending our investigations in a number of research directions for parsing entry-based documents.
Jack Bowers. 2020. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. PhD thesis. École Pratique des Hautes Études.

This dissertation concerns a language documentation project covering the Mixtepec-Mixtec variety of Mixtec (ISO 639-3: mix). Mixtepec-Mixtec is an Oto-Manguean language spoken by roughly 9000-10000 people in San Juan Mixtepec Municipality in the Juxtlahuaca district of Oaxaca, Mexico and by several thousand speakers living in Baja California, Tlaxiaco, Santiago Juxtlahuaca. There are also significant populations in the United States, most notably in California, around Santa Maria and Oxnard, as well as in Oregon, Florida, and Arkansas. The core facets of the work are: the creation of a body of linguistic resources for the MIX language and community; the evaluation of the current tools, standards and practices used in language documentation; an account of how the TEI and related XML technologies can be used as the primary encoding, metadata, and annotation format for multi-dimensional linguistic projects, including under-resourced languages. The concrete resources produced are: a multilingual TEI dictionary; a collection of audio recordings published and archived on Harvard Dataverse; a corpus of texts derived from a combination of spoken language transcriptions and texts encoded and annotated in TEI, as well as linguistic and lexicographic descriptions and analyses of the Mixtepec-Mixtec language. Due to the array of different data and resources produced, this project has components that equally fall within the fields of: digital humanities, language documentation, language description and corpus linguistics. Because of this overlapping relevance, and in the process of attempting to carry out this work in line with best practices in each sub-field, this work addresses the need to further bring together the intersecting interests, technologies, practices and standards relevant to, and used in each of these related fields.
Loïc Grobol. 2020. Coreference resolution for spoken French. PhD thesis. Université Sorbonne Nouvelle - Paris 3.

A coreference chain is the set of linguistic expressions, or mentions, that refer to the same entity or discourse object in a given document. Coreference resolution consists in detecting all the mentions in a document and partitioning their set into coreference chains. Coreference chains play a central role in the consistency of documents and interactions, and their identification has applications to many other fields in natural language processing that rely on an understanding of language, such as information extraction, question answering or machine translation. Natural language processing systems that perform this task exist for many languages, but none for French, which suffered until recently from a lack of suitable annotated resources, and none for spoken language. In this thesis, we aim to fill this gap by designing a coreference resolution system for spoken French. To this end, we propose a knowledge-poor system based on an end-to-end neural network architecture, which obviates the need for the preprocessing pipelines common in existing systems, while maintaining performance comparable to the state of the art. We then propose extensions on that baseline, by augmenting our system with external knowledge obtained from resources and preprocessing tools designed for written French. Finally, we propose a new standard representation for coreference annotation in corpora of written and spoken languages, and demonstrate its use in a new version of ANCOR, the first coreference corpus of spoken French.
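As a small illustration of the partitioning step only (not the thesis's neural system), the sketch below groups mentions into chains once pairwise coreference links are known. The function name `build_chains`, the union-find approach and the toy French mentions are all assumptions made for this example.

```python
def build_chains(mentions, links):
    """Partition mentions into coreference chains from pairwise links.

    Hypothetical helper: uses a simple union-find over mention strings.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the chain representative,
        # shortening the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())


# Toy input (made up): five mentions and two coreference links.
mentions = ["Marie", "elle", "le livre", "il", "sa soeur"]
links = [("Marie", "elle"), ("le livre", "il")]
chains = build_chains(mentions, links)
print(chains)
```

A mention with no link, such as "sa soeur" here, ends up in a chain of its own (a singleton).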

Journal articles

Xinying Chen and Kim Gerdes. 2020. Dependency Distances and Their Frequencies in Indo-European Language. Journal of Quantitative Linguistics pages 1–20. Taylor & Francis (Routledge).

The present study investigates the relationship between two features of dependencies, namely, dependency distances and dependency frequencies. The study is based on the analysis of a parallel dependency treebank that includes 10 Indo-European languages. Two corresponding random dependency treebanks are generated as baselines for comparison. After computing the values of dependency distances and their frequencies in these treebanks, for each language, we fit four functions, namely quadratic, exponent, logarithm, and power-law functions, to its original and random datasets. The preliminary result shows that there is a relation between the two dependency features for all 10 Indo-European languages. The relation can be further formalized as a power-law function which can distinguish the observed data from randomly generated datasets.
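To make the power-law relation concrete, here is a minimal sketch of fitting f(d) = a * d^b to distance/frequency pairs by least squares in log-log space. The abstract does not state the fitting procedure the authors used, so this method, the function name `fit_power_law` and the synthetic counts are assumptions for illustration only.

```python
import math


def fit_power_law(distances, frequencies):
    """Fit f = a * d**b via linear regression on log(d), log(f).

    Returns (a, b). Assumes all distances and frequencies are positive.
    """
    xs = [math.log(d) for d in distances]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept = log(a)
    return a, b


# Synthetic data (not from the study): exact power law with a=1000, b=-1.5.
distances = [1, 2, 3, 4, 5, 6]
frequencies = [1000.0 * d ** -1.5 for d in distances]
a, b = fit_power_law(distances, frequencies)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Comparing the goodness of fit of this curve on observed versus randomly generated treebanks is the kind of contrast the abstract describes; the regression above covers only the fitting step.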
Laurent Romary. 2020. Découpler gestion des manuscrits de publication et évaluation par les pairs : la plateforme de gestion de revues Épisciences. I2D -- Information, données & documents A.D.B.S.

Built on an original model, the Épisciences platform, which currently hosts 15 journals, offers a complete tool for managing a journal, hosting it and disseminating its content. It hosts open-access journals (épi-revues) and handles the submission of articles to these journals via deposit in an open archive such as HAL. Library and documentation staff play a decisive supporting role here.
Andrea Bertino, Luca Foppiano, Laurent Romary and Pierre Mounier. 2020. Leveraging Concepts in Open Access Publications. Journal of Data Mining and Digital Humanities 2019 INRIA.

This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI and provides automatic entity recognition and disambiguation using the Wikipedia and Wikidata data sets. The application is distributed with an open-source licence, and it has been deployed as a web service in DARIAH's infrastructure hosted by the French Huma-Num. In the paper, we focus on the specific issues related to its integration on five OA platforms specialized in the publication of scholarly monographs in the social sciences and humanities (SSH), as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). In the first section, we give a brief overview of the current status and evolution of OA publications, considering specifically the challenges that OA monographs are encountering. In the second part, we show how the HIRMEOS project aims to face these challenges by optimizing five OA digital platforms for the publication of monographs from the SSH and ensuring their interoperability. In sections three and four we give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases together with some further possible ideas on how to exploit the annotations generated. We show that entity-fishing annotations can improve both research and publishing processes. In the last section, we briefly present further possible application scenarios that could be made available through infrastructural projects.
Luca Foppiano and Laurent Romary. 2020. Entity-fishing: a DARIAH entity recognition and disambiguation service. Journal of the Japanese Association for Digital Humanities 5 pages 22–60. Japanese Association for Digital Humanities.

This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing an easy deployment on a variety of potential digital humanities contexts. Initially developed in the context of the FP7 EU project CENDARI, the software was well received by the user community and continued to be further developed within the H2020 HIRMEOS project where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access. entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API which allows easier and seamless integration, language independent and stable convention and a widely used service-oriented architecture (SOA) design. Input and output data are carried out over a query data model with a defined structure providing flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, like language recognition, sentence segmentation and modules for accessing and looking up concepts in the knowledge base. The API itself integrates more advanced contextual parametrisation or ranked outputs, allowing for the resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off proposal, which has been submitted for integration into the TEI guidelines. The representation is also compliant with the Web Annotation Data Model (WADM). In this paper we aim at describing the functionalities of the service as a reference contribution to the subject of web-based NERD services. We detail the workflow from input to output and unpack each building block in the processing flow. Besides, with a more academic approach, we provide a transversal schema of the different components taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. We also describe the underlying knowledge base, which is set up on the basis of Wikipedia and Wikidata content. We conclude the paper by presenting our solution for the service deployment: which resources were allocated and how. The service has been in production since Q3 of 2017, and extensively used by the H2020 HIRMEOS partners during the integration with the publishing platforms.

Conference proceedings

Hila Gonen, Ganesh Jawahar, Djamé Seddah and Yoav Goldberg. 2020. Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages 538–555. Association for Computational Linguistics. Online.

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and, as we show in this work, result in unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
Gaël Guibon, Marine Courtin, Kim Gerdes and Bruno Guillaume. 2020. When Collaborative Treebank Curation Meets Graph Grammars. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 5291–5300. European Language Resources Association. Marseille, France.

In this paper we present Arborator-Grew, a collaborative annotation tool for treebank development. Arborator-Grew combines the features of two preexisting tools: Arborator and Grew. Arborator is a widely used collaborative graphical online dependency treebank annotation tool. Grew is a tool for graph querying and rewriting specialized in structures needed in NLP, i.e. syntactic and semantic dependency trees and graphs. Grew also has an online version, Grew-match, where all Universal Dependencies treebanks in their classical, deep and surface-syntactic flavors can be queried. Arborator-Grew is a complete redevelopment and modernization of Arborator, replacing its own internal database storage by a new Grew API, which adds a powerful query tool to Arborator's existing treebank creation and correction features. This includes complex access control for parallel expert and crowd-sourced annotation, tree comparison visualization, and various exercise modes for teaching and training of annotators. Arborator-Grew opens up new paths of collectively creating, updating, maintaining, and curating syntactic treebanks and semantic graph banks.
Pedro Javier Ortiz Suárez, Yoann Dupont, Gaël Lejeune and Tian Tian. 2020. SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German. In CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. Thessaloniki / Virtual, Greece.

In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers. The challenge proposed various tasks for three languages; among them, we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that the segmentation in sentences has an important impact on the results.
Robin Algayres, Mohamed Salah Zaiem, Benoît Sagot and Emmanuel Dupoux. 2020. Evaluating the Reliability of Acoustic Speech Embeddings. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH 2020). pages 4621–4625.

Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimize the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
Tanti Kristanti and Laurent Romary. 2020. DeLFT and entity-fishing : Tools for CLEF HIPE 2020 Shared Task. In CLEF 2020 - Conference and Labs of the Evaluation Forum. 2696 CEUR. Thessaloniki / Virtual, Greece.

This article presents an overview of our approaches and results in the CLEF HIPE 2020 NERC-COARSE-LIT and EL-ONLY tasks for English and French. For these two tasks, we use two systems: 1) DeLFT, a Deep Learning framework for text processing; 2) entity-fishing, a generic named entity recognition and disambiguation service deployed in the technical framework of INRIA.
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot and Lucia Specia. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages 4668–4679. Association for Computational Linguistics. Online.

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings. To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages 7203–7219. Online.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot and Abhishek Srivastava. 2020. Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages 1139–1150. Association for Computational Linguistics. Online.

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it a challenging test-bed for most recent NLP approaches.
Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages 1703–1714. Association for Computational Linguistics. Online.

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Clémentine Fourrier. 2020. Évolution phonologique des langues et réseaux de neurones : travaux préliminaires (Sound change and neural networks: preliminary experiments). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 3 : Rencontre des Étudiants Chercheurs en Informatique pour le TAL. pages 110–122. ATALA et AFCP. Nancy, France.

Cognate prediction is a key task in historical linguistics that presents a number of similarities with machine translation. However, although neural methods are now widespread in machine translation, they are still largely unused in historical linguistics. In this paper, we study the performance of neural methods (more specifically encoder-decoder networks) for the task of cognate prediction. We focus in particular on the types of data that can be used for this task, and compare the performance of statistical and neural methods. We show that sound correspondences can only be learned using cognate datasets, and that statistical and neural methods seem to have complementary strengths and weaknesses regarding what they learn about the data.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Benoît Sagot and Djamé Seddah. 2020. Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CAMEMBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles. pages 54–65. ATALA et AFCP. Nancy, France.

Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages. This made the practical use of models in all languages except English very limited. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state of the art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance compared to models trained on more uniform data. Moreover, we show that a relatively small amount of web crawled data (4GB) leads to downstream performances as good as a model pretrained on a corpus two orders of magnitude larger (138GB).
Murielle Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot and Éric Villemonte de La Clergerie. 2020. French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus. In CMLC-8 - 8th Workshop on the Challenges in the Management of Large Corpora. Marseille, France.

This paper describes and compares the impact of different types and sizes of training corpora on language models like ELMo. By asking the fundamental question of quality versus quantity, we evaluate four French corpora for training on parsing, POS-tagging and named-entity recognition downstream tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language and therefore yield better evaluation scores on different evaluation sets and tasks.
Louis Martin, Éric Villemonte de La Clergerie, Benoît Sagot and Antoine Bordes. 2020. Controllable Sentence Simplification. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 4689–4698. European Language Resources Association. Marseille, France.

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.
Clémentine Fourrier and Benoît Sagot. 2020. Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 3207–3216. European Language Resources Association. Marseille, France.

Diachronic lexical information was mostly used in its natural field, historical linguistics, until recently, when promising but not yet conclusive applications to low resource languages machine translation started extending its usage to NLP. There is therefore a new need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation and medieval languages study.
Gaël Guibon and Benoît Sagot. 2020. OFrLex: A Computational Morphological and Syntactic Lexicon for Old French. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 3217–3225. European Language Resources Association. Marseille, France.

In this paper we describe our work on the development and enrichment of OFrLex, a freely available, large-coverage morphological and syntactic Old French lexicon. We rely on several heterogeneous language resources to extract structured and exploitable information. The extraction follows a semi-automatic procedure with substantial manual steps to respond to difficulties encountered while aligning lexical entries from distinct language resources. OFrLex aims at improving natural language processing tasks on Old French such as part-of-speech tagging and dependency parsing. We provide quantitative information on OFrLex and discuss its reliability. We also describe and evaluate a semi-automatic, word-embedding-based lexical enrichment process aimed at increasing the accuracy of the resource. Results of this extension technique will be manually validated in the near future, a step that will take advantage of OFrLex's viewing, searching and editing interface, which is already accessible online.
Fahad Khan, Laurent Romary, Ana Salgado, Jack Bowers, Mohamed Khemakhem and Toma Tasovac. 2020. Modelling Etymology in LMF/TEI: The Grande Dicionário Houaiss da Língua Portuguesa Dictionary as a Use Case. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 3172–3180. European Language Resources Association. Marseille, France.

In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot. 2020. Establishing a New State-of-the-Art for French Named Entity Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference. pages 4631–4638. European Language Resources Association. Marseille, France.

The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations.
Clémentine Fourrier and Benoît Sagot. 2020. Comparing Statistical and Neural Models for Learning Sound Correspondences. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages. pages 79–83. European Language Resources Association (ELRA). Marseille, France.

Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best performing model type depends on several parameters such as the size of the training data, the ambiguity, and the prediction direction.
Simon Gabay, Lucie Rondeau Du Noyer and Mohamed Khemakhem. 2020. Selling autograph manuscripts in 19th c. Paris: digitising the Revue des Autographes. In IX Convegno AIUCD. Milan, Italy.

In Paris, the manuscript market emerged in the early 1820s. Fixed-price catalogues and auction catalogues were regularly published, describing each document in detail. Because such descriptions are highly formalised, it is possible to extract and structure them (almost) automatically, and thus create a database of manuscripts sold in 19th c. Paris.

Communications

Gabriela Elgarrista, Frédérique Mélanie-Becquet, Carmen Brando, Mohamed Khemakhem, Laurent Romary and Jean-Luc Pinol. 2020. Pipeline to process and analyze Paris's old property address directories (XIXe -XXe). In CLARIN Annual Conference. Paris (en ligne), France.

Mohamed Khemakhem, Simon Gabay, Béatrice Joyeux-Prunel, Laurent Romary, Léa Saint-Raymond and Lucie Rondeau Du Noyer. 2020. Information Extraction Workflow for Digitised Entry-based Documents. In DARIAH Annual event 2020. Zagreb / Virtual, Croatia.

Book chapters

Benoît Sagot. 2020. A new PIE root *h₁er ‘(to be/become) dark red’. In Loanwords and Substrata. 164

Romain Garnier and Benoît Sagot. 2020. New results on a centum substratum in Greek: the Lydian connection. In Loanwords and Substrata. 164

Jennifer Edmond, Frank Fischer, Laurent Romary and Toma Tasovac. 2020. 9. Springing the Floor for a Different Kind of Dance. In Digital Technology and the Practices of Humanities Research. pages 207–234. Open Book Publishers.

Jennifer Edmond and Laurent Romary. 2020. 3. Academic Publishing. In Digital Technology and the Practices of Humanities Research. pages 49–80. Open Book Publishers.

Tech reports

Floriane Chiffoleau. 2020. Rapport d'avancement sur le projet DAHN (avec le soutien du MESRI). Technical report.

Other

Laurent Romary. 2020. Eléments de sciences ouvertes.

Lucas Terriel. 2020. Le saviez-vous ? Les répertoires de notaires ne sont pas seulement des images numérisées !

This post provides an overview of the data associated with the documents of LectAuRep (Automatic reading of directories), a project coordinated by INRIA (ALMAnaCH project team) and the National Archives, which applies handwritten text recognition techniques to notaries' directories. The post is part of a broader reflection on the creation of a TEI pivot format to centralize the metadata associated with the documents and the metadata generated during image processing with the eScriptorium transcription platform.
Jean-Damien Généro. 2020. Le corpus des Ouvriers des deux mondes : des images et des URLs.

Although archival documents make up the bulk of the Time us project, they are not its only source material. Printed matter is also represented, in three substantial sets: the collection of historical Lyon newspapers, various printed works on the textile industry in 19th-century France, and the Ouvriers des deux mondes corpus. The Ouvriers des deux mondes are sociological surveys divided into 3 series and 126 monographs. Initiated by the sociologist Frédéric Le Play (1806-1882), the publication was carried out by the Société internationale des études pratiques d'économie sociale from 1857 to 1928 and amounts to 13 volumes in total. These are now fully available on the Internet Archive website. In this post we look at the transcription files for these volumes and at the link between those files and the original digitised images. The lse od2m script, written by Alix Chagué, automatically segmented and transcribed the images, then encoded and structured the resulting raw text in XML-TEI; the output was 13 XML files. These "source" files were then split into 222 XML files corresponding to as many logical divisions of the volumes: the monographs, of course, but also the introductions, tables of contents and other paratextual elements. Verification operations reduced the number of files to 192.
Alix Chagué, Lucas Terriel and Laurent Romary. 2020. Des images au texte : LECTAUREP, un projet de reconnaissance automatique d'écriture.

Laurent Romary. 2020. Les données de la recherche.

As part of Open Access Week, a presentation of current developments in research data management, in particular in the context of the ministry's open science plan.
Laurent Romary. 2020. Multilingual content management and standards with a view on AI developments.

Laurent Romary. 2020. An editorial and technical journey into Post Publication Peer Review (PPPR).

Laurent Romary. 2020. TEI guidelines: born to be open.

Open science has never been so high on the research agendas, and this is true in all fields, ranging from so-called hard sciences to the humanities. In this respect, those who have been dealing with the TEI guidelines for years, whether as users or designers of the standard, have experienced an environment which has always been open by construction and fostering openness for projects based upon its principles. We outline the main issues related to open science in the current scholarly landscape, whether political or technical, and show the various aspects where the TEI environment has been seminal in setting up an open agenda that may enlighten the humanities at large in terms of good practices, for, e.g., managing, documenting or disseminating scholarly sources and methods.

Preprints

Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot and Djamé Seddah. 2020. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Preprint.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages improves very significantly the ability of large-scale multilingual language models on downstream tasks.
Erzsébet Tóth-Czifra and Laurent Romary. 2020. The Heritage Data Reuse Charter: from principles to research workflows. Preprint.

There is a growing need to establish domain- or discipline-specific approaches to research data sharing workflows. A defining feature of data and data workflows in the arts and humanities domain is their dependence on cultural heritage sources hosted and curated in museums, libraries, galleries and archives. A major difficulty when scholars interact with heritage data is that the nature of the cooperation between researchers and Cultural Heritage Institutions (henceforth CHIs) is often constrained by structural and legal challenges but even more by uncertainties as to the expectations of both parties. The Heritage Data Reuse Charter aims to address these by designing a common environment that will enable all the relevant actors to work together to connect and improve access to heritage data and make transactions related to the scholarly use of cultural heritage data more visible and transparent. As a first step, a wide range of stakeholders in the Cultural Heritage and research sectors agreed upon a set of generic principles, summarized in the Mission Statement of the Charter, that can serve as a baseline governing the interactions between CHIs, researchers and data centres. This was followed by a long and thorough validation process related to these principles through surveys and workshops. As a second step, we now put forward a questionnaire template tool that helps researchers and CHIs to translate the 6 core principles into specific research project settings. It contains questions about access to data, provenance information, preferred citation standards, hosting responsibilities etc., on the basis of which the parties can arrive at mutual reuse agreements that could serve as a starting point for FAIR-by-construction data management, right from the project planning/application phase. The questionnaire template and the resulting mutual agreements can be flexibly applied to projects of different scale and in platform-independent ways. Institutions can embed them into their own exchange protocols while researchers can add them to their Data Management Plans. As such, they can show evidence for responsible and fair conduct of cultural heritage data, and fair (but also FAIR) research data management practices that are based on partnership with the holding institution.

2019

Journal articles

Romain Garnier and Benoît Sagot. 2019. Metathesis of Proto-Indo-European Sonorants. Münchener Studien zur Sprachwissenschaft 73 pages 29–53. Verlag J.H. Röll GmbH.

Detlef Reineke and Laurent Romary. 2019. Bridging the gap between SKOS and TBX. edition - Die Fachzeitschrift für Terminologie 19 Deutscher Terminologie-Tag e.V. (DTT).

This article provides an in-depth comparison and proposal for mapping between Simple Knowledge Organization System (SKOS) and TermBase eXchange (TBX), two important exchange standards within the knowledge and terminology landscape. The attempt to develop an interface or conversion routine between SKOS and TBX is rooted in a strong demand in the language and knowledge industries for resource leverage and based on the premise that the two formalisms are governed by similar data models, namely the description of concepts (rather than words).
Laurent Romary and Charles Riondet. 2019. Towards multiscale archival digital data. Umanistica digitale AIUCD - Associazione per l'Informatica Umanistica e la Cultura Digitale.

In this paper, we present some ideas on the use of archival standards in various contexts that exemplify the complexity of such standards and provide users with innovative ways to handle EAD content. Our main idea is that researchers, cultural heritage institutions, archival portals and standards maintenance bodies could greatly benefit from a multiscale modelling of archival data, but also from multiscale representations and documentation. A first step is under way in the management of heterogeneous archival sources in one single environment, namely a federated portal such as EHRI. We built a methodology based on a specification and customisation method inspired by the long-standing experience of the Text Encoding Initiative (TEI) community. In the TEI framework, one has the possibility of defining project-specific subsets or extensions of the TEI guidelines while maintaining both the technical (XML schemas) and editorial (documentation) specification within a single framework. Using the same framework for EAD data allows us to express precise content-oriented rules combined with some interesting possibilities of integrating the human-readable documentation in the validation process.

Conference proceedings

Kim Gerdes, Bruno Guillaume, Sylvain Kahane and Guy Perrier. 2019. Pourquoi se tourner vers le SUD : L'importance de choisir un schéma d'annotation en dépendance surface-syntaxique. In LIFT 2019 - Journées scientifiques « Linguistique informatique, formelle & de terrain ». Orléans, France.

Why you should turn to SUD: the importance of choosing a Surface-Syntactic dependency annotation scheme. The article attempts to promote the Surface-Syntactic Universal Dependencies (SUD) annotation scheme to syntactic annotation projects, as an alternative to the standard Universal Dependencies (UD) scheme, particularly on oral or non-standard texts, conducted for comparative and typological studies.
Yoann Dupont. 2019. Un corpus libre, évolutif et versionné en entités nommées du français. In TALN 2019 - Traitement Automatique des Langues Naturelles. Toulouse, France.

A free, evolving and versioned French named entity recognition corpus. Annotated corpora are very hard resources to make because of the high human cost they imply. Once released, they are hardly modifiable and tend not to evolve over time. In this article we present a free and evolving corpus annotated in named entity recognition based on French Wikinews articles from 2016 to 2018, for a total of 1191 articles. We will briefly describe the annotation guidelines before comparing our corpus to various corpora of comparable nature. We will also give an intra-annotator agreement to provide an estimation of the stability of the annotation as well as the overall process to develop the corpus.
Laurent Romary. 2019. The place of lexicography in (computer) science. In The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI. Leiden, Netherlands.

Luca Foppiano, Laurent Romary, Masashi Ishii and Mikiko Tanifuji. 2019. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. In DocEng '19 - ACM Symposium on Document Engineering 2019. pages 1–4. ACM Press. Berlin, Germany.

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the CRF (Conditional Random Field) algorithm [12] for extracting quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised materials characteristics (such as critical temperature, pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
Benjamin Muller, Benoit Sagot and Djamé Seddah. 2019. Enhancing BERT for Lexical Normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019). pages 297–306. Association for Computational Linguistics. Hong Kong, China.

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, it is the first work done in adapting and analysing the ability of this model to handle noisy UGC data.
Hervé Bohbot, Francesca Frontini, Fahad Khan, Mohamed Khemakhem and Laurent Romary. 2019. Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource. In ELEX 2019: smart lexicography. Sintra, Portugal.

The Petit Larousse Illustré (PLI) is a monolingual French dictionary which has been published every year since the 1906 edition and which is therefore a fundamental testimony of the evolution of the French language. As a consequence of the pre-1948 editions of the PLI entering the public domain in 2018, the Nénufar (“Nouvelle édition numérique de fac-similés de reference”) project was launched at the Praxiling laboratory in Montpellier with the aim of digitising these editions and making them available electronically. The project is still ongoing; various selected editions per decade are going to be fully digitised (so far the 1906, 1924 and 1925 editions have been completed), and changes backtracked and dated per specific year.
Lucie Rondeau Du Noyer, Simon Gabay, Mohamed Khemakhem and Laurent Romary. 2019. Scaling up Automatic Structuring of Manuscript Sales Catalogues. In TEI 2019: What is text, really? TEI and beyond. Graz, Austria.

Manuscript Sales Catalogues (MSC) are highly important for authenticating documents and studying the reception of authors. Their regular publication throughout Europe since the beginning of the 19th c. has consequently raised the interest around scaling up the means for automatically structuring their contents. Following successful first encoding tests with GROBID-Dictionaries [1,2] on a single MSC collection [3], we aim in this paper to present the results of more advanced tests of the system’s capacity to handle a larger corpus with MSC of different dealers, and therefore multiple layouts.
Fernando Alva-Manchego, Louis Martin, Carolina Scarton and Lucia Specia. 2019. EASSE: Easier Automatic Sentence Simplification Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. pages 49–54. Association for Computational Linguistics. Hong Kong, China.

We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus). Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.
Mathilde Regnault, Sophie Prévost and Eric Villemonte de la Clergerie. 2019. Challenges of language change and variation: towards an extended treebank of Medieval French. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019). pages 144–150. Association for Computational Linguistics. Paris, France.

In order to automatically extend a treebank of Old French (9th-13th c.) with new texts in Old and Middle French (14th-15th c.), we need to adapt tools for syntactic annotation. However, these stages of French are subject to great variation, and parsing historical texts remains an issue. We chose to adapt a symbolic system, the French Metagrammar (FRMG), and develop a lexicon comparable to the Lefff lexicon for Old and Middle French. The final goal of our project is to model the evolution of language through the whole period of Medieval French (9th-15th c.).
Benoit Crabbé, Murielle Fabre and Christophe Pallier. 2019. Variable beam search for generative neural parsing and its relevance for the analysis of neuro-imaging signal. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pages 1150–1160. Association for Computational Linguistics. Hong Kong, China.

This paper describes a method of variable beam size inference for Recurrent Neural Network Grammar (RNNG) by drawing inspiration from sequential Monte-Carlo methods such as particle filtering. The paper studies the relevance of such methods for speeding up the computations of direct generative parsing for RNNG. It also studies the potential cognitive interpretation of the underlying representations built by the search method (beam activity) through analysis of neuro-imaging signal.
Yixuan Li, Kim Gerdes and Dong Chuanming. 2019. Character-level Annotation for Chinese Surface-Syntactic Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019). pages 216–226. Association for Computational Linguistics. Paris, France.

This paper presents a new schema to annotate Chinese Treebanks on the character level. The original Universal Dependencies (UD) and Surface-Syntactic Universal Dependencies (SUD) projects provide token-level resources with rich morphosyntactic language details. However, without any commonly accepted word definition for Chinese, the dependency parsing always faces the dilemma of word segmentation. Therefore we present a character-level annotation schema integrated into the existing Universal Dependencies schema as an extension.
Kim Gerdes, Sylvain Kahane and Xinying Chen. 2019. Rediscovering Greenberg's Word Order Universals in UD. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019). pages 124–131. Association for Computational Linguistics. Paris, France.

This paper discusses an empirical refoundation of selected Greenbergian word order universals based on a data analysis of the Universal Dependencies project. The nature of the data we work on allows us to extract rich details for testing well-known typological universals and constitutes therefore a valuable basis for validating Greenberg's universals. Our results show that we can refine some Greenbergian universals in a more empirical and accurate way by means of a data-driven typological analysis.
Jack Bowers, Mohamed Khemakhem and Laurent Romary. 2019. TEI Encoding of a Classical Mixtec Dictionary Using GROBID-Dictionaries. In ELEX 2019: Smart Lexicography. Sintra, Portugal.

This paper presents the application of GROBID-Dictionaries (Khemakhem et al. 2017, Khemakhem et al. 2018a, Khemakhem et al. 2018b, Khemakhem et al. 2018c), an open source machine learning system for automatically structuring print dictionaries in digital format into TEI (Text Encoding Initiative), to a historical lexical resource of Colonial Mixtec, 'Voces del Dzaha Dzahui', published by the Dominican fray Francisco Alvarado in the year 1593. The GROBID-Dictionaries application was applied to a reorganized and modernized version of the historical resource published by Jansen and Perez Jiménez (2009). The TEI dictionary produced will be integrated into a language documentation project dealing with Mixtepec-Mixtec (ISO 639-3: mix) (Bowers & Romary, 2017, 2018a, 2018b), an under-resourced indigenous language native to the Juxtlahuaca district of Oaxaca, Mexico.
Ganesh Jawahar and Djamé Seddah. 2019. Contextualized Diachronic Word Representations. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change. pages 35–47. Association for Computational Linguistics. Florence, Italy.

Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning across several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than what is usually the norm in diachronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, that are conditioned on contextual extra-linguistic (social) features such as network, spatial and socioeconomic variables, which are associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and therefore is a promising tool to study diachronic semantic changes over small time periods.
Marco Dinarelli and Loïc Grobol. 2019. Modèles neuronaux hybrides pour la modélisation de séquences : le meilleur de trois mondes. In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs. pages 127–142. ATALA. Toulouse, France.

We propose a neural architecture with the main characteristics of the most successful neural models of recent years: bidirectional RNNs, encoder-decoder, and the Transformer model. Evaluation on three sequence labelling tasks yields results that are close to the state-of-the-art for all tasks and better than it for some of them, showing the pertinence of this hybrid architecture for this kind of task.
Loïc Grobol. 2019. Neural Coreference Resolution with Limited Lexical Context and Explicit Mention Detection for Oral French. In Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference. pages 8–14. Association for Computational Linguistics. Minneapolis, USA.

We propose an end-to-end coreference resolution system obtained by adapting neural models that have recently improved the state-of-the-art on the OntoNotes benchmark to make them applicable to other paradigms for this task. We report the performance of our system on ANCOR, a corpus of transcribed oral French, for which it constitutes a new baseline with proper evaluation.
Benoît Sagot. 2019. Développement d'un lexique morphologique et syntaxique de l'ancien français (Development of a morphological and syntactic lexicon of Old French). In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts. pages 265–274. ATALA. Toulouse, France.

In this paper we describe our work on the development of a large-scale morphological and syntactic lexicon of Old French for natural language processing. We rely on dictionary and lexical resources, from which the extraction of structured and exploitable information required specific developments. In addition, matching information from these different sources posed difficulties. We provide quantitative information on the resulting lexicon, and discuss its reliability in its current version and the prospects for improvement allowed by the existence of a first version, in particular through the automatic analysis of textual data.
Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache. Cardiff, United Kingdom.

Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
Mathilde Regnault. 2019. Adaptation d'une métagrammaire du français contemporain au français médiéval (Adapting an existing metagrammar for Contemporary French to Medieval French). In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume III : RECITAL. pages 459–472. ATALA. Toulouse, France.

Medieval French is characterized by strong language variation. Our purpose is to extend a corpus of Old French annotated with dependency syntax with new texts of this period and add texts of Middle French. In order to achieve this, we want to adapt existing tools instead of training a parser with annotated data. In this article, we present a state of the art for this project and our solution: adapting the French Metagrammar (FRMG) to former states of language.
Ganesh Jawahar, Benoît Sagot and Djamé Seddah. 2019. What Does BERT Learn about the Structure of Language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pages 3651–3657. Association for Computational Linguistics. Florence, Italy.

BERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. We first show that BERT's phrasal representation captures phrase-level information in the lower layers. We also show that BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. BERT turns out to require deeper layers when long-distance dependency information is required, e.g. to track subject-verb agreement. Finally, we show that BERT representations capture linguistic information in a compositional way that mimics classical, tree-like structures.
Laurent Romary, Mohamed Khemakhem, Fahad Khan, Jack Bowers, Nicoletta Calzolari, Monte George, Mandy Pet and Piotr Bański. 2019. LMF Reloaded. In AsiaLex 2019: Past, Present and Future. Istanbul, Turkey.

Lexical Markup Framework (LMF) or ISO 24613 [1] is a de jure standard that provides a framework for modelling and encoding lexical information in retrodigitised print dictionaries and NLP lexical databases. An in-depth review is currently underway within the standardisation subcommittee, ISO-TC37/SC4/WG4, to find a more modular, flexible and durable follow-up to the original LMF standard published in 2008. In this paper we will present some of the major improvements which have so far been implemented in the new version of LMF.
Anas Fahad Khan, Hervé Bohbot, Francesca Frontini, Mohamed Khemakhem and Laurent Romary. 2019. Historical Dictionaries as Digital Editions and Connected Graphs: the Example of Le Petit Larousse Illustré. In Digital Humanities 2019. Utrecht, Netherlands.

Marco Dinarelli and Loïc Grobol. 2019. Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling. In CICLing 2019 - 20th International Conference on Computational Linguistics and Intelligent Text Processing. La Rochelle, France.

During the last couple of years, Recurrent Neural Networks (RNN) have reached state-of-the-art performances on most of the sequence modelling problems. In particular, the sequence to sequence model and the neural CRF have proved to be very effective in this domain. In this article, we propose a new RNN architecture for sequence labelling, leveraging gated recurrent layers to take arbitrarily long contexts into account, and using two decoders operating forward and backward. We compare several variants of the proposed solution and their performances to the state-of-the-art. Most of our results are better than the state-of-the-art or very close to it and thanks to the use of recent technologies, our architecture can scale on corpora larger than those used in this work.
Jack Bowers and Laurent Romary. 2019. TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language. In 6th International Conference on Language Documentation and Conservation (ICLDC). Honolulu, United States.

Communications

Alix Chagué, Victoria Le Fourner, Manuela Martini and Éric Villemonte de La Clergerie. 2019. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? In Colloque DHNord 2019 « Corpus et archives numériques ». Lille, France.

Murielle Fabre, Yoann Dupont and Éric Villemonte de La Clergerie. 2019. Syntactic Parsing versus MWEs: What can fMRI signal tell us. In PARSEME-FR 2019 consortium meeting. Blois, France.

Xinying Chen and Kim Gerdes. 2019. The relation between dependency distance and frequency. In Quasy 2019, Quantitative Syntax 2019, Syntaxfest. Paris, France.

This pilot study investigates the relationship between dependency distance and frequency based on the analysis of an English dependency treebank. The preliminary result shows that there is a non-linear relation between dependency distance and frequency. This relation can be further formalized as a power law function which can be used to predict the distribution of dependency distance in a treebank.
José Carlos Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2019. Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content. In Proceedings of the 22nd Nordic Conference on Computational Linguistics. pages 2–14. Linköping University Electronic Press. Turku, Finland.

This work compares the performances achieved by Phrase-Based Statistical Machine Translation systems (PBSMT) and attention-based Neural Machine Translation systems (NMT) when translating User Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.
Géraldine Walther and Benoît Sagot. 2019. Morphological complexities. In Proceedings of the 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Florence, Italy.

Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary and Pedro Javier Ortiz Suárez. 2019. How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures. In 19th annual Conference and Members' Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond. Graz, Austria.

Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2019. Preparing the Dictionnaire Universel for Automatic Enrichment. In 10th International Conference on Historical Lexicography and Lexicology (ICHLL). Leeuwarden, Netherlands.

The Dictionnaire Universel (DU) is an encyclopaedic dictionary originally written by Antoine Furetière around 1676-78, later revised and improved by the Protestant jurist Henri Basnage de Beauval, who expanded and corrected the Dictionnaire and included terms of arts, crafts and sciences. The aim of the BASNUM project is to digitize the DU in its second edition rewritten by Basnage de Beauval, to analyse it with computational methods in order to better assess the importance of this work for the evolution of sciences and mentalities in the 18th century, and to contribute to the contemporary movement for creating innovative and data-driven computational methods for text digitization, encoding and analysis. Based on the experience acquired within the research group, an enrichment workflow based upon a series of Natural Language Processing processes is being set up to be applied to Basnage's work. This includes, among others, automatic identification of the dictionary structure (macro-, meso- and microstructure), named-entity recognition (in particular persons and locations), classification of dictionary entries, detection and study of polysemy markers, tracking and classification of quotation use (bibliographic references), and scoring semantic similarity between the DU and other dictionaries. The main challenges are the lack of available annotated data for training machine learning models, decreased accuracy when using modern pre-trained models due to the differences between present-day and 18th century French, and unreliable or low-quality OCRisation. The paper describes methods that are useful to tackle these issues in order to prepare the DU for automatic enrichment going beyond what currently available tools like Grobid-dictionaries can do, thanks to the advent of deep learning NLP models. The paper also describes how these methods could be applied to other dictionaries or even other types of ancient texts.
Sheena Bassett, Leon Wessels, Steven Krauwer, Bente Maegaard, Hella Hollander, Femmy Admiraal, Laurent Romary and Frank Uiterwaal. 2019. Connecting the Humanities through Research Infrastructures. In 4th Digital Humanities in the Nordic Countries (DHN 2019). Copenhagen, Denmark.

Several Research Infrastructures (RIs) exist in the Humanities and Social Sciences, some of which, such as CLARIN, DARIAH and CESSDA, address specific areas of interest, i.e. linguistic studies, digital humanities and social science data archives. RIs are also unique in their scope and application, largely tailored to their specific community needs. However, commonalities do exist and it is recognised that benefits are to be gained from these, such as efficient use of resources, enabling multi-disciplinary research and sharing good practices. As such, a bridging project, PARTHENOS, has worked closely with CLARIN and DARIAH as well as ARIADNE (archaeology), CENDARI (history), EHRI (holocaust studies) and E-RIHS (heritage science) to identify, develop and promote these commonalities. In this paper, we present some specific examples of cross-discipline and trans-border applications arising from joint RI collaboration, allowing for entirely new avenues of research.

Books

Kim Gerdes and Sylvain Kahane. 2019. Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Book chapters

Kim Gerdes, Sylvain Kahane, Rachel Bawden, Julie Beliao, Éric Villemonte de La Clergerie and Ilaine Wang. 2019. Annotation tools for syntax. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

This chapter is devoted to the presentation of the tools and methods used for the different steps of the semi-automatic syntactic annotation: automatic preprocessing; microsyntactic parsing with the FRMG tool, correction of the parsing with the Arborator tool, agreement analysis, post-validation correction, and development of the final format of the Rhapsodie syntactic treebank. As FRMG is a parser for written French that was not configured to analyze disfluencies and reformulation, we used our manual pile marking to unfold the piles and produce a series of simplified “sentences” with only government relations. Despite having two annotators plus a validator for the corrections, we found a substantial number of errors in the post-validation procedure by using a set of rules to determine the well-formedness of the trees.
Sylvain Kahane, Paola Pietrandrea and Kim Gerdes. 2019. The annotation of list structures. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

This chapter presents phenomena we call “piles” or “lists”, which are characterized by the fact that a list of elements piles up in the same syntactic position. We therefore group the analysis of coordination together with the analysis of other phenomena such as reformulation, disfluency, partial answer, or negotiation. The elements of a pile are linked to one another by a relation that is both syntagmatic (they follow one another) and paradigmatic (they fill the same syntactic slot with respect to their common governor). The syntactic analysis of the other elements – junctors, paradigmatic adverbs, and list completers – is discussed. We also propose a typology of the different cases of pile structure and introduce the seven subcases of paradigmatic links taken into account in the annotation.
Sylvain Kahane, Kim Gerdes and Rachel Bawden. 2019. The microsyntactic annotation. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

This chapter describes the microsyntactic analysis of the Rhapsodie corpus in terms of dependency syntax. Microsyntax studies the relations between words that are characterized by a strong syntactic cohesion, traditionally called “government”. The different steps in the annotation are presented: segmentation into words and labeling in parts of speech, dependency structure, and basic syntactic functions. We justify our decision to use a small set of tags without redundancies, but to introduce a predicative relation for elements that form a complex predicate including copula, auxiliaries, and some modal verbs. Complex cases of annotations such as extraction (relative and interrogative clauses, cleft sentences) and negation are also presented. In addition, we show how a constituent structure can be computed from the dependency structure.
Laurent Romary and Jennifer Edmond. 2019. A Tangential View on Impact for the Arts and Humanities through the Lens of the DARIAH-ERIC. In Stay Tuned To The Future - Impact of the Research Infrastructures for Social Sciences and Humanities. Leo S. Olschki Editore.

The reflections in this chapter stem from the perspective of the DARIAH-ERIC, a distributed infrastructure for the arts and humanities. They explore how impact can take a variety of forms not always considered when the term is applied in a strictly technocratic sense, and the idea that focussing on the user of a research infrastructure may not describe an optimal relationship from an impact perspective. The chapter concludes by presenting three frames of reference in which an infrastructure like DARIAH can have impact: to foster excellence through impact on researchers, promote fluidity through impact on policymakers, and support efficiency through impact on our partner organisations.

Tech reports

Anas Alaoui M'Darhri, Vincent Baillet, Bastien Bourineau, Alessio Calantropio, Gabriella Carpentiero, Medhi Chayani, Livio De Luca, Iwona Dudek, Bruno Dutailly, Hélène Gautier, Eleonora Grilli, Valentin Grimaud, Christoph Hoffmann, Adeline Joffres, Nenad Jončić, Michel Jordan, Justin J.L. Kimball, Adeline Manuel, Patrick Mcinerney, Imanol Muñoz Pandiella, Ariane Néroulidis, Erica Nocerino, Anthony Pamart, Costas Papadopoulos, Marco Potenziani, Emilie Saubestre, Roberto Scopigno, Dorian Seillier, Sarah Tournon-Valiente, Martina Trognitz, Jean-Marc Vallet and Chiara Zuanni. 2019. Share - Publish - Store - Preserve. Methodologies, Tools and Challenges for 3D Use in Social Sciences and Humanities. Technical report.

Through this White Paper, which gathers contributions from experts of 3D data as well as professionals concerned with the interoperability and sustainability of 3D research data, the PARTHENOS project aims at highlighting some of the current issues they have to face, with possible specific points according to the discipline, and potential practices and methodologies to deal with these issues. During the workshop, several tools to deal with these issues were introduced and confronted with the participants' experiences; this White Paper now intends to go further by also integrating participants' feedback and suggestions of potential improvements. Therefore, even if the focus is put on specific tools, the main goal is to contribute to the development of standardized good practices related to the sharing, publication, storage and long-term preservation of 3D data.

Other

Romain Garnier, Philippe Hattat and Benoît Sagot. 2019. What is Old is New again: PIE Secondary Roots with Fossilised Preverbs.

Prefixal productivity is attested in all Indo-European languages and is reconstructed in the proto-languages of all major Indo-European language families. It is particularly important for understanding the origin of many non-primary verbal roots in all these languages. Surprisingly, only a handful of etymons involving prefixation have been reconstructed at the PIE level. A systematic study of this word formation process in PIE remains to be carried out. It could result in a better understanding of the origin of a number of PIE roots, especially complex roots with limited attestation, and help explain attested words in daughter languages that need a convincing etymology. In our talk, we will show how a better understanding of the role of prefixes in (secondary) verbal root formation can result in new etymological insights. Such analyses have already been proposed for several examples. For instance, with the compensatory lengthening *Ce=HC- > *CV̄C-, Weiss (1993) analyses Lat. pālārī ‘to wander’ as reflecting *pe=h2lh2-ó- > *pālH-āye/o-. Another classical example of the same prefix is Arm. p‘law < *p‘ulaw < *pōlH-to based on *pe=h3lh1-, as also P.-Germ. *fall-an- ‘to fall’ < *pŏlle/o- (with Osthoff’s shortening) < *pōlle/o- < *pōlH-é/ó- (Praust 2005; Neri 2007; Kroonen 2013: 125–6; Dunkel 2014 II: 82). Other examples include PIE *pro=h1ed- ‘to devour’ > P.-Germ. *fr(a)-et-an- (Scheungraber 2016: §4), PIE *kom=pro=h1ṇḱ- ‘to bring’ > P.-Germ. *breng-an- (Kroonen 2013: 77), PIE *°kom=h1ep- ‘to give’ > P.-Germ. *geb-an- (Kortlandt 1992) and *pe=h2ṛk- > Lat. parcō (Weiss 1993; cf. Hitt. pē=ḫark- ‘to hold off’). We intend to show that this word formation process is far more widespread than usually thought. We shall discuss several examples, thus unifying sets of semantically similar roots and proposing novel etymologies for several difficult words.
These views may shed some light on the possibility that PIE may have been (once) a satellite-framed language.
Laurent Romary. 2019. Traçabilité des données d'expérience pour les matériaux anciens et patrimoniaux.

Murielle Fabre, Benoit Crabbe and Christophe Pallier. 2019. Variable beam search for generative neural parsing and its fit with neuro-imaging signal.

Murielle Fabre, Shohini Bhattasali, Christophe Pallier and John Hale. 2019. Modeling Conventionalization and Predictability in Multi-Word Expressions at Brain-level.

Although linguistic expressions have been binarized as compositional and non-compositional given the lack of compositional linguistic analysis, Multi-word Expressions (MWEs) demonstrate finer-grained degrees of conventionalization and predictability in psycholinguistics, which can be quantified through computational Association Measures, like Point-wise Mutual Information and Dice's Coefficient. In this study, fMRI recordings of naturalistic narrative comprehension are used to investigate to what extent these computational measures and the underlying cognitive processes they could reflect are observable during on-line naturalistic sentence processing.
Abhishek Srivastava, Benjamin Muller and Djamé Seddah. 2019. Unsupervised Learning for Handling Code-Mixed Data: A Case Study on POS Tagging of North-African Arabizi Dialect.

Language model pretrained representations are now ubiquitous in Natural Language Processing. In this work, we present some first results in adapting those models to Out-of-Domain textual data. Using Part-of-Speech tagging as our case study, we analyze the ability of BERT to model a complex North-African Dialect (Arabizi).
BERT and Arabizi: We run our experiments on the released base multilingual version of BERT (Devlin et al. 2018), which was trained on a concatenation of the Wikipedias of 104 languages. BERT has never seen any Arabizi. Arabizi is visibly related to French in BERT's embedding space.
Summary: Multilingual BERT can be used to build a decent Part-of-Speech tagger with a reasonable amount of annotated data. Unsupervised adaptation improves performance (+1) in downstream POS tagging.
Research questions: Is BERT able to model Out-of-Domain languages such as Arabizi? Can we adapt BERT in an unsupervised way to Arabizi?
What is Arabizi? Dialectal Arabic is a variety of Classical Arabic that varies from one region to another and is only spoken. Darija is the variety spoken in the Maghreb (Algeria, Tunisia, Morocco). Arabizi is the name given to dialectal Arabic transliterated into Latin script, mostly found online.
Key property, high variability: no fixed spelling, morphological or syntactic norms; strong influence from foreign languages; code-switching between French and Darija.
Collecting and filtering raw Arabizi data: We bootstrap a data set for Arabizi starting from 9000 sentences collected by Cotterell et al. (2014). Using keyword scraping, we collect 1 million UGC sentences comprising French, English and Arabizi. We filter 200k Arabizi sentences out of the raw corpus (94% F1 score) using our language identifier.
Unsupervised fine-tuning of BERT on Arabizi: We fine-tune BERT (MLM objective) on the 200k Arabizi sentences. Figure 2: Validation accuracy (Masked Language Model) while fine-tuning BERT on Arabizi data (200k sentences), plotted against iterations (×1000).
Lexical normalization: We train a clustering lexical normalizer using edit and word2vec distances. This degrades downstream performance in POS tagging.
A new treebank: The first bottleneck in analyzing such a dialect is the lack of annotated resources. We developed a CoNLL-U treebank that includes Part-of-Speech tags, dependencies, and translations for 1500 sentences (originally posted on Facebook, in the Echorouk newspaper…).
Results (final performance; accuracy reported on the test set, averaged over 5 runs):
Baseline (udpipe): 73.7
Baseline + Normalization (udpipe): 72.4
BERT + POS tuning: 77.3
BERT + POS tuning + Normalization (udpipe): 69.9
BERT + Unsupervised Domain fine-tuning + POS tuning: 78.3
Example: Arabizi "vive mca w nchalah had l'3am championi"; English "long live MCA and I hope that this year we will be champions".
Laurent Romary. 2019. The TEI as a modeling infrastructure: TEI beyond the TEI realms.

Whereas the Text Encoding Initiative (TEI) has become the reference standard for encoding textual material of all kinds in the humanities, the power of the underlying TEI modelling infrastructure to deal with professional document management scenarios or even non-TEI based vocabularies deserves more attention. The aim of my presentation will be to show concrete projects where I have contributed to using the ODD (One Document Does it all) specification language of the TEI to deal with such applications as the management of patent documents, the modelling of lexical resources or the integration of heterogeneous archival descriptions in the EAD (Encoded Archival Description) standard. Starting from an introduction of the TEI as a standard, I will try to conclude on its potential bright future as a real infrastructure for the humanities.
Laurent Romary, Damien Biabiany, Klaus Illmayer, Marie Puren, Charles Riondet, Dorian Seillier and Lionel Tadjou. 2019. SSK by example - Make your Arts and Humanities research go standard.

Preprints

Laurent Romary, Dorian Seillier and Erzsébet Tóth-Czifra. 2019. Reuse agreement template between Cultural Heritage Institutions and researchers. Preprint.

A defining feature of data and data workflows in the arts and humanities domain is their dependence on cultural heritage sources hosted and curated in museums, libraries, galleries and archives. A major difficulty when scholars interact with heritage data is that the nature of the cooperation between researchers and Cultural Heritage Institutions (henceforth CHIs), including the researchers working in CHIs, is often constrained by structural and legal challenges but even more by uncertainties as to the expectations of both parties. This recognition led several European organizations such as APEF, CLARIN, Europeana, E-RIHS to come together and join forces under the governance of DARIAH to set up principles and mechanisms for improving the conditions for the use and re-use of cultural heritage data issued by cultural heritage institutions and studied and enriched by researchers. As a first step of this joint effort, the Heritage Data Reuse Charter (https://datacharter.hypotheses.org/) establishes 6 basic principles for improving the use and re-use of cultural heritage resources by researchers and for helping all the relevant actors to work together to connect and improve access to heritage data. These are: Reciprocity, Interoperability, Citability, Openness, Stewardship and Trustworthiness. As a further step in translating these principles to actual data workflows, the survey below serves as a template to frame exchanges around cultural heritage data by enabling Cultural Heritage Institutions, infrastructure providers and researchers to clarify their goals at the beginning of the project, to specify access to data, provenance information, preferred citation standards, hosting responsibilities etc., on the basis of which the parties can arrive at mutual reuse agreements that could serve as a starting point for a FAIR-by-construction data management, right from the project planning/application phase.
In practice, the survey below can be flexibly applied in platform-independent ways in exchange protocols between Cultural Heritage Institutions and researchers. Institutions who sign the Charter could use it (and expect to use such surveys) in their own exchange protocols. Another direction of future developments is to set up a platform dedicated to such exchanges. On the other hand, researchers are encouraged to contact the CHIs during the initial stages of their project in order to explain their plans and work out the details of the transaction together. This mutual declaration can later be a powerful component in their Data Management Plans, as it shows evidence for responsible and fair conduct of cultural heritage data, and fair (but also FAIR) research data management practices that are based on partnership with the holding institution. As enclosing a Research Data Management Plan with grant applications is becoming a more and more common requirement among research funders, we need to raise the funders’ awareness of the fact that such bi- or trilateral agreements and data reuse declarations among researchers, CHIs and infrastructure providers are crucial domain-specific components of FAIR data management.
Laurent Romary. 2019. Archives de hier et de demain. Preprint.

Alix Chagué, Victoria Le Fourner, Manuela Martini and Eric Villemonte de La Clergerie. 2019. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? Preprint.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. Preprint.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. Aiming to address this issue for French, we release CamemBERT, a French version of the Bidirectional Encoder Representations from Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.
Louis Martin, Benoît Sagot, Éric Villemonte de La Clergerie and Antoine Bordes. 2019. Controllable Sentence Simplification. Preprint.

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on parameters such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these parameters allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), increases the state of the art to 41.87 SARI on the WikiLarge test set, a +1.42 gain over previously reported scores.
Charlotte Rochereau, Benoît Sagot and Emmanuel Dupoux. 2019. Modeling German Verb Argument Structures: LSTMs vs. Humans. Preprint.

LSTMs have proven very successful at language modeling. However, it remains unclear to what extent they are able to capture complex morphosyntactic structures. In this paper, we examine whether LSTMs are sensitive to verb argument structures. We introduce a German grammaticality dataset in which ungrammatical sentences are constructed by manipulating case assignments (e.g. substituting nominative by accusative or dative). We find that LSTMs are better than chance in detecting incorrect argument structures and slightly worse than humans tested on the same dataset. Surprisingly, LSTMs are contaminated by heuristics not found in humans, like a preference toward nominative noun phrases. In other respects they show human-like results, like biases for particular orders of case assignments.
Jack Bowers. 2019. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. Preprint.

This work concerns an ongoing language documentation project covering the Mixtepec-Mixtec variety of Mixtec (ISO 639-3: mix). Mixtepec-Mixtec is an Otomanguean language spoken by roughly 9000-10000 people in the Juxtlahuaca district of Oaxaca, and parts of the Guerrero and Puebla states of Mexico, as well as by communities living in California, Oregon, Washington and Arkansas. Among the primary facets of the work are to: create an open source body of reusable and extensible multimedia language resources encoded in TEI XML; create multi-lingual translations (English and Spanish); annotate the content according to sound theoretical linguistic principles; use the above in order to further the knowledge of all aspects of the language itself within the fields of linguistics and lexicography by producing empirical corpus-based descriptions and analyses of various aspects of the language’s features; demonstrate and evaluate the application of encoding and description standards on a collection of lexical and knowledge resources for an under-resourced non-Indo-European language. In addition to providing a lasting and reusable set of resources for the MIX language, this work also aims to make strides towards bridging the gap between lexicography, language documentation, theoretical linguistics, computational linguistics and digital humanities.

2018

PhD theses and Habilitations

Benoît Sagot. 2018. Informatiser le lexique. Habilitation à diriger des recherches. Sorbonne Université.

Journal articles

Charles Riondet and Laurent Romary. 2018. The Standardization Survival Kit: for a Wider Use of Metadata Standards within Arts and Humanities. Archives et Bibliothèques de Belgique - Archief- en Bibliotheekwezen in België 106 pages 55–62. Archief.

Jack Bowers and Laurent Romary. 2018. Bridging the Gaps between Digital Humanities, Lexicography, and Linguistics: A TEI Dictionary for the Documentation of Mixtepec-Mixtec. Dictionaries: Journal of the Dictionary Society of North America 39 pages 79–106. Dictionary Society of North America.

This paper discusses the digital dictionary component in an ongoing language documentation project for the Mixtepec-Mixtec language (ISO 639-3: mix). Mixtepec-Mixtec (Sa'an Savi 'rain language') is an Otomanguean language spoken by roughly 9,000-10,000 people in the Juxtlahuaca district of Oaxaca, Mexico. Creating a digital dictionary for an under-resourced language entails a number of challenges that require unique and nuanced encoding solutions in which a delicate balance between the linguistic content, data structure, potential linked resources, and editorial metadata must be found. Herein we demonstrate how we use TEI to create a reusable, extensible, and machine-readable language resource, with an emphasis on how our solutions, using a combination of novel and established TEI dictionary structures, enable us to address our specific needs for Mixtepec-Mixtec and also provide a relevant roadmap for similar under-resourced language projects.
Laurent Romary and Charles Riondet. 2018. EAD-ODD: A solution for project-specific EAD schemes. Archival Science. Springer Verlag.

This article tackles the issue of integrating heterogeneous archival sources in one single data repository, namely the EHRI portal, whose aim is to support Holocaust research by providing online access to information about dispersed sources relating to the Holocaust (http://portal.ehri-project.eu). In this case, the problem at hand is to combine data coming from a network of archives in order to create an interoperable data space which can be used to search for, retrieve and disseminate content in the context of archival-based research. The central aspect of the work described in this paper is the assessment of the role of the Encoded Archival Description (EAD) standard as the basis for achieving the tasks described above. We have worked out how we could develop a real strategy of defining specific customizations of EAD that could be used at various stages of the process of integrating heterogeneous sources. We have developed a methodology based on a specification and customization method inspired by the extensive experience of the Text Encoding Initiative (TEI) community. In the TEI framework, one has the possibility to model specific subsets or extensions of the TEI guidelines while maintaining both the technical (XML schemas) and editorial (documentation) content within a single framework. This work leads us to anticipate that the method we have developed may be of wider interest within similar environments and also, we believe, for the future maintenance of the EAD standard.
Sacha Beniamine, Olivier Bonami and Benoît Sagot. 2018. Inferring inflection classes with description length. Journal of Language Modelling 5 pages 465–525. Institute of Computer Science, Polish Academy of Sciences, Poland.

We discuss the notion of an inflection class system, a traditional ingredient of the description of inflection systems of nontrivial complexity. We distinguish systems of microclasses, which partition a set of lexemes in classes with identical behavior, and systems of macroclasses, which group lexemes that are similar enough in a few larger classes. On the basis of the intuition that macroclasses should contribute to a concise description of the system, we propose one algorithmic method for inferring macroclasses from raw inflectional paradigms, based on minimisation of the description length of the system under a given strategy of identifying morphological alternations in paradigms. We then exhibit classifications produced by our implementation on French and European Portuguese conjugation data and argue that they constitute an appropriate systematisation of traditional classifications. To arrive at such a convincing systematisation, it was crucial for us to use a local approach to inflection class similarity (based on pairwise comparisons of paradigm cells) rather than a global approach (based on the simultaneous comparison of all cells). We conclude that it is indeed possible to infer inflectional macroclasses objectively.
Alba Málaga Sabogal and Serge Troubetzkoy. 2018. Infinite ergodic index of the Ehrenfest wind-tree model. Communications in Mathematical Physics 358 pages 995–1006. Springer Verlag.

The set of all possible configurations of the Ehrenfest wind-tree model endowed with the Hausdorff topology is a compact metric space. For a typical configuration we show that the wind-tree dynamics has infinite ergodic index in almost every direction. In particular, some ergodic theorems can be applied to show that if we start with a large number of initially parallel particles, their directions decorrelate as the dynamics evolve, answering the question posed by the Ehrenfests.

Conference proceedings

Jack Bowers and Philip Stöckle. 2018. TEI and Bavarian dialect resources in Austria: updates from the DBÖ and WBÖ. In Second workshop on Corpus-Based Research in the Humanities (CRH-2). 1 Gerastree proceedings. Vienna, Austria.

In our paper, we present a large historical database of Bavarian dialects (from the Dictionary of Bavarian Dialects in Austria) and its conversion from handwritten paper slips via TUSTEP into TEI-XML while elaborating on the topics discussed by Bowers [2] with regard to enhancement of its contents. While the original purpose of the digitization was to facilitate the writing of dictionary articles, our current TEI database will be used as a corpus from which materials are gathered both to write print dictionary articles and to serve as a basis for a web-based lexicographic information system. Herein we trace the different steps that have already been taken to create our current digital database from a legacy data collection, discuss the challenges we are still facing, and describe the approaches we are taking and considering to address such challenges.
Jack Bowers, Axel Herold and Laurent Romary. 2018. TEI-Lex0 Etym - towards terse(r) recommendations for the encoding of etymological information. In TEI Conference and Members' Meeting. Tokyo, Japan.

Jack Bowers and Laurent Romary. 2018. Encoding Mixtepec-Mixtec Etymology in TEI. In TEI Conference and Members' Meeting. Tokyo, Japan.

Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de la Clergerie, Benoît Sagot and Djamé Seddah. 2018. ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. pages 223–237. Association for Computational Linguistics. Brussels, Belgium.

During the last few years, Recurrent Neural Networks (RNN) have reached state-of-the-art performance on most sequence modeling problems. In particular, the sequence to sequence model and the neural CRF have proved very effective on this class of problems. In this paper we propose an alternative RNN for sequence labelling, based on label embeddings and memory networks, which makes it possible to take arbitrarily long contexts into account. Our results are better than those of state-of-the-art models in most cases, and close to them in all cases. Moreover, our solution is simpler than the best models in the literature. KEYWORDS: Recurrent neural networks, global context, sequence labelling.
Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie and Benoît Sagot. 2018. Reference-less Quality Estimation of Text Simplification Systems. In 1st Workshop on Automatic Text Adaptation (ATA). Tilburg, Netherlands.

The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.
Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de La Clergerie, Benoît Sagot and Djamé Seddah. 2018. ELMoLex: Connecting ELMo and Lexicon features for Dependency Parsing. In CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium.

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team 'ParisNLP' to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system 'ELMoLex'. In addition to incorporating character embeddings, ELMoLex leverages pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words, which are prevalent in languages with complex morphology. ELMoLex ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%). In an extrinsic evaluation setup, ELMoLex ranked 7th for Event Extraction, Negation Resolution tasks and 11th for Opinion Analysis task by F1 score.
Andrea Bertino, Luca Foppiano, Laurent Romary and Pierre Mounier. 2018. Leveraging Concepts in Open Access Publications. In PUBMET 2018 - 5th Conference on Scholarly Publishing in the Context of Open Science. Zadar, Croatia.

Aim: This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. This application, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI (Lopez et al., 2014) and provides automatic entity recognition and disambiguation against Wikipedia and Wikidata. Distributed with an open-source licence, it was deployed as a web service in the DARIAH infrastructure hosted at the French Huma-Num. Methods: In this paper, we focus on the specific issues related to its integration on five OA platforms specialized in the publication of scholarly monographs in social sciences and humanities as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). Results and Discussion: In the following sections, we give a brief overview of the current status and evolution of OA publications and how HIRMEOS aims to contribute to this. We then give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases together with some further possible ideas on how to exploit the generated annotations. Conclusions: We show that entity-fishing annotations can improve both research and publishing processes. Entity-fishing annotations can be used to achieve a better and quicker understanding of the specific and disciplinary language of certain monographs and so encourage non-specialists to use them. In addition, a systematic implementation of the entity-fishing service can be used by publishers to generate thematic indexes within book collections to allow better cross-linking and query functions.
Loïc Grobol, Frédéric Landragin and Serge Heiden. 2018. XML-TEI-URS: using a TEI format for annotated linguistic resources. In CLARIN Annual Conference 2018. Pisa, Italy.

This paper discusses XML-TEI-URS, a recently introduced TEI-compliant XML format for the annotation of referential phenomena in arbitrary corpora. We describe our experiments on using this format in different contexts, assess its perceived strengths and weaknesses, compare it with other similar efforts and suggest improvements to ease its use as a standard for the distribution of interoperable annotated linguistic resources.
Maëlle Brassier, Alexis Puret, Augustin Voisin-Marras and Loïc Grobol. 2018. Classification par paires de mention pour la résolution des coréférences en français parlé interactif (Mention-pair classification for corefence resolution on spontaneous spoken French). In Actes de la Conférence TALN. Volume 2 - Démonstrations, articles des Rencontres Jeunes Chercheurs, ateliers DeFT. pages 145–158. ATALA. Rennes, France.

This paper presents the first experiments conducted by our laboratory (LIFAT) on coreference resolution in spontaneous spoken French. We have developed a mention-pair classifier, trained on the ANCOR French coreference corpus, which is based on various classification techniques, including support vector machines (SVM). The paper details several experimental studies that investigate factors (classification model, degree of interactivity, nature of the coreference…) expected to affect the performance of the system.
Mohamed Khemakhem, Laurent Romary, Simon Gabay, Hervé Bohbot, Francesca Frontini and Giancarlo Luxardo. 2018. Automatically Encoding Encyclopedic-like Resources in TEI. In The annual TEI Conference and Members Meeting. Tokyo, Japan.

Mohamed Khemakhem, Carmen Brando, Laurent Romary, Frédérique Mélanie-Becquet and Jean-Luc Pinol. 2018. Fueling Time Machine: Information Extraction from Retro-Digitised Address Directories. In JADH2018 "Leveraging Open Data". Tokyo, Japan.

Djamé Seddah, Éric Villemonte de La Clergerie, Benoît Sagot, Hector Martinez Alonso and Marie Candito. 2018. Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation. pages 4535–4539. European Language Resources Association (ELRA). Miyazaki, Japan.

We present an efficient and accurate method for transferring annotations between two different treebanks of the same language. This method led to the creation of a new instance of the French Treebank (Abeillé et al., 2003), which follows the Universal Dependency annotation scheme and which was proposed to the participants of the CoNLL 2017 Universal Dependency parsing shared task (Zeman et al., 2017). Strong results from an evaluation on our gold standard (94.75% of LAS, 99.40% UAS on the test set) demonstrate the quality of this new annotated data set and validate our approach.
Benoît Sagot. 2018. A multilingual collection of CoNLL-U-compatible morphological lexicons. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA). Miyazaki, Japan.

We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoît Sagot, Djamé Seddah, Dima Taji and Reut Tsarfaty. 2018. CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA). Miyazaki, Japan.

Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.
Loïc Grobol, Isabelle Tellier, Éric de la Clergerie, Marco Dinarelli and Frédéric Landragin. 2018. ANCOR-AS: Enriching the ANCOR Corpus with Syntactic Annotations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA). Miyazaki, Japan.

This paper presents ANCOR-AS, an enriched version of the ANCOR corpus. This version adds syntactic annotations in addition to the existing coreference and speech transcription ones. This corpus is also released in a new TEI-compliant XML format.
Mohamed Khemakhem, Axel Herold and Laurent Romary. 2018. Enhancing Usability for Automatically Structuring Digitised Dictionaries. In Proceedings of the GLOBALEX workshop at LREC 2018. Miyazaki, Japan.

The last decade has seen a rapid development of the number of NLP tools which have been made available to the community. The usability of several e-lexicography tools represents a serious obstacle for researchers with little or no background in computer science. We present in this paper our efforts to overcome this issue in the case of a machine learning system for the automatic segmentation and semantic annotation of digitised dictionaries. Our approach is based on limiting the burdens of managing the tool's setup in different execution environments and lightening the complexity of the training process. We illustrate the possibility to reach this goal through the adaptation of existing functionalities and through using out of the box software deployment technology. We also report on the community's feedback after exposing the new setup to real users of different professional backgrounds.

Communications

Laurent Romary and Toma Tasovac. 2018. TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources. In TEI Conference and Members' Meeting. Tokyo, Japan.

Achieving consistent encoding within a given community of practice has been a recurrent issue for the TEI Guidelines. The topic is of particular importance for lexical data if we think of the potential wealth of content we could gain from pooling together the information available in the variety of highly structured, historical and contemporary lexical resources. Still, the encoding possibilities offered by the Dictionaries Chapter in the Guidelines are too numerous and too flexible to guarantee sufficient interoperability and a coherent model for searching, visualising or enriching multiple lexical resources. Following the spirit of TEI Analytics [Zillig, 2009], developed in the context of the MONK project, TEI Lex-0 aims at establishing a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such [Ermolaev and Tasovac, 2012] and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers. The format itself should not necessarily be one which is used for editing or managing individual resources, but one to which they can be univocally transformed to be queried, visualised, or mined in a uniform way. We are also aiming to stay as aligned as possible with the TEI subset developed in conjunction with the revision of the ISO LMF (Lexical Markup Framework) standard so that coherent design guidelines can be provided to the community (cf. [Romary, 2015]). The paper will provide an overview of the various domains covered by TEI Lex-0 and the main decisions that were taken over the last 18 months: constraining the general structure of a lexical entry; offering mechanisms to overcome the limits of <entry> when used in retro-digitized dictionaries (by allowing, for instance, <pc> and <lbl> as children of <entry>); systematizing the representation of morpho-syntactic information [Bański et al., 2017]; providing a strict <sense>-based encoding of sense-related information; deprecating <hom>; dealing with internal and external references in dictionary entries; providing more advanced encodings of etymology (see submission by Bowers, Herold and Romary); as well as defining technical constraints on the systematic use of @xml:id at different levels of the dictionary microstructure. The activity of the group has already led to changes in the Guidelines in response to specific GitHub tickets.
David Lindemann, Mohamed Khemakhem and Laurent Romary. 2018. Retro-digitizing and Automatically Structuring a Large Bibliography Collection. In European Association for Digital Humanities (EADH) Conference. Galway, Ireland.

Marie Puren, Alix Chagué, Manuela Martini, Éric Villemonte de La Clergerie and Charles Riondet. 2018. Creating gold data to understand the gender gap in the French textile trades (17th–20th century). Time-Us project. In Digital Humanities 2018: "Puentes/Bridges". Mexico, Mexico.

Marie Puren, Dorian Seillier, Charles Riondet and Lionel Tadjou. 2018. Le Standardization Survival Kit (SSK). In Rencontres de la TGIR Huma-Num. Ecully, France.

Marie Puren, Charles Riondet, Laurent Romary, Dorian Seillier and Lionel Tadjou. 2018. The Standardization Survival Kit (SSK). In Digital Humanities Benelux 2018. Amsterdam, Netherlands.

Romain Garnier and Benoît Sagot. 2018. New results on a centum substratum in Greek: the Lydian connection. In International Colloquium on Loanwords and Substrata in Indo-European languages. Limoges, France.

Benoît Sagot. 2018. A new PIE root *h1er ‘(to be) dark red, dusk red’: drawing the line between inherited and borrowed words for ‘red(ish)’, ‘pea’, ‘ore’, ‘dusk’ and ‘love’ in daughter languages. In International Colloquium on Loanwords and Substrata in Indo-European languages. Limoges, France.

Hajer Maraoui, Kais Haddar and Laurent Romary. 2018. Segmentation tool for hadith corpus to generate TEI encoding. In 4th International Conference on Advanced Intelligent Systems and Informatics (AISI'18). Cairo, Egypt.

A segmentation tool for a hadith corpus is necessary to prepare the TEI hadith encoding process. In this context, we aim to develop a tool allowing the segmentation of hadith text from the Sahih al-Bukhari corpus. To achieve this objective, we start by identifying different hadith structures. Then, we develop an automatic processing tool for hadith segmentation. This tool will be integrated into a prototype supporting the TEI encoding process. The experimentation and the evaluation of this tool are based on the Sahih al-Bukhari corpus. The results obtained are encouraging despite some flaws related to exceptional cases of hadith structure.
Hervé Bohbot, Francesca Frontini, Giancarlo Luxardo, Mohamed Khemakhem and Laurent Romary. 2018. Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré. In Proceedings of the GLOBALEX workshop at LREC 2018. Miyazaki, Japan.

This paper presents the Nénufar project, which aims to make several successive (free of copyright up to 1948) editions of the French Petit Larousse Illustré dictionary available in a digitised format. The corpus of digital editions will be made publicly available via a web-based querying interface, as well as distributed in a machine readable format, TEI-LEX0.

Book chapters

Tobias Blanke, Conny Kristel and Laurent Romary. 2018. Crowds for Clouds: Recent Trends in Humanities Research Infrastructures. In Cultural Heritage Digital Tools and Infrastructures. Routledge.

The humanities have convincingly argued that they need transnational research opportunities and that, through the digital transformation of their disciplines, they also have the means to pursue them on a hitherto unknown scale. The digital transformation of research and its resources means that many of the artifacts, documents, materials, etc. that interest humanities research can now be combined in new and innovative ways. Due to the digital transformations, (big) data and information have become central to the study of culture and society. Humanities research infrastructures manage, organise and distribute this kind of information and many more data objects as they become relevant for social and cultural research.

Tech reports

Laurent Romary, Eliza Papaki and Jennifer Edmond. 2018. DARIAH-EU Annual Report 2017. Technical report.

This is the DARIAH-EU Annual Report 2017, providing the main highlights of the DARIAH-EU activities in 2017.
Marie-Laurence Bonhomme. 2018. Répertoire des Notaires parisiens Segmentation automatique et reconnaissance d'écriture. Technical report.

The registers (répertoires) of Paris notaries held at the Archives nationales are among the collections most consulted by the public. Although they have been digitised and are available in the Salle des Inventaires Virtuelle, readers must still work through them methodically to make use of them, because these registers are not transcribed and therefore cannot be searched in full text. To make them easier to use as inventories of the notaries' minutes, and to enable new kinds of use, applying automatic handwriting recognition techniques to this large corpus seems particularly timely. The regular structure of the documents and a certain predictability of their contents are assets, whereas the many different hands encountered in the registers are a difficulty that cannot be ignored. An experimental phase produced encouraging results regarding the performance of automatic handwriting recognition on these documents, and suggested ways of improving it in a longer and more ambitious project.
Charles Riondet, Dorian Seillier, Lionel Tadjou and Laurent Romary. 2018. Standardization Survival Kit (Final). Technical report.

To support the digital evolution within Social Sciences and Humanities research, it is necessary to stabilize knowledge on standards and research good practices. The goal of the Standardization Survival Kit (SSK), developed within the PARTHENOS project, is to accompany researchers along this route, giving access to standards and best practices in a meaningful way, by the mediation of research scenarios. A research scenario is a (digital) workflow practiced by researchers, that can be repeatedly applied to a task that will help to gain material or insights in view of a research question. These scenarios are at the core of the SSK, as they embed resources with contextual information and relevant examples on standardized processes and methods in a research context. The SSK is an open tool where users are able to publish new scenarios or adapt existing ones. These scenarios can be seen as a living memory of what should be the best research practices in a given community, made accessible and reusable for other researchers.

Other

Laurent Romary. 2018. Fine tuning the interface between research and libraries: the data re-use charter.

We present how the European infrastructure DARIAH plays a central role in accompanying the swift evolution towards digital methods in the humanities. Beyond the main focus on best practices, technological developments and open science principles, we will zoom in on the current development of the data re-use charter in collaboration with other major institutions in Europe and see how it may further improve the relation between cultural heritage institutions, among which libraries play an important role, and research communities.
Marie Puren, Charles Riondet, Laurent Romary, Dorian Seillier and Lionel Tadjou. 2018. The SSK. Make your Arts and Humanities research go standard. TEI inside !

Guidelines and tools are easier to understand and use when presented through examples. The SSK provides a variety of standardized resources in a meaningful context, delivered by research use cases: the "scenarios". Full TEI: TEI is the SSK's underlying data model, designed to maximize scenario reuse and customization.
Detlef Reineke and Laurent Romary. 2018. SKOS and TBX vocabularies.

The present document is a compilation of the SKOS vocabulary and TBX-default data categories used for ongoing research on the comparison between the SKOS and TBX data models. The lists are derived from [W3C 2009a] and [W3C 2009b] for SKOS and a combination of [ISO 30042] and [LTAC 2018] for TBX.
Laurent Romary. 2018. Open Access in France: how the call of Jussieu reflects our social, technical and political landscape.

Our presentation is centred on the description of the French higher education and research environment and the way it has developed in the scientific and technical information domain, up to the recent presentation of the national open science policy by Minister Frédérique Vidal in July 2018. We will show that the focus on identifying a public service to universities and research institutions has led to the setting up of centralised and coordinated services and decision-making processes that have put us in a strong position in the context of the ongoing negotiations that are taking place with several monopolistic publishers. After a quick presentation of the general landscape, we will focus on the actual technical infrastructures that are currently in place and in particular how the existence of a national (cheap and efficient) publication repository has facilitated the development of a green open access orientation in the country. We will also present the general political support we have in this respect, with both the “Loi pour une République Numérique” (with articles related to open access and TDM) and the multi-institutional call of Jussieu aiming at providing a basis for an in-depth change of the publication eco-system. As an example of a concrete implementation of such policies, we will detail the open access strategy of Inria, which has had a deposit mandate in HAL for several years.
Aria Adli, Eric Engel, Laurent Romary and Fahime Same. 2018. A stand-off XML-TEI representation of reference annotation.

In this poster, we present an XML-TEI conformant stand-off representation of reference in discourse, building on the seminal work carried out in the MATE project (Poesio, Bruneseaux & Romary 1999) and the earlier proposal on a reference annotation framework in Salmon-Alt & Romary (2005). We make a three-way distinction between markables (the referring expressions), discourse entities (referents in the textual or extra-textual world), and links (relations that hold between referents, e.g., part-whole). Our approach differs from previous suggestions in that (i) inherent properties of the referent itself (e.g., animacy) are disentangled from the expressions used to refer to that referent, (ii) existing annotations from other layers such as morphosyntax are cleanly separated from the annotation of reference, but can be combined in queries and (iii) our proposal is integrated into the larger structure of existing TEI-ISO standards, thereby allowing for compatibility with existing TEI-encoded corpora and data sustainability. The workflow of adding reference annotations to an existing corpus will be demonstrated with concrete examples from ongoing work in the SFB 1252 (subprojects C01 and INF), where this representation of reference is the backbone for the annotation of (sentence) topic chains in dialogue data and for queries of topics in various grammatical constructions.
Herve Bohbot, Alexandre Faucher, Francesca Frontini, Agata Jackiewicz, Giancarlo Luxardo, Agnès Steuckardt, Mohamed Khemakhem and Laurent Romary. 2018. A Diachronic Digital Edition of the Petit Larousse illustré.

Marie Puren, Charles Riondet, Laurent Romary, Dorian Seillier and Lionel Tadjou. 2018. SSK by example. Make your Arts and Humanities research go standard.

Frédéric Landragin, Marine Delaborde, Yoann Dupont and Loïc Grobol. 2018. Description et modélisation des chaînes de référence. Le projet ANR Democrat (2016-2020) et ses avancées à mi-parcours.

The ANR project Democrat aims to advance research on the French language and its textual structure through the detailed, contrastive analysis of reference chains (successive instantiations of the same entity) in a diachronic corpus of texts written between the 9th and the 21st century and covering varied textual genres. It brings together researchers from the Lattice, LiLPa, ICAR and IHRIM laboratories. It was launched in March 2016, and most of the effort currently goes into the (manual) annotation of a corpus. Several annotation experiments were carried out in order to test different procedures. The procedure adopted alternates manual phases with automatic phases, in which scripts are run to complete the annotations.
Marie Puren, Charles Riondet, Dorian Seillier and Lionel Tadjou. 2018. The Standardization Survival Kit : For a wider use of standards within Arts and Humanities.

Charles Riondet. 2018. Stewardship of Cultural Heritage Data. In the shoes of a researcher.

Charles Riondet. 2018. TEI: de l'image au texte : Décrire son corpus grâce aux métadonnées.

Laurent Romary. 2018. Data Mining Technologies at the service of Open Knowledge.

The development of open access and open science principles has made it possible to have access to a wide variety of content online, which in turn can be seen as a wealth of reference information for further research. Information extraction and data mining technologies play an essential role in this respect and we present several projects and initiatives that have been carried out in the context of the EU infrastructure DARIAH as well as several national and European research projects. We will also identify the conditions for an adequate re-use of technologies and content, which involves standardisation, policies and training.

Preprints

Simon Gabay, Mohamed Khemakhem and Laurent Romary. 2018. Les catalogues et GROBID. Preprint.

Klaus Illmayer and Marie Puren. 2018. How to work together successfully with e-Humanities and e-Heritage Research Infrastructures (PARTHENOS Webinar). Preprint.

This webinar is dedicated to the phase of the research life cycle “Plan Research Project”. It first introduces the participants to an understanding of the advantages and practicalities of research collaboration in and with Research infrastructures. It then dives into details of project planning, touches upon the basics of the FAIR principles, and will focus especially on the importance of using standards in Digital Humanities and Cultural Heritage research and how to identify relevant standards for the participants’ own research. This webinar will give an introduction to the Standards Survival Kit that is developed within PARTHENOS. It will also cross-link to other materials developed within PARTHENOS and by the PARTHENOS Cluster Partners.
Charles Riondet. 2018. Traces de l'héroïsme. Le programme mémoriel de la résistance parisienne. Preprint.

The Liberation is probably the single event that left the most physical traces in Paris, traces that were moreover given protected status from the moment they were created, in a process of immediate heritage-making. This process took the form of an ambitious programme of public tributes, conceived and carried out by the Parisian Resistance, which held power within the provisional institutions and in government, but also acted as organised political groupings. However, this crystallisation of the Liberation has its roots in the clandestine period. A large part of the Resistance's martyrology was created under the Occupation, and Paris, besides being the city of the insurrection, was presented as the capital of the Resistance and the city of the 75,000 fusillés (those executed by firing squad) [1]. The public tributes reflect this history, and the insurrectionary period added a layer without necessarily erasing what came before [2]. This article outlines how this programme was put in place, before turning to one of its most visible manifestations: the commemorative plaques placed in Parisian public space. The clandestine Resistance and its dead: the first discreet public tributes. The tribute to the "patriotes fusillés" (patriots shot by firing squad) was the main memorial work of the clandestine Resistance. The hostage policy, introduced by the occupier after the first executions of soldiers by armed Resistance groups (mainly from the Jeunesses Communistes) in the summer of 1941, stirred the population; the Resistance kept this emotion alive by reporting these abuses in the clandestine press and on the London radio, contributing to the emergence of martyr figures.
The main ones, until the end of the Occupation though sometimes with regional variants, were Gabriel Péri, editor-in-chief of L'Humanité, shot on 15 December 1941 (as a hostage), and Honoré d'Estienne d'Orves, an agent of Free France, shot on 29 August 1941 (sentenced to death in May for espionage, but shot as a hostage). These two figures, along with Guy Môquet and, more generally, the hostages shot at Châteaubriant and Nantes on 22 October 1941, were quickly presented as martyrs and celebrated as such. However, for the Resistance, clandestine by nature, the practice of public tribute was problematic, since public space was off limits. The Resistance used three modes of expression to resolve this apparent contradiction. The clandestine press, especially the clandestine press … Notes: [1] MRN/7/Association des Familles de fusillés, 1945. [2] The return of the deportees and the discovery of the concentration camp world mark a clearer break, even though the generic, and therefore misleading, term "fusillés" remained for a time the one used for all Resistance victims.
Charles Riondet. 2018. À la recherche de l'archive clandestine. Preprint.

2017

Journal articles

Olivier Bonami and Benoît Sagot. 2017. Computational methods for descriptive and theoretical morphology: a brief introduction. Morphology 27 pages 1–7. Springer Verlag.

Romain Garnier and Benoît Sagot. 2017. A shared substrate between Greek and Italic. Indogermanische Forschungen 122 pages 29–60. De Gruyter.

The Greek lexicon is known for its significant proportion of words lacking a clear etymology. Previous attempts to explain these words range from the so-called “Pelasgian” hypotheses, which resort to an unattested satem Indo-European language, to Beekes’s (2010; 2014) non-Indo-European “Pre-Greek”. In this paper, we reconsider this long-disputed question, and adduce Latin and even Proto-Romance data to unveil a centum language which possibly served as the basis for borrowing in both Common Greek and, at a later date, Common Italic. We analyse several dozen difficult Greek and Italic words as borrowings from this newly identified language, for which we provide a set of phonetic laws that model its development from Proto-Indo-European. Important methodological strengths of our proposal include the systematic correspondence between Greek and Italic forms, the semantic plausibility of our etymologies, and their consistency with what is known about Proto-Indo-European word-formation patterns. Moreover, a computer implementation of these phonetic laws ensures its formal consistency and validates the chronological ordering we put forward. This is all the more important since most of our etymologies involve more than one of these phonetic laws, which is an additional confirmation of the plausibility of our proposal.
Jennifer Edmond, Frank Fischer, Michael Mertens and Laurent Romary. 2017. The DARIAH ERIC: Redefining Research Infrastructure for the Arts and Humanities in the Digital Age. ERCIM News ERCIM.

The Digital Research Infrastructure for the Arts and Humanities (DARIAH) was first conceptualised in late 2005 as a response to how this very different set of requirements was being addressed in the fast-moving environment of digitally-enhanced research. The infrastructure was later officially founded as a European Research Infrastructure Consortium (or ERIC) based in France, but with 17 national members contributing funds and in-kind contributions from their local digital humanities research communities. The knowledge base of the resulting network is further enhanced by contributions from funded research projects in which DARIAH is a partner, as well as the contributions of working groups, assembled by network members on a voluntary basis to address key gaps in infrastructural provision or key emerging challenges for the research communities.
Benoît Sagot. 2017. Représentation de l'information sémantique lexicale : le modèle wordnet et son application au français. Revue Française de Linguistique Appliquée XXII Paris : Publications linguistiques.

The wordnet model is the most widespread model for representing lexical semantics on the basis of an a priori sense inventory. Following the Princeton WordNet for English, wordnet-type resources have been developed for several dozen languages, including French, most often by means of automatic or semi-automatic techniques. In this article, we first review the characteristics and limitations of the wordnet model. We then survey the methods used for developing wordnets, before illustrating our own work in this area with the development of WOLF, the WOrdnet Libre du Français (free French wordnet).
Jack Bowers and Laurent Romary. 2017. Deep encoding of etymological information in TEI. Journal of the Text Encoding Initiative TEI Consortium.

This paper aims to provide a comprehensive modeling and representation of etymological data in digital dictionaries. The purpose is to integrate in one coherent framework both digital representations of legacy dictionaries, and also born-digital lexical databases that are constructed manually or semi-automatically. We want to propose a systematic and coherent set of modeling principles for a variety of etymological phenomena that may contribute to the creation of a continuum between existing and future lexical constructs, where anyone interested in tracing the history of words and their meanings will be able to seamlessly query lexical resources. Instead of designing an ad hoc model and representation language for digital etymological data, we will focus on identifying all the possibilities offered by the TEI guidelines for the representation of lexical information.

Conference proceedings

Charles Riondet and Luca Foppiano. 2017. History Fishing When engineering meets History. In Text as a Resource. Text Mining in Historical Science #dhiha7. Paris, France.

When doing research in a digital context, historians often face the same kind of problem: how to translate their research questions into a machine-readable format. Finding out how to annotate their material efficiently, and which techniques and methods to apply, are technical steps that tend to become everyday tasks for many Humanities researchers. The main issue is that their questions are too specific for generic tools to be applied as they are, while adapting such tools may require technical abilities that humanists rarely have. In such cases, interacting with experts from different backgrounds (e.g. information technologies, developers) can be difficult because of differing views, approaches to the problem, ways of reasoning, etc. This paper proposes a methodology, together with the research use case on which it is tested, in which close collaboration between historians and engineers allows for a better understanding of the needs of each party and helps create customizable tools that can be used in many different contexts and domains. The use case is to study and compare entities corresponding to the actors of conflicts in a corpus of personal diaries written during World War II. The discourses analyzed are characterized by multiple and very peculiar terminologies, often very ambiguous (like pejorative nicknames). The challenge is to apply generic tools (such as the named-entity recognition tool NERD and a POS tagger) and a domain-specific dictionary to this corpus, trying not to cross the thin line between generic customization and ad hoc development.
Piotr Bański, Jack Bowers and Tomaz Erjavec. 2017. TEI-Lex0 guidelines for the encoding of dictionary information on written and spoken forms. In Electronic Lexicography in the 21st Century: Proceedings of ELex 2017 Conference. Leiden, Netherlands.

Djamé Seddah and Marie Candito. 2017. Tour d'Horizon du French QuestionBank : Construire un Corpus Arboré de Questions pour le Français. In ACor4French - Les corpus annotés du français. Orléans, France.

We present the French QuestionBank, a treebank of 2600 questions annotated with dependency and phrase-based structures. Because two thirds of it is aligned with the English QuestionBank (Judge et al., 2006) and it is freely available, this treebank will prove useful for building robust NLP systems. We also discuss the development costs of such resources.
Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pages 44–53. Association for Computational Linguistics. Valencia, Spain.

Multitask learning (MTL) has been applied successfully to a range of tasks, mostly morphosyntactic. However, little is known about when MTL works and whether there are data characteristics that help to determine its success. In this paper we evaluate a range of semantic sequence labeling tasks in an MTL setup. We examine different auxiliary tasks, amongst which a novel setup, and correlate their impact to data-dependent conditions. Our results show that MTL is not always effective: significant improvements are obtained only for 1 out of 5 tasks. When successful, auxiliary tasks with compact and more uniform label distributions are preferable.
Matthieu Constant and Héctor Martínez Alonso. 2017. Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017). pages 181–186. Association for Computational Linguistics. Valencia, Spain.

This article evaluates the extension of a dependency parser that performs joint syntactic analysis and multiword expression identification. We show that, given sufficient training data, the parser benefits from explicit multiword information and improves overall labeled accuracy score in eight of the ten evaluation cases.
Héctor Martínez Alonso, Željko Agić, Barbara Plank and Anders Søgaard. 2017. Parsing Universal Dependencies without training. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pages 230–240. Association for Computational Linguistics. Valencia, Spain.

We propose UDP, the first training-free parser for Universal Dependencies (UD). Our algorithm is based on PageRank and a small set of head attachment rules. It features two-step decoding to guarantee that function words are attached as leaf nodes. The parser requires no training, and it is competitive with a delexicalized transfer system. UDP offers a linguistically sound unsupervised alternative to cross-lingual parsing for UD, which can be used as a baseline for such systems. The parser has very few parameters and is distinctly robust to domain change across languages.
Marie Candito, Bruno Guillaume, Guy Perrier and Djamé Seddah. 2017. Enhanced UD Dependencies with Neutralized Diathesis Alternation. In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017). pages 42–53. Linköping University Electronic Press. Pisa, Italy.

The 2.0 release of the Universal Dependency treebanks demonstrates the effectiveness of the UD scheme to cope with very diverse languages. The next step would be to get more out of syntactic analysis, and the "enhanced dependencies" sketched in the UD 2.0 guidelines are a promising attempt in that direction. In this work we propose to go further and enrich the enhanced dependency scheme along two axes: extending the cases of recovered arguments of non-finite verbs, and neutralizing syntactic alternations. Doing so leads to both richer and more uniform structures, while remaining at the syntactic level, and thus rather neutral with respect to the type of semantic representation that can be further obtained. We implemented this proposal in two UD treebanks of French, using deterministic graph-rewriting rules. Evaluation on a 200-sentence gold standard shows that deep syntactic graphs can be obtained from surface syntax annotations with a high accuracy. Among all arguments of verbs in the gold standard, 13.91% are impacted by syntactic alternation normalization, and 18.93% are additional deep edges.
Benoît Sagot. 2017. Extracting an Etymological Database from Wiktionary. In Electronic Lexicography in the 21st century (eLex 2017). pages 716–728. Leiden, Netherlands.

Electronic lexical resources almost never contain etymological information. The availability of such information, if properly formalised, could open up the possibility of developing automatic tools targeted towards historical and comparative linguistics, as well as significantly improving the automatic processing of ancient languages. We describe here the process we implemented for extracting etymological data from the etymological notices found in Wiktionary. We have produced a multilingual database of nearly one million lexemes and a database of more than half a million etymological relations between lexemes.
Benoît Sagot and Héctor Martínez Alonso. 2017. Improving neural tagging with lexical information. In Proceedings of the 15th International Conference on Parsing Technologies. pages 25–31. Association for Computational Linguistics. Pisa, Italy.

Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent, average improvement when using lexical information, even when also using character-based embeddings, thus showing the complementarity of the different sources of lexical information. The improvements are particularly important for the smaller datasets.
Sebastian Schuster, Éric Villemonte de La Clergerie, Marie D Candito, Benoît Sagot, Christopher D Manning and Djamé Seddah. 2017. Paris and Stanford at EPE 2017: Downstream Evaluation of Graph-based Dependency Representations. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation (EPE 2017). pages 47–59. Association for Computational Linguistics. Pisa, Italy.

We describe the STANFORD-PARIS and PARIS-STANFORD submissions to the 2017 Extrinsic Parser Evaluation (EPE) Shared Task. The purpose of this shared task was to evaluate dependency graphs on three downstream tasks. Through our submissions, we evaluated the usability of several representations derived from English Universal Dependencies (UD), as well as the Stanford Dependencies (SD), Predicate Argument Structure (PAS), and DM representations. We further compared two parsing strategies: directly parsing to graph-based dependency representations, and a two-stage process of first parsing to surface syntax trees and then applying rule-based augmentations to obtain the final graphs. Overall, our systems performed very well and our submissions ranked first and third. In our analysis, we find that the two-stage parsing process leads to better downstream performance, and that enhanced UD, a graph-based representation, consistently outperforms basic UD, a strict surface syntax representation, suggesting an advantage of enriched representations for downstream tasks.
Éric de La Clergerie, Benoît Sagot and Djamé Seddah. 2017. The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. pages 243–252. Association for Computational Linguistics. Vancouver, Canada.

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task's Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.
Héctor Martínez Alonso, Amaury Delamaire and Benoît Sagot. 2017. Annotating omission in statement pairs. In Proceedings of the 11th Linguistic Annotation Workshop. pages 41–45. Association for Computational Linguistics. Valencia, Spain.

In this piece of industrial application, we focus on the identification of omission in statement pairs for an online news platform. We compare three annotation schemes, namely two crowdsourcing schemes and an expert annotation. The simplest of the two crowdsourcing approaches yields a better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators' behaviour can be explained by straightforward linguistic features. However, for our task, we argue that expert and not crowdsourcing-based annotation is the best compromise between cost and quality.
Benoît Sagot. 2017. Construction automatique d'une base de données étymologiques à partir du wiktionary (Automatic construction of an etymological database using Wiktionary). In Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs. pages 169–181. ATALA. Orléans, France.

Electronic lexical resources almost never contain etymological information. The availability of such information, if properly formalised, would open up the possibility of developing automatic tools targeted towards historical and comparative linguistics, as well as significantly improving the automatic processing of ancient languages. We describe here the process we implemented for extracting etymological data from the etymological notices found in Wiktionary. We have produced a multilingual database of nearly one million lexemes and a database of more than half a million etymological relations between lexemes.
Loïc Grobol, Frédéric Landragin and Serge Heiden. 2017. Interoperable annotation of (co)references in the Democrat project. In Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13).

This paper proposes XML-TEI-URS, a generic TEI-based format for the annotation of coreferences in arbitrary corpora. This proposal is made in the context of Democrat, a French Agence Nationale de la Recherche project that aims to produce a large corpus of written French with coreference annotations, in an attempt to design a corpus that is usable both by humans and automated tools and as compatible as possible with future concurrent annotations.
Stefan Pernes, Laurent Romary and Kara Warburton. 2017. TBX in ODD: Schema-agnostic specification and documentation for TermBase eXchange. In LOTKS 2017- Workshop on Language, Ontology, Terminology and Knowledge Structures. Montpellier, France.

TermBase eXchange (TBX), the ISO standard for the representation and interchange of terminological data, is currently undergoing revision and will for the first time formalize overarching structural constraints regarding the definition and validation of dialects and XML styles. The paper describes the design of an ODD architecture, which allows for a complete specification of present-day TBX.
Hajer Maraoui, Kais Haddar and Laurent Romary. 2017. Encoding prototype of Al-Hadith Al-Shareef in TEI. In ICALP 2017 - The 6th International Conference on Arabic Language Processing. page 14. Fes, Morocco.

The standardization of Al-Hadith Al-Shareef can guarantee interoperability and interchangeability with other textual sources and takes the processing of the Al-Hadith corpus to a higher level. Still, research on Hadith corpora has not previously considered standardization as a real objective, especially with respect to standards such as TEI (Text Encoding Initiative). In this context, we aim at the standardization of Al-Hadith Al-Shareef on the basis of the TEI guidelines. To achieve this objective, we elaborated a TEI model that we customized for Hadith structure. Then we developed a prototype allowing the encoding of Hadith text. This prototype analyses Hadith texts and automatically generates a standardized version of the Hadith in TEI format. The evaluation of the TEI model and the prototype is based on a Hadith corpus collected from Sahih Bukhari. The results obtained were encouraging despite some flaws related to exceptional cases of Hadith structure.
Géraldine Walther and Benoît Sagot. 2017. Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. pages 89–94. Association for Computational Linguistics. Vancouver, Canada.

In this paper, we present ongoing work for developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are designed to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.
Loïc Grobol, Isabelle Tellier, Éric de La Clergerie, Marco Dinarelli and Frédéric Landragin. 2017. Apports des analyses syntaxiques pour la détection automatique de mentions dans un corpus de français oral (Experiences in using deep and shallow parsing to detect entity mentions in oral French). In Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts. pages 200–208. ATALA. Orléans, France.

We present three experiments in detecting entity mentions in the corpus of oral French ANCOR, using publicly available parsing tools and state-of-the-art mention detection techniques used in coreference detection, anaphora resolution and Entity Detection and Tracking systems. While the tools we use are not specifically designed to deal with oral French, our results are comparable to those of state-of-the-art end-to-end systems for other languages. We also mention several ways to improve our results for future work in developing an end-to-end coreference resolution system for French, to which these experiments could be a baseline for mention detection.
Mohamed Khemakhem, Luca Foppiano and Laurent Romary. 2017. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. In Proceedings of eLex 2017 conference. pages 598–613. Leiden, Netherlands.

An important number of digitized lexical resources remain unexploited due to their unstructured content. Manually structuring such resources is a costly task given their multifold complexity. Our goal is to find an approach to automatically structure digitized dictionaries, independently from the language or the lexicographic school or style. In this paper we present a first version of GROBID-Dictionaries, an open source machine learning system for lexical information extraction. Our approach is twofold: we perform a cascading structure extraction, while we select at each level specific features for training. We followed a “divide to conquer” strategy to dismantle text constructs in a digitized dictionary, based on the observation of their layout. Main pages (see Figure 1) in almost any dictionary share three blocks: a header (green), a footer (blue) and a body (orange). The body is, in its turn, constituted by several entries (red). Each lexical entry can be further decomposed (see Figure 2) as: form (green), etymology (blue), sense (red) or/and related entry. The same logic could be applied further for each extracted block but in the scope of this paper we focus just on the first three levels. The cascading approach ensures a better understanding of the learning process’s output and consequently simplifies the feature selection process. Limiting each level to a few mutually exclusive text blocks helps significantly in diagnosing the cause of prediction errors. It allows an early detection and replacement of irrelevant selected features that can bias a trained model.
In such a segmentation, it becomes more straightforward to notice that, for instance, the token position in the page is very relevant to detect headers and footers and has almost no pertinence for capturing a sense in a lexical entry, which is very often split over two pages. To implement our approach, we took up the available infrastructure from GROBID [7], a machine learning system for the extraction of bibliographic metadata. GROBID adopts the same cascading approach and uses Conditional Random Fields (CRF) [6] to label text sequences. GROBID-Dictionaries is planned to output a TEI-compliant encoding [2, 9] where the various segmentation levels are associated with an appropriate XML tessellation. Collaboration with COST ENeL is ongoing to ensure maximal compatibility with existing dictionary projects. Our experiments so far justify our choices: models for the first two levels, trained on two different dictionary samples, have given high precision and recall with a small amount of annotated data. Relying mainly on the text layout, we tried to diversify the selected features for each model, on the token and line levels. We are working on tuning features and annotating more data to maintain the good results with new samples and to improve the third segmentation level. While just a few task-specific attempts [1] have used machine learning in this research direction, the landscape remains dominated by rule-based techniques [4, 3, 8], which are ad hoc and costly, even impossible, to adapt to new lexical resources.

Communications

Anne Baillot, Anna Busch, Marie Puren, Michael Mertens and Laurent Romary. 2017. Nachhaltigkeit durch Zusammenschluss: Die DARIAH Data Re-Use Charter. In DHd 2017. Bern, Switzerland.

Mathias Seuret, Daniel Stökl Ben Ezra and Marcus Liwicki. 2017. Robust Heartbeat-based Line Segmentation Methods for Regular Texts and Paratextual Elements. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing. pages 71–76. Kyoto, Japan.

Charles Riondet and Luca Foppiano. 2017. GROBID for Humanities When engineering meets History. In Text as a Resource. Text Mining in Historical Science. Paris, France.

In this presentation we explore the relationship between humanists and computer scientists and the crucial need for scientific crossover, based on our experience in the interdisciplinary team ALMAnaCH at Inria, which gathers people with very different backgrounds. We focus on the use and development of the GROBID suite. GROBID is a tool initially built for extracting metadata from scientific articles. Over the years it has evolved with new features and been used in new domains (for example archival documents), with the help of specialists in the field concerned. In our example, we use it to identify mentions corresponding to the actors of armed conflicts in historical personal diaries. This is a Named Entity Recognition task made more complex by the specialized terminology of the period (Second World War) and by the constraints on writing (clandestinity).
Anne Baillot, Marie Puren, Charles Riondet, Dorian Seillier and Laurent Romary. 2017. Access to cultural heritage data. A challenge for digital humanities. In Digital Humanities 2017. Montréal, Canada.

Access to high quality Cultural Heritage data and metadata is the condition for reliable, performant and verifiable research in many arts and humanities fields. One of the core challenges of giving access to Cultural Heritage data is often the lack of connection between local GLAM Institutions, infrastructures and research. The initiative we present in this paper addresses this issue by bringing together several supra-national infrastructures. These infrastructures are currently developing a common online environment that will allow all the relevant actors to connect and improve access to Cultural Heritage data. The “Cultural Heritage Data Reuse Charter” offers a comprehensive framework regarding all aspects relevant to cooperations revolving around access to and reuse of Cultural Heritage data.

Books

Charles Riondet. 2017. Le Comité parisien de la Libération (1943-1945). page 304. Presses Universitaires de Rennes.

Book chapters

Daniel Stökl Ben Ezra. 2017. The Mishnah into French. In Studies in Mishnaic Hebrew and Related Dialects : Proceedings of the Yale Symposium, May 2014. pages 349–367. The Program in Judaic Studies, Yale University.

Romain Garnier, Laurent Sagart and Benoît Sagot. 2017. Milk and the Indo-Europeans. In Language Dispersal Beyond Farming. pages 291–311. John Benjamins Publishing Company.

the Yamnaya culture, often regarded as the bearer of the Proto-Indo-European language, underwent a strong population expansion in the late 4th and early 3rd millennia BCE. It suggests that the underlying reason for that expansion might be the then unique capacity to digest animal milk in adulthood. We examine the early Indo-European milk-related vocabulary to confirm the special role of animal milk in Indo-European expansions. We show that Proto-Indo-European did not have a specialized root for ‘to milk’ and argue that the IE root *h2melg̑- ‘to milk’ is secondary and post-Anatolian. We take this innovation as an indication of the novelty of animal milking in early Indo-European society. Together with a detailed study of language-specific innovations in this semantic field, we conclude that the ability to digest milk played an important role in boosting Proto-Indo-European demography.

Tech reports

Laurent Romary. 2017. DARIAH-EU Annual Report 2016. Technical report.

This is the DARIAH-EU Annual Report 2016, providing the main highlights of the DARIAH-EU activities in 2016.
Dorian Seillier, Anne Baillot, Marie Puren and Charles Riondet. 2017. Survey on researchers requirements and practices towards Cultural Heritage institutions. Technical report.

Arts and Humanities research is mainly based on the analysis of "human traces" (such as artefacts, pieces of art, written documents, audio and video recordings, photographs, etc.) that are most of the time preserved by Cultural Heritage Institutions (or CHIs). To preserve, study and promote these objects, an increasing amount of heterogeneous digital data is produced by CHIs, but also by Arts and Humanities scholars themselves. For instance, digital Cultural Heritage data include natively digital documents (like qualitative or quantitative datasets, digital photographs, transcriptions, etc.), digitized resources of all kinds (such as scanned texts, digitized images or 3D models, etc.), but also attached metadata, annotations or further enrichments. In this regard, access to high quality Cultural Heritage data and metadata is essential to ensure high quality research in Arts and Humanities. Data sharing is indeed a key issue in the future development of Digital Humanities. According to this vision, enabling access to and promoting the reuse of Cultural Heritage data are thus crucial to create new collaboration and new research, facilitate the development of an open publishing environment for Arts and Humanities research, and reinforce the adoption of digital methods and workflows amongst researchers. However, in their relations with CHIs, scholars always seem to face the same recurring problem: "There is no generally valid rule as to how they can quote, duplicate and furthermore republish in their scholarly work". This situation is clearly a hindrance for both CHIs and scholars, to develop new research on the one hand, and on the other to gain visibility. It is now essential to tackle this issue by providing a clear and comprehensive framework that enables interactions between CHIs and Arts and Humanities scholars.
Laurent Romary, Piotr Banski, Jack Bowers, Emiliano Degl'innocenti, Matej Ďurčo, Roberta Giacomi, Klaus Illmayer, Adeline Joffres, Fahad Khan, Mohamed Khemakhem, Nicolas Larrousse, Antonis Litke, Monica Monachini, Annelies Van Nispen, Maciej Ogrodniczuk, Nikolaos Papadakis, Graziella Pastore, Stefan Pernes, Marie Puren, Charles Riondet, Mikel Sanz, Maurizio Sanesi, Panayiotis Siozos and Reinier De Valk. 2017. Report on Standardization (draft). Technical report.

The present report reflects the second stage of the definition of the Standardisation Survival Kit (SSK) within Work Package 4 of the PARTHENOS project. On the basis of the various user scenarios presented in Deliverable 4.1, where each stage of the research process has been annotated with the standards that are actually needed in order to fulfill the research task, we present here a systematic review of the activities that have to be carried out to provide support to researchers in using, but also contributing to, these standards.
Pierre Alliez, Laurent Bergerot, Jean-François Bernard, Clotilde Boust, George Bruseker, Nicola Carboni, Mehdi Chayani, Matteo Dellepiane, Nicolo Dell'Unto, Bruno Dutailly, Hélène Gautier, Gabriele Guidi, Anaïs Guillem, Adeline Joffres, Florent Laroche, Adeline Manuel, Maria Cristina Manzetti, Alain P Michel, Anthony Pamart, Jean Ponce, Marie Puren, Charles Riondet, Karina Rodriguez Echavarria, Laurent Romary, Roberto Scopigno and Sarah Tournon-Valiente. 2017. Digital 3D Objects in Art and Humanities: challenges of creation, interoperability and preservation. White paper. Technical report.

With this White Paper, which gathers contributions from more than 25 experts of 3D imaging, modelling and processing, as well as professionals concerned with the interoperability and sustainability of research data, the PARTHENOS project aims at laying the foundations of a comprehensive environment centered on the researchers' practices concerning 3D digital objects. The topics addressed in the document are meant to help to ensure the development of standardized good practices relating to the production, the handling, the long-term conservation and the reuse of 3D objects. Therefore, even if the focus is put on technical questions (formats, processing, and annotation), the White Paper also identifies the need to clarify the legal status of 3D objects, in order to facilitate their reuse(s) in non-research contexts, in particular in Museums.
Charles Riondet, Laurent Romary, Annelies van Nispen, Kepa Joseba Rodriguez and Mike Bryant. 2017. Report on Standards. Technical report.

This document describes mechanisms by which interoperability of data is ensured through the use of standards. The standards we cover are both domain-related (the archival standards in XML formats such as EAD, EAC-CPF and EAG) and transversal standards whose use is recommended in the context of any digital project, in particular the ISO standards for the representation of languages, scripts and countries. Interoperability of archival descriptions expressed in EAD is made possible by the specification of a specific EAD profile for EHRI. This profile is built and maintained using the TEI-ODD framework, which is explained in the first section of the report. Interoperability and reusability of EHRI resources is also ensured by the design of more consistent URLs, composed with standardised methods and using ISO reference codes. This design has to be seen as a first step towards a persistent identifier system. The work initiated in WP11 and presented in this document will be continued, enhanced and developed by other EHRI work packages: WP7 Virtual Access to EHRI Virtual Observatory, WP10 Resource Identification and Integration Workflows and WP13 Research Data Infrastructures for Holocaust Material.

Other

Jack Bowers and Laurent Romary. 2017. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Eckhard Bick, Victoria Bobicev, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Aljoscha Burchardt, Marie Candito, Gauthier Caron, Gülşen Cebiroğlu Eryiğit, Giuseppe G. A. Celano, Savas Cetin, Fabricio Chalub, Jinho Choi, Silvie Cinková, Çağri Çöltekin, Miriam Connor, Elizabeth Davidson, Marie-catherine De Marneffe, Valeria De Paiva, Arantza Diaz De Ilarraza, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Ali Elkahky, Tomaž Erjavec, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo K. Gojenola, Memduh Gökirmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Nizar Habash, Jan Hajič, Jan Hajič Jr., Linh Hà Mỹ, Kim Harris, Dag Haug, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Radu Ion, Elena Irimia, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşikara, Hiroshi Kanayama, Jenna Kanerva, Tolga Kayadelen, Václava Kettnerová, Jesse Kirchner, Natalia Kotsyba, Simon Krek, Veronika Laippala, Lorenzo Lambertino, Tatiana Lando, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan Mcdonald, Gustavo Mendonça, Niko Miekka, Anna Missilä, Cătălin Mititelu, Yusuke Miyao, 
Simonetta Montemagni, Amir More, Laura Moreno Romero, Shinsuke Mori, Bohdan Moskalevskyi, Kadri Muischnek, Kaili Müürisep, Pinkey Nainwani, Anna Nedoluzhko, Gunta Nešpore-bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Hanna Nurmi, Stina Ojala, Petya Osenova, Robert Östling, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Cenel-augusto Perez, Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily Pitler, Barbara Plank, Martin Popel, Lauma Pretkalniņa, Prokopis Prokopidis, Tiina Puolakainen, Sampo Pyysalo, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Larissa Rinaldi, Laura Rituma, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Benoît Sagot, Shadi Saleh, Tanja Samardžić, Manuela Sanguinetti, Baiba Saulīte, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Dmitry Sichinava, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Dima Taji, Takaaki Tanaka, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz L. Uria, Hans Uszkoreit, Sowmya Vajjala, Daniel Van Niekerk, Gertjan Van Noord, Viktor Varga, Éric Villemonte de La Clergerie, Veronika Vincze, Lars Wallin, Jonathan North Washington, Mats Wirén, Tak-sum Wong, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman and Hanzhi Zhu. 2017. Universal Dependencies 2.1.

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). http://hdl.handle.net/11234/1-2515
Laurent Romary and Charles Riondet. 2017. Ongoing maintenance and customization of archival standards using ODD (EAC-CPF revision proposal).

The EAC-CPF tag library is natively expressed using the TEI (Text Encoding Initiative) guidelines and maintained collaboratively on GitHub. This solution has already proven to offer some flexibility. Starting from this, we propose to go one step further and create a complete maintenance framework of EAC-CPF based on the technical means provided by the TEI guidelines. A well-documented framework: the Text Encoding Initiative (TEI) is broadly recognized as the de facto standard for the representation of a variety of textual content expressed in digital form, but the TEI can be used to represent a wider range of digital resources. For instance, the TEI XML schema and the associated guidelines are maintained with the TEI format, more precisely, with a subset called "One Document Does it all" (ODD) which, as the name indicates, is a description language that "includes the schema fragments, prose documentation, and reference documentation [...] in a single document", based on the principles of literate programming. Literate programming is a programming and documentation methodology whose "central tenet is that documentation is more important than source code and should be the focus of a programmer's activity". With ODD, semantic and structural consistency is ensured as we encode and document best practices in both machine and human-readable format. ODD was created at first to give TEI users a straightforward way to customize the TEI schema according to their own practices and document this customization. It is possible to describe a schema and the associated documentation of any XML format. In the context of the EHRI project (ehri-project.eu), ODD was used to fully encode the EAD standard, as well as the guidelines provided by the Library of Congress.
The maintenance on a GitHub repository also offers great possibilities to collectively discuss potential issues, enhancements, etc.
Laurent Romary and Andreas Degkwitz. 2017. IFLA Satellite Meeting - Digital Humanities–Opportunities and Risks: Connecting Libraries and Research.

Laurent Romary. 2017. The Text Encoding Initiative as an Infrastructure.

Marie Puren, Charles Riondet, Dorian Seillier and Laurent Romary. 2017. The Standardization Survival Kit (SSK).

The H2020 project PARTHENOS (Pooling Activities Resources and Tools for Heritage E-research Networking, Optimization and Synergies) aims at strengthening the cohesion of European research in Arts and Humanities. The involved infrastructures have to address the new challenges posed by the increasing amount of digital contents and tools, and to support the emergence of a next generation of digitally aware scholars in the Arts and Humanities. Various reports and statements, like Riding the wave in 2010, have thus acknowledged the growing importance of developing a data-centered strategy for the management of scientific data. In this context, standardization becomes a necessity for researchers in Arts and Humanities. The poster will present the work carried out for a) the Standardization Survival Kit (SSK), an overlay platform dedicated to promoting a wider use of standards within Arts and Humanities; and b) a specific awareness package articulated around the “Why standards?” leaflet to make scholars understand the essential role of standardized methods and content for the reusability of research results. The SSK is a comprehensive interface aiming at providing documentation and resources concerning standards. It covers three types of activities related to the deployment and use of standards in the Arts and Humanities scholarship: documenting existing standards by providing reference materials; fostering the adoption of standards; and communicating to research communities. The SSK is designed as a comprehensive interface to guide scholars through all available resources, on the basis of reference scenarios identified since the beginning of the project. The design intends to provide a single entry point for novice or advanced scholars in the domain of digital methods, so that they can have quick access to the information needed for managing digital content, or applying the appropriate method in a scholarly context.
Take, for example, a scholar with few technical skills who wishes to work on textual resources with digital tools. The SSK will let her/him explore reference scenarios, enabling her/him to easily discover new materials concerning standards (e.g. TEI P5 and its subsets), such as bibliographic references, tutorials, prototypical examples, and transformation stylesheets. It will thus accompany her/him throughout her/his project, from the transcription of the primary documents to their publication online. The poster will also present the “Why standards?” leaflet, a key element of the SSK’s awareness package. The leaflet integrates a short text and a cartoon, and aims to be the point of entry to the SSK, but also to the PARTHENOS website and the helpdesk. This awareness-raising cartoon targets scholars with few technical skills: it is designed to communicate the necessity of standardization in the scientific world, in a way that will appeal to a wide audience and give a more modern and less off-putting image of standards.
Laurent Romary and Jennifer Edmond. 2017. Sustainability in DARIAH.

Notes jotted in the context of the DARIAH-DE discussion on sustainability.
Laurent Romary. 2017. How to Open up? (Digital) Libraries at the Service of (Digital) Scholars.

The talk presents the perspective of an organisation, Inria, that has made a strong move towards open science and the dissemination of digital content. We will consider the consequences for the development of new services within our institution and show how we move from the development of collections out of externally-provided content to the shaping of a scientist-centred digital library. We will also analyse the technological impact of this policy, as well as the necessary evolution of the role and skills of our library staff.

Preprints

Veerle Vanden Daelen, Jennifer Edmond, Petra Links, Mike Priddy, Linda Reijnhoudt, Václav Tollar, Annelies van Nispen, Charlotte Hauwaert and Charles Riondet. 2017. La publication durable digitale des guides d'archives de l'histoire du 20ème siècle. Preprint.