Publications

Explore our publications on the HAL archive

2024

PhD theses and Habilitations

Tu Anh Nguyen. 2024. Spoken Language Modeling from Raw Audio. PhD thesis. Sorbonne Université.

Speech has always been a dominant mode of social connection and communication. However, speech processing and modeling have been challenging due to its variability. Classic speech technologies rely on cascade modeling, i.e. transcribing speech to text with an Automatic Speech Recognition (ASR) system, processing the transcribed text using Natural Language Processing (NLP) methods, and converting the text back to speech with a Speech Synthesis model. This method eliminates speech variability but requires large textual datasets, which are not always available for all languages. In addition, it removes all the expressivity contained in the speech itself. Recent advancements in self-supervised speech learning (SpeechSSL) have enabled the learning of good discrete speech representations from raw audio, bridging the gap between speech and text technologies. This makes it possible to train language models on discrete representations (discrete units, or pseudo-text) obtained from the speech and has given rise to a new domain called TextlessNLP, where the task is to learn the language directly on audio signals, bypassing the need for ASR systems. The so-called Spoken Language Models (Speech Language Models, or SpeechLMs) have been shown to work and offer new possibilities for speech processing compared to cascade systems. The objective of this thesis is thus to explore and improve this newly formed domain. We analyze why these discrete representations work, discover new applications of SpeechLMs to spoken dialogues, extend TextlessNLP to more expressive speech, and improve the performance of SpeechLMs to reduce the gap between SpeechLMs and TextLMs.
Paul-Ambroise Duquenne. 2024. Sentence Embeddings for Massively Multilingual Speech and Text Processing. PhD thesis. Sorbonne Université.

Representation learning of sentences has been widely studied in NLP. While many works have explored different pre-training objectives to create contextual representations from sentences, several others have focused on learning sentence embeddings for multiple languages with the aim of closely encoding paraphrases and translations in the sentence embedding space. In this thesis, we first study how to extend text sentence embedding spaces to the speech modality in order to build a multilingual speech/text sentence embedding space. Next, we explore how to use this multilingual and multimodal sentence embedding space for large-scale speech mining. This allows us to automatically create alignments between written and spoken sentences in different languages. For high similarity thresholds in the latent space, aligned sentences can be considered as translations. If the alignments involve written sentences on one side and spoken sentences on the other, then these are potential speech-to-text translations. If the alignments involve spoken sentences on both sides, then these are potential speech-to-speech translations. To validate the quality of the mined data, we train speech-to-text and speech-to-speech translation models. We show that adding the automatically mined data significantly improves the quality of the learned translation models, demonstrating the quality of the alignments and the usefulness of the mined data. Then, we study how to decode these sentence embeddings into text or speech in different languages. We explore several methods for training decoders and analyze their robustness to modalities/languages not seen during training, to evaluate cross-lingual and cross-modal transfer. We demonstrate that we can perform zero-shot cross-modal translation in this framework, achieving translation results close to systems learned in a supervised manner with a cross-attention mechanism. The compatibility between speech/text representations from different languages enables these very good performances, despite an intermediate fixed-size representation. Finally, we develop a new state-of-the-art massively multilingual speech/text sentence embedding space, named SONAR, based on conclusions drawn from the first two projects. We study different objective functions to learn such a space and analyze their impact on the organization of the space as well as on the ability to decode these representations. We show that such a sentence embedding space outperforms previous state-of-the-art methods for both cross-lingual and cross-modal similarity search as well as decoding capabilities. This new space covers 200 written languages and 37 spoken languages. It also offers text translation results close to the NLLB system on which it is based, and speech translation results competitive with the supervised Whisper system. We also present SONAR EXPRESSIVE, which introduces an additional representation encoding non-semantic speech properties, such as vocal style or expressivity of speech.

Journal articles

Lucie Chenain, Rachid Riad, Nicolas Fraisse, Cécilia Jubin, Graça Morgado, Katia Youssov, Marine Lunven and Anne-Catherine Bachoud-Levi. 2024. Graph methods to infer spatial disturbances: Application to Huntington's Disease's speech. Cortex 176 pages 144–160. Elsevier.

Objective: Huntington's Disease (HD) is an inherited neurodegenerative disease caused by the mutation of the Htt gene, impacting all aspects of living and functioning. Among cognitive disabilities, spatial capacities are impaired, but their monitoring remains scarce, limited by lengthy expert assessments. Language offers an alternative medium to evaluate patients' performance in HD. Yet, its capacity to assess HD's spatial abilities is unknown. Here, we aimed to bring proof-of-concept that HD's spatial deficits can be assessed through speech. Methods: We developed the Spatial Description Model to graphically represent spatial relations described during the Cookie Theft Picture (CTP) task. We increased the sensitivity of our model by using only sentences with spatial terms, unlike previous studies in Alzheimer's disease. 78 carriers of the mutant Htt, including 56 manifest and 22 premanifest individuals, as well as 25 healthy controls, were included from the BIOHD (NCT01412125) and Repair-HD (NCT03119246) cohorts. The convergence and divergence of the model were validated using the SelfCog battery. Results: Our Spatial Description Model was the only one of the four assessed approaches to reveal that individuals with manifest HD expressed fewer spatial relations and engaged in less spatial exploration compared to healthy controls. Their graphs correlated with both visuospatial and language SelfCog performances, but not with motor, executive or memory functions. Conclusions: We provide the proof-of-concept, using our Spatial Description Model, that language can capture HD patients' spatial disturbances. By adding spatial capabilities to the panel of functions tested through language, it paves the way for eventual remote clinical application.
Cyril Chhun, Fabian M. Suchanek and Chloé Clavel. 2024. Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation. Transactions of the Association for Computational Linguistics 12 pages 1122–1142. MIT Press. Cambridge, MA.

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.
Aina Garí Soler, Matthieu Labeau and Chloé Clavel. 2024. The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations. Transactions of the Association for Computational Linguistics 12 pages 299–320. MIT Press. Cambridge, MA.

When deriving contextualized word representations from language models, a decision needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other interesting findings, that the quality of representations of words that are split is often, but not always, worse than that of the embeddings of known words. Their similarity values, however, must be interpreted with caution.
Ana Salgado, Laurent Romary, Rute Costa, Toma Tasovac, Anas Fahad Khan, Margarida Ramos, Bruno Almeida, Sara Carvalho, Mohamed Khemakhem, Raquel Silva and Boris Lehečka. 2024. The Morais Dictionary: Following Best Practices in a Retro-digitized Dictionary Project. International Journal of Humanities and Arts Computing 18 pages 125 – 147. Edinburgh University Press.

This article outlines essential best practices for retro-digitized dictionary projects, using the ongoing MORDigital project (DOI 10.54499/PTDC/LLT-LIN/6841/2020) as a case study. The MORDigital project focuses on digitally transforming the historically significant Portuguese Morais dictionary’s first three editions (1789, 1813, 1823). While the primary objective is to create faithful digital versions of these renowned dictionaries, MORDigital stands out by going beyond the mere adoption of established best practices. Instead, it reflects on the choices made throughout the process, providing insights into the decision-making process. The key topics emphasized include (1) the establishment of a robust data model; (2) the refinement of metadata; (3) the implementation of consistent identifiers; and (4) the enhancement of encoding techniques; additionally exploring the issue of structuring domain labelling. The article aims to contribute to the ongoing discourse on best practices in retro-digitized dictionary projects and their implications for data preservation and knowledge organization.

Conference proceedings

Armel Zebaze, Benoît Sagot and Rachel Bawden. 2024. Tree of Problems: Improving structured problem solving with compositionality. In EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing pages 18028–18047. Miami, FL, United States.

Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems (ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks. All code for this paper is publicly available here: https://github.com/ArmelRandy/tree-of-problems.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, Mariya Shmatova, Steinþór Steingrímsson and Vilém Zouhar. 2024. Findings of the WMT24 General Machine Translation Shared Task: The LLM Era is Here but MT is Not Solved Yet. In WMT 2024 - Ninth Conference on Machine Translation. pages 1–46. Miami, Florida, United States.

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
Mariana Neves, Cristian Grozea, Philippe Thomas, Roland Roller, Rachel Bawden, Aurélie Névéol, Steffen Castle, Vanessa Bonato, Giorgio Maria Di Nunzio, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova and Antonio Jimeno Yepes. 2024. Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level. In Proceedings of the Ninth Conference on Machine Translation. pages 124–138. Association for Computational Linguistics. Miami, Florida, USA.

We present the results of the ninth edition of the Biomedical Translation Task at WMT'24. We released test sets for six language pairs, namely, French, German, Italian, Portuguese, Russian, and Spanish, from and into English. Each test set consists of 50 abstracts from PubMed. Unlike in previous years, we did not split abstracts into sentences. We received submissions from five teams, covering almost all language directions. We used a baseline/comparison system based on Llama 3.1 and share the source code at https://github.com/cgrozea/wmt24biomed-ref.
Karim El Haff, Wissam Antoun, Agnès Braud, Florence Le Ber and Véronique Pitchon. 2024. Building and Assessing a Named Entity Recognition Resource for Ancient Pharmacopeias. In ECAI 2024 - 27th European Conference on Artificial Intelligence. Volume 392, pages 2354–2361. IOS Press. Santiago de Compostela, Spain.

This research revolves around utilising Named Entity Recognition (NER) to analyse and categorise data from English translations of pharmacopeias from the Abbasid era, noted for its valuable contributions to science and medicine. The main goal of this work, along with publishing this resource freely, is to assess cross-manuscript NER performance by evaluating the NER model's performance on unseen corpora and translation styles, as well as demonstrating the transferability of the NER task on such corpora. Two distinct experiments were conducted, focusing on differences in F1-scores when mixing source translators and when varying training dataset sizes. In the experiments mixing translator styles, training on a mix of all available styles while accounting for dataset size yielded the best F1-scores, even compared to training on the same style as the testing data, while the experiments on dataset sizes show diminishing returns from scaling training datasets compared to varying translation styles. This work attempts to enhance the exploration of the medical knowledge embodied in these texts to facilitate their analysis for knowledge extraction relevant to modern medical practices. Furthermore, this research demonstrates strategies to optimise NER results in this context, forming a juncture between digitising historical information and enabling further explorations in pharmacopeia-related Natural Language Processing research.
Anh Ngo, Dirk Heylen, Nicolas Rollet, Catherine Pelachaud and Chloé Clavel. 2024. Exploration of Human Repair Initiation in Task-oriented Dialogue: A Linguistic Feature-based Approach. In SIGDIAL 2024 - 25th Meeting of the Special Interest Group on Discourse and Dialogue. pages 603–609. Kyoto, Japan.

In daily conversations, people often encounter problems prompting conversational repair to enhance mutual understanding. By employing an automatic coreference solver, alongside examining repetition, we identify various linguistic features that distinguish turns when the addressee initiates repair from those when they do not. Our findings reveal distinct patterns that characterize the repair sequence and each type of other-repair initiation.
Hugo Scheithauer and Laurent Romary. 2024. Experimenting With Generic Recognition Systems for Kuzushiji Documents: Furigana Extraction as a Use-Case. In JADH2024 - 13th Conference of the Japanese Association for Digital Humanities «Leveraging AI and Digital Humanities for Sustainable Infrastructure». Tokyo, Japan.

Simon Gabay and Thibault Clérice. 2024. The birth of French orthography. A computational analysis of French spelling systems in diachrony. In CHR2024 - Computational Humanities Research Conference. Aarhus, Denmark.

The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in linguistics, for two reasons. On the one hand, spelling is made up of micro-changes, which requires a quantitative approach; on the other hand, no corpus is available, due to the interventions of editors in almost all the texts already available. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and study their frequency to trace (ortho)graphic change during the 17th century.
Benjamin Kiessling and Thibault Clérice. 2024. Does Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts. In CHR2024 - Computational Humanities Research Conference. Aarhus, Denmark.

The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections. In 2024, the CATMuS Medieval dataset was released, featuring extensive diachronic coverage and a variety of languages and script types. Previous research indicated that model performance degraded on the best manuscripts over time as more data was incorporated, likely due to over-generalization. This paper investigates the impact of incorporating contextual metadata in training HTR models using the CATMuS Medieval dataset to mitigate this effect. Our experiments compare the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models utilizing semantic contextual tokens (Century, Script, Language) outperform baseline models, particularly on challenging manuscripts. The study underscores the importance of metadata in enhancing model accuracy and robustness across diverse historical texts.
Rachel Bawden, Hatim Bourfoune, Bertrand Cabot, Nathan Cassereau, Pierre Cornette, Marco Naguib and François Yvon. 2024. Evaluer BLOOM en français. In Proceedings of the 2024 Atelier sur l'évaluation des modèles génératifs (LLM) et challenge d'extraction d'information few-shot. Toulouse, France.

The development of very large language models, capable of performing multiple tasks, requires developing the necessary infrastructure to evaluate these models, ideally covering as many facets as possible. Numerous benchmarks have already been compiled for English, making it possible to precisely gauge models' ability to process this language. In this paper, we present our own efforts to assemble a multi-task evaluation set for French, which is then used to evaluate models from the BLOOM family. Our results complement the main evaluation results for BLOOM in English; they suggest that the performance obtained in French and English is very similar, and even better when the prompts used for in-context inference are in the same language as the texts to analyze.
Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel and Fabian Suchanek. 2024. MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 4810–4845. Association for Computational Linguistics. Mexico City, Mexico.

We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Suppa, Hila Gonen, Joseph Marvin Imperial, Börje Karlsson, Peiqin Lin, Nikola Ljubešić, Lester James Miranda, Barbara Plank, Arij Riabi and Yuval Pinter. 2024. Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 4322–4337. Association for Computational Linguistics. Mexico City, Mexico.

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 19 datasets annotated with named entities in a cross-lingually consistent schema across 13 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.
Anh Ngo, Chloé Clavel, Catherine Pelachaud and Nicolas Rollet. 2024. Multimodal models of repair in social human-agent interactions. In Proceedings of the 2024 Workshop Affect, Compagnons Artificiels et Interactions (WACAI 2024). Bordeaux, France.

People often encounter troubles in everyday conversations, prompting them to initiate repairs, which are various approaches employed to recognize and resolve those problems, fostering mutual understanding across conversational turns. However, maintaining a smooth interaction remains challenging for Conversational Agents (CAs), which are dialogue systems designed to simulate conversation with humans (including chatbots, social robots, and virtual assistants). To foster seamless human-agent interaction, the CA should be able to recognize repairs initiated by humans, utilize multimodal cues, and participate in the repair process. This article, which is an overview of our thesis research project, outlines our ongoing efforts to accomplish this objective. The initial phase involves analyzing repair phenomena in human-human interactions.
Nathaniel Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Stutzman, Bismarck Odoom, Sanjeev Khudanpur, Stephen Richardson and Kenton Murray. 2024. Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 3083–3110. Association for Computational Linguistics. Mexico City, Mexico.

A majority of language technologies are tailored for a small number of high-resource languages, while many low-resource languages are neglected. One such group, Creole languages, has long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.
Lauriane Aufrant and Lucie Chasseur. 2024. UkraiNER: A New Corpus and Annotation Scheme towards Comprehensive Entity Recognition. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 16941–16952. ELRA and ICCL. Torino, Italia.

Named entity recognition as it is traditionally envisioned excludes in practice a significant part of the entities of potential interest for real-world applications: nested, discontinuous, non-named entities. Despite various attempts to broaden their coverage, subsequent annotation schemes have achieved little adoption in the literature and the most restrictive variant of NER remains the default. This is partly due to the complexity of those annotations and their format. In this paper, we introduce a new annotation scheme that offers higher comprehensiveness while preserving simplicity, together with an annotation tool to implement that scheme. We also release the corpus UkraiNER, comprised of 10,000 French sentences in the geopolitical news domain and manually annotated with comprehensive entity recognition. Our baseline experiments on UkraiNER provide a first point of comparison to facilitate future research (82 F1 for comprehensive entity recognition, 87 F1 when focusing on traditional nested NER), as well as various insights on the composition and challenges that this corpus presents for state-of-the-art named entity recognition models.
Simon Meoni, Éric De la Clergerie and Théo Ryffel. 2024. Generating Synthetic Documents with Clinical Keywords: A Privacy-Sensitive Methodology. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024. pages 115–123. ELRA and ICCL. Torino, Italia.

Electronic Health Records (EHR) store valuable patient-staff interaction data. These notes, often unstructured to save healthcare personnel time, can be challenging to analyze manually. Proprietary online LLMs have demonstrated impressive results in analyzing EHR notes. However, Clinical NLP faces unique challenges due to the sensitive and specialized nature of the data. Sending patient information via external APIs poses privacy risks, and hospitals require customized NLP systems to align with their practices. Developing customized LLMs using specific training datasets is crucial to address these challenges. We propose generating synthetic training data using keywords extracted without confidential information. Furthermore, we introduce a reward mechanism that iteratively refines the quality of synthetic documents. This involves scoring synthetic candidates against real clinical reports using a semantic textual similarity score and performing an alignment step to align the model with its best-scored utterances.
Arij Riabi, Menel Mahamdi, Virginie Mouilleron and Djamé Seddah. 2024. Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks. In Proceedings of the Fifth Workshop on Privacy in Natural Language Processing. pages 123–136. Association for Computational Linguistics. Bangkok, Thailand.

Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.
Léo Labat and Lauriane Aufrant. 2024. Évaluation de l'apport des chaînes de coréférences pour le liage d'entités. In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position. pages 397–409. ATALA and AFPC. Toulouse, France.

This work revisits entity linking approaches in light of the closely related task of coreference resolution. We observe various configurations (supported by examples) in which the rest of the coreference chain can provide useful cues to improve disambiguation. Guided by these theoretical motivations, we conduct an error analysis together with oracle experiments, which confirm the potential of strategies combining predictions within the coreference chain (up to 4.3 F1 on coreferent mentions in English). We then sketch a first proof of concept of vote-based combination, exploring different weighting heuristics, which yields modest but interpretable gains.
Ziqian Peng, Rachel Bawden and François Yvon. 2024. À propos des difficultés à traduire automatiquement de longs documents. In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position. pages 2–21. ATALA and AFPC. Toulouse, France.

New machine translation architectures are able to process long segments and to outperform sentence-level translation, suggesting the possibility of translating entire documents. Achieving this requires overcoming a number of difficulties related to the length of the documents to be translated. In this study, we discuss document-level translation from the perspective of evaluation, trying to answer a simple question: how can we measure whether translation performance degrades as documents get longer? Our analyses, which evaluate encoder-decoder systems and a large language model with several metrics on a scientific-document translation task, suggest that translating long documents in a single block remains a difficult problem.
You Zuo, Kim Gerdes, Éric Clergerie and Benoît Sagot. 2024. PatentEval: Understanding Errors in Patent Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pages 2687–2710. Association for Computational Linguistics. Mexico City, Mexico.

In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis and Chloé Clavel. 2024. The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text. In Findings of the Association for Computational Linguistics: NAACL 2024. pages 3589–3604. Association for Computational Linguistics. Mexico City, Mexico.

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
Nathan Godey, Éric Clergerie and Benoît Sagot. 2024. Anisotropy Is Inherent to Self-Attention in Transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). pages 35–48. Association for Computational Linguistics. St. Julian's, Malta.

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.
Jesujoba O. Alabi and Rachel Bawden. 2024. Exploring Inline Lexicon Injection for Cross-Domain Transfer in Neural Machine Translation. In Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation. pages 7–20. European Association for Machine Translation. Sheffield, United Kingdom.

Domain transfer remains a challenge in machine translation (MT), particularly concerning rare or unseen words. Amongst the strategies proposed to address the issue, one of the simplest and most promising in terms of generalisation capacity is coupling the MT system with external resources such as bilingual lexicons and appending inline annotations within source sentences. This method has been shown to work well for controlled language settings, but its usability for general language (and ambiguous) MT is less certain. In this article we explore this question further, testing the strategy in a multi-domain transfer setting for German-to-English MT, using the mT5 language model fine-tuned on parallel data. We analyse the MT outputs and design evaluation strategies to understand the behaviour of such models. Our analysis using distractor annotations suggests that although improvements are not systematic according to automatic metrics, the model does learn to select appropriate translation candidates and ignore irrelevant ones, thereby exhibiting more than a systematic copying behaviour. However, we also find that the method is less successful in a higher-resource setting with a larger lexicon, suggesting that it is not a magic solution, especially when the baseline model is already exposed to a wide range of vocabulary.
Rachel Bawden, Ziqian Peng, Maud Bénard, Éric Clergerie, Raphaël Esamotunu, Mathilde Huguin, Natalie Kübler, Alexandra Mestivier, Mona Michelot, Laurent Romary, Lichao Zhu and François Yvon. 2024. Translate your Own: a Post-Editing Experiment in the NLP domain. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1). pages 431–443. European Association for Machine Translation (EAMT). Sheffield, UK.

The improvements in neural machine translation make translation and post-editing pipelines ever more effective for a wider range of applications. In this paper, we evaluate the effectiveness of such a pipeline for the translation of scientific documents (limited here to article abstracts). Using a dedicated interface, we collect, then analyse the post-edits of approximately 350 abstracts (English→French) in the Natural Language Processing domain for two groups of post-editors: domain experts (academics encouraged to post-edit their own articles) on the one hand and trained translators on the other. Our results confirm that such pipelines can be effective, at least for high-resource language pairs. They also highlight the difference in the post-editing strategy of the two subgroups. Finally, they suggest that working on term translation is the most pressing issue to improve fully automatic translations, but that in a post-editing setup, other error types can be equally annoying for post-editors.
Maxime Guénette, Mathilde Verstraete, Marcello Vitali-Rosati and Alix Chagué. 2024. Transcrire un manuscrit en grec ancien. In Humanistica 2024. Meknès, Morocco.

This contribution presents the results of our experiments in training an automatic transcription (HTR) model for Ancient Greek, based on a training corpus built from the Heidelbergensis Palatinus graecus 23 using the eScriptorium/Kraken software environment. This Byzantine manuscript, dating from the end of the 10th century, is a key witness for the Greek epigram, as it is the main source of the Palatine Anthology. Its clear structure and careful handwriting make it an ideal candidate for training a model for Ancient Greek.
Simon Gabay, Thibault Clérice, Pauline Jacsont, Elina Leblanc, Marie Jeannot-Tirole, Sonia Solfrini, Sophie Dolto, Floriane Goy, Carmen Carrasco Luján, Maddalena Zaglio, Myriam Perregaux, Juliette Janes, Benoît Sagot, Rachel Bawden, Rasul Dent, Oriane Nédey and Alix Chagué. 2024. Reconnaissance des écritures dans les imprimés. In Humanistica 2024. Meknès, Morocco.

Optical character recognition (OCR) has achieved notable successes on handwritten documents and early printed books in recent years, but these types of documents remain marginal in the textual production available today. In order to offer researchers effective models covering a wider range of cases, we have designed a new general-purpose model capable of handling printed documents, both historical and contemporary, written in a variety of languages. Several architectures are evaluated in order to compare their respective performance in terms of character error rate, but also of inference time.
Nathan Godey, Éric de la Clergerie and Benoît Sagot. 2024. On the Scaling Laws of Geographical Representation in Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 12416–12422. ELRA and ICCL. Torino, Italia.

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
Maria Dermentzi and Hugo Scheithauer. 2024. Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. In Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024. pages 18–28. ELRA and ICCL. Torino, Italia.

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.
Sarah Bénière, Floriane Chiffoleau and Laurent Romary. 2024. TEI Specifications for a Sustainable Management of Digitized Holocaust Testimonies. In Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024. pages 10–17. ELRA and ICCL. Torino, Italia.

Data modeling and standardization are central issues in the field of Digital Humanities, and all the more so when dealing with Holocaust testimonies, where stable preservation and long-term accessibility are key. The EHRI Online Editions are composed of documents of diverse nature (testimonies, letters, diplomatic reports, etc.), held by EHRI’s partnering institutions, and selected, gathered thematically and encoded according to the TEI Guidelines by the editors within the EHRI Consortium. Standardization is essential in order to make sure that the editions are consistent with one another. The issue of consistency also encourages a broader reflection on the usage of standards when processing data, and on the standardization of digital scholarly editions of textual documents in general. In this paper, we present the normalization work we carried out on the EHRI Online Editions. It includes a customization of the TEI adapted to Holocaust-related documents, and a focus on the implementation of controlled vocabulary. We recommend the use of these encoding specifications as a tool for researchers and/or non-TEI experts to ensure their encoding is valid and consistent across editions, but also as a mechanism for integrating the edition work smoothly within a wider workflow leading from image digitization to publication.
Fahad Khan, Maxim Ionov, Christian Chiarcos, Laurent Romary, Gilles Sérasset and Besim Kabashi. 2024. On Modelling Corpus Citations in Computational Lexical Resources. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 12385–12394. ELRA and ICCL. Torino, Italia.

In this article we look at how two different standards for lexical resources, TEI and OntoLex, deal with corpus citations in lexicons. We will focus on how corpus citations in retrodigitised dictionaries can be modelled using each of the two standards since this provides us with a suitably challenging use case. After looking at the structure of an example entry from a legacy dictionary, we examine the two approaches offered by the two different standards by outlining an encoding for the example entry using both of them (note that this article features the first extended discussion of how the Frequency Attestation and Corpus (FrAC) module of OntoLex deals with citations). After comparing the two approaches and looking at the advantages and disadvantages of both, we argue for a combination of both. In the last part of the article we discuss different ways of doing this, giving our preference for a strategy which makes use of RDFa.
Rian Touchent and Éric de la Clergerie. 2024. CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 2692–2701. ELRA and ICCL. Torino, Italia.

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.
Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot and Rachel Bawden. 2024. When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 17544–17556. ELRA and ICCL. Torino, Italia.

Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare approach performances by resource range, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.
Lydia Nishimwe, Benoît Sagot and Rachel Bawden. 2024. Making Sentence Embeddings Robust to User-Generated Content. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 10984–10998. ELRA and ICCL. Torino, Italia.

NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data by achieving up to 2x and 11x better scores. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
Biswesh Mohapatra, Seemab Hassan, Laurent Romary and Justine Cassell. 2024. Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 3967–3977. ELRA and ICCL. Torino, Italia.

Successful conversations often rest on common understanding, where all parties are on the same page about the information being shared. This process, known as conversational grounding, is crucial for building trustworthy dialog systems that can accurately keep track of and recall the shared information. The proficiencies of an agent in grounding the conveyed information significantly contribute to building a reliable dialog system. Despite recent advancements in dialog systems, there exists a noticeable deficit in their grounding capabilities. Traum provided a framework for conversational grounding introducing Grounding Acts and Grounding Units, but substantial progress, especially in the realm of Large Language Models, remains lacking. To bridge this gap, we present the annotation of two dialog corpora employing Grounding Acts, Grounding Units, and a measure of their degree of grounding. We discuss our key findings during the annotation and also provide a baseline model to test the performance of current Language Models in categorizing the grounding acts of the dialogs. Our work aims to provide a useful resource for further research in making conversations with machines better understood and more reliable in natural day-to-day collaborative dialogs.
Seth Aycock and Rachel Bawden. 2024. Topic-guided Example Selection for Domain Adaptation in LLM-based Machine Translation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. pages 175–195. Association for Computational Linguistics. St. Julian's, Malta.

Current machine translation (MT) systems perform well in the domains on which they were trained, but adaptation to unseen domains remains a challenge. Rather than fine-tuning on domain data or modifying the architecture for training, an alternative approach exploits large language models (LLMs), which are performant across NLP tasks especially when presented with in-context examples. We focus on adapting a pre-trained LLM to a domain at inference through in-context example selection. For MT, examples are usually randomly selected from a development set. Some more recent methods though select using the more intuitive basis of test source similarity. We employ topic models to select examples based on abstract semantic relationships below the level of a domain. We test the relevance of these statistical models and use them to select informative examples even for out-of-domain inputs, experimenting on 7 diverse domains and 11 language pairs of differing resourcedness. Our method outperforms baselines on challenging multilingual out-of-domain tests, though it does not match performance with strong baselines for the in-language setting. We find that adding few-shot examples and related keywords consistently improves translation quality, that example diversity must be balanced with source similarity, and that our pipeline is overly restrictive for example selection when a targeted development set is available.
Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, Matthias Gille-Levenson, Olivier Brisville-Fertin, Franz Fischer, Michaels Gervers, Agnès Boutreux, Avery Manton, Simon Gabay, Patricia O'Connor, Wouter Haverals, Mike Kestemont, Caroline Vandyck and Benjamin Kiessling. 2024. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. In Proceedings of the 2024 International Conference on Document Analysis and Recognition (ICDAR). Athens, Greece.

The surge in digitisation initiatives by Cultural Heritage institutions has facilitated online accessibility to numerous historical manuscripts. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains challenging. We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for medieval manuscripts, which offers (1) a uniform framework for annotation practices for medieval manuscripts, and a benchmarking environment (2) for evaluating automatic text recognition models across multiple dimensions thanks to rich metadata (century of production, language, genre, script, etc.), (3) for other tasks (such as script classification or dating approaches), and (4) for exploratory work pertaining to computer vision and digital paleography around line-based tasks, such as generative approaches. Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible dataset spanning more than 200 manuscripts and incunabula in 10 different languages, comprising over 160,000 lines of text and 5 million characters ranging from the 8th century to the 16th. The dataset's consistency in transcription approaches aims to mitigate challenges arising from the diversity in standards for medieval manuscript transcriptions, providing a comprehensive benchmark for evaluating HTR models on historical sources.
Wissam Antoun, Benoît Sagot and Djamé Seddah. 2024. From Text to Source: Results in Detecting Large Language Model-Generated Content. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 7531–7543. ELRA and ICCL. Torino, Italia.

The widespread use of Large Language Models (LLMs), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. Addressing these concerns necessitates the development of robust methods to detect and attribute text generated by LLMs. This paper investigates "Cross-Model Detection," by evaluating whether a classifier trained to distinguish between source LLM-generated and human-written text can also detect text from a target LLM without further training. The study comprehensively explores various LLM sizes and families, and assesses the impact of conversational fine-tuning techniques, quantization, and watermarking on classifier generalization. The research also explores Model Attribution, encompassing source model identification, model family, and model size classification, in addition to quantization and watermarking detection. Our results reveal several key findings: a clear inverse relationship between classifier effectiveness and model size, with larger LLMs being more challenging to detect, especially when the classifier is trained on data from smaller models. Training on data from similarly sized LLMs can improve detection performance from larger models but may lead to decreased performance when dealing with smaller models. Additionally, model attribution experiments show promising results in identifying source models and model families, highlighting detectable signatures in LLM-generated text, with particularly remarkable outcomes in watermarking detection, while no detectable signatures of quantization were observed. Overall, our study contributes valuable insights into the interplay of model size, family, and training data in LLM detection and attribution.
Thibault Clerice. 2024. Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pages 4772–4783. ELRA and ICCL. Torino, Italia.

In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of 70.60% and 86.33% respectively using HAN. We evaluate the impact of the dataset size on model performance (420 sentences instead of 2013), and show that, while our models perform worse, they still offer high enough precision and TPR, even without MLM, of 69% and 51% respectively. Given these results, we provide an analysis of the attention mechanism as added value to support humanists in producing more data.

Communications

Simon Gabay, Ariane Pinche, Peter Nahon, Alix Chagué, Pauline Jacsont, Élodie Paupe, Jean-Claude Rebetez, Maxime Humeau, Christine Payot, Thibault Maillard, Yvan Jauregui, Elina Leblanc and Loraine Chappuis. 2024. Vers un modèle diachronique pour les mains modernes françaises. In Humanistica 2024. Meknès, Morocco.

For the French-speaking domain, manuscripts written after the Middle Ages remain the last type of document that is not properly handled by optical character recognition engines. While models have already been published, their effectiveness and their documentation remain unsatisfactory, largely because of the problems posed by the significant graphic evolution (in the palaeographic as well as the linguistic sense) that the language has undergone over the centuries, and hence by the diversity of forms to be processed. After a brief description of the philological problem, we therefore offer some initial reflections on the transcription of modern documents, as well as a new model to improve researchers' working conditions until a truly satisfactory solution can be designed.
Sarah Bénière, Hugo Scheithauer, Juliette Janes and Laurent Romary. 2024. An ODD Schema for a Sustainable Encoding of Catalog Objects. In TEI 2024–Texts, Languages and Communities. Buenos Aires, Argentina.

Sales catalogs are a valuable resource for art historians, as they bear witness to the circulation of works of art. The organization of information within the catalogs is consistent and structured, which makes them interesting material for automatic processing tasks. This long paper proposal presents our reflection on structuring the content of sales catalogs in TEI-XML. This consideration is part of a wider reflection within the framework of the DataCatalogue research project (Inria, BnF, INHA) on an automated workflow for processing sales catalogs from digitization to publication.
Alix Chagué. 2024. McCATMuS : retours sur la production d'un méta-dataset multilingue et multiséculaire. In Le patrimoine archivistique face au virage numérique. Rimouski, Canada.

Alix Chagué. 2024. FAIRer transcriptions: HTR-United and the possibility of a common for training data. In Horizons of digital philology. Naples, Italy.

Alix Chagué. 2024. Initiation to Handwritten Text Recognition with eScriptorium. In Horizons of digital philology. Naples, Italy.

Alix Chagué and Hugo Scheithauer. 2024. Do (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts. In CSDH/SCHN 2024: Sustaining Shared Futures. Montréal, Canada.

We present an experiment conducted on the augmentation of older grayscale datasets designed for automatic text recognition on contemporary handwriting (IAM-Database). The augmentation method relies on the addition of colored backgrounds taken from real-world historical blank pages and allows us to create an enhanced version of the IAM-Database. We train various transcription models, varying the composition of the train and validation sets using the original and enhanced IAM-Database. We test the resulting models against the original and enhanced test sets, as well as a test set composed of real-world historical documents. We find that, although the transcription engine proves robust to color changes, this technique could be used to bring older grayscale datasets up to speed and create transcription models that perform well on historical handwriting. Additionally, we consider the environmental costs of using enhanced data as opposed to the original dataset, and find that the impact is minor.
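The following minimal PIL sketch shows one plausible way to composite a grayscale handwriting line onto a colored historical background, in the spirit of the augmentation described above; the file names are hypothetical and this is not the authors' released code.

```python
# Sketch: paste a grayscale handwriting line onto a colored "blank page" background.
from PIL import Image, ImageChops

line = Image.open("iam_line.png").convert("L")              # grayscale handwriting line (hypothetical file)
background = Image.open("historical_blank_page.jpg").convert("RGB")
background = background.resize(line.size)

# Multiply the background by the grayscale line: white paper lets the colored
# background show through, while dark ink stays dark.
line_rgb = Image.merge("RGB", (line, line, line))
augmented = ImageChops.multiply(background, line_rgb)
augmented.save("iam_line_augmented.png")
```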
Sarah Bénière, Floriane Chiffoleau and Hugo Scheithauer. 2024. Streamlining the Creation of Holocaust-related Digital Editions with Automatic Tools. In EHRI Academic Conference - Researching the Holocaust in the Digital Age. Warsaw, Poland.

Alix Chagué, Floriane Chiffoleau and Hugo Scheithauer. 2024. Collaboration and Transparency: A User-Generated Documentation for eScriptorium. In DH2024 Reinvention & Responsibility. Washington D. C., United States.

Floriane Chiffoleau and Hugo Scheithauer. 2024. Leveraging EHRI Online Editions for training automated edition tools. In EHRI Workshop Natural Language Processing Meets Holocaust Archives. Prague, Czech Republic.

Thibault Clérice and Malamatenia Vlachou-Efstathiou. 2024. The CATMuS initiative: building large and diverse corpora for handwritten text recognition. In DH AI Seminar 2024 - Digital Humanities / Artificial Intelligence. Paris, France.

Thibault Clérice. 2024. Distributed Texts Services: Présentation. In Journées Biblissima+: Partager, décloisonner, réutiliser : outiller la recherche et développer de nouveaux usages. Aubervilliers, France.

Hugo Scheithauer, Sarah Bénière and Laurent Romary. 2024. Automatic retro-structuration of auction sales catalogs layout and content. In DH2024 - Reinvention and Responsibility. Washington DC, United States.

This paper showcases a pipeline for automatically retro-structuring auction sales catalogs, based on document layout analysis and information extraction technologies. Structured layout and textual data are then transformed into TEI XML for publication. It also advocates for a generalized use of layout segmentation in digitization pipelines.
Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Laurent Romary and Benoît Sagot. 2024. Layout Analysis Dataset with SegmOnto. In DH2024 - Annual conference of the Alliance of Digital Humanities Organizations. Washington DC, United States.

Ariane Pinche, Thibault Clérice, Alix Chagué, Jean-Baptiste Camps, Malamatenia Vlachou-Efstathiou, Matthias Gille Levenson, Olivier Brisville-Fertin, Federico Boschetti, Franz Fischer, Michael Gervers, Agnès Boutreux, Avery Manton, Simon Gabay, Wouter Haverals, Mike Kestemont, Caroline Vandyck and Patricia O'Connor. 2024. CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts. In Digital Humanities - DH2024. Washington DC, United States.

Books

Benoît Sagot. 2024. Apprendre les langues aux machines. 325 Éditions du Collège de France.

In the autumn of 2022, the launch of ChatGPT placed artificial intelligence at the heart of the news. Everyone could try out this conversational agent and gauge its power, but for many its inner workings remained mysterious. This inaugural lecture lifts the veil on the research field to which it owes its existence: natural language processing (NLP). Step by step, the author leads us through the history of NLP in order to bring out the current stakes of this discipline, as old as computer science itself, which strives to teach languages to machines. How did we arrive at machine learning, neural networks and generative models? Which ethical aspects require our vigilance in the face of accelerating research and innovation? In the end, is ChatGPT really a revolution?

Tech reports

Ziqian Peng, Rachel Bawden and François Yvon. 2024. Model Cards for the MaTOS Project. Technical report.

Nicolas Dahan, Rachel Bawden and François Yvon. 2024. Survey of Automatic Metrics for Evaluating Machine Translation at the Document Level. Technical report.

This report presents a survey of document-level automatic metrics for machine translation (MT), addressing the need for sophisticated evaluation methods that extend beyond sentence-level assessments. Traditional metrics, which evaluate translations on a sentence-by-sentence basis, often fail to capture the complexities of discourse phenomena, leading to gaps in assessing coherence, cohesion, and cross-sentence dependencies. The report starts by introducing the terminology and notation relevant to document-level MT evaluation. It then describes the linguistic phenomena that are crucial at the document level, related for example to lexical and grammatical cohesion, and overall text coherence, which pose significant challenges for MT systems. Following this, we explore human evaluation protocols targeting document-level translation, discussing the methodologies used to judge translation quality in a more holistic manner. Studying human judgments is necessary, as automatic metrics often aim at reproducing them. We also examine the various test sets that have been developed to support the evaluation of document-level MT. The core of the survey focuses on automatic evaluation metrics designed for document-level translation. These metrics aim to provide a more accurate representation of translation quality by considering the broader context and long-range dependencies within a text, offering a more comprehensive assessment than sentence-level metrics. The report concludes with an overview of the current trends in document-level MT evaluation, summarizing key challenges and identifying areas for future research. It emphasizes the need for the development of context-aware metrics and the importance of creating standardized, document-level test sets to advance MT evaluation.
Alix Chagué, Floriane Chiffoleau, Matthias Gille Levenson, Hugo Scheithauer and Ariane Pinche. 2024. Chaînes d'acquisition, de traitement et de publication du texte. Technical report.

Born in the context of the Ariane-HN consortium, and in the face of the growing integration of artificial intelligence into the production of texts in the humanities, this deliverable aims to present the different stages of a text acquisition pipeline, from transcription to online publication (acquisition, data modelling, enrichment and publication). The proposed protocols are not constrained by strict editorial principles; they are flexible, adaptable and independent of any particular tool. Indeed, locking this reflection into a single software chain would entail risks, notably because of obsolescence, the diversity of needs and the level of complexity of the tasks linked to the particularities of the corpora. This is why we have preferred to stick to a theoretical pipeline that can be adapted according to the technical solutions available. Thus, depending on the resources available and the objectives of the projects, we propose two paths here: a simple path requiring few engineering skills, and a more complex path that adds a number of automation tasks to the acquisition and enrichment of the text, requiring greater mastery of technical tools as well as a deeper understanding of their scientific stakes.
Ziqian Peng, Rachel Bawden and François Yvon. 2024. Handling Very Long Contexts in Neural Machine Translation: a Survey. Technical report.

This report examines methods for integrating an extended discourse context in machine translation, focusing on neural translation methods. Machine translation systems generally translate each sentence independently of its neighbors, which yields systematic errors resulting from a limited discourse context. Therefore, various approaches have been proposed to incorporate cross-sentential context, mostly based on the predominant Transformer architecture. Recently, the introduction of large language models (LLMs) also created novel opportunities to process long-range dependencies, inspiring several context-aware machine translation approaches. We present the challenges of translating long inputs, then investigate encoder-decoder architectures and LLM-based approaches, with a brief overview of efficient transformer implementations as a common background. Furthermore, we also discuss strategies to extend other NLP tasks to a longer context, and list recently available open-source document-level parallel corpora for future exploration. We conclude with a summary of current work and the main research directions.

Other

Sarah Bénière. 2024. DataCatalogue : Restructurer automatiquement les catalogues de ventes.

Presentation of the DataCatalogue project and its processing pipeline as part of the "Panorama de projets" course given to M2 TNAH students at the École nationale des chartes, on 24 January 2024.
Sarah Bénière. 2024. TEI Publisher: A Platform for Digital Editions.

Preprints

Wissam Antoun, Francis Kulumba, Rian Touchent, Eric Villemonte de La Clergerie, Benoît Sagot and Djamé Seddah. 2024. CamemBERT 2.0: A Smarter French Language Model Aged to Perfection. Preprint.

French language models such as CamemBERT have been widely adopted across industries for natural language processing (NLP) tasks, with CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.
Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Florian Cafiero, Laurent Romary, Simon Gabay and Benoît Sagot. 2024. Diachronic Document Dataset for Semantic Layout Analysis. Preprint.

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.
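As a hedged illustration of training a detector at the 1280-pixel input size reported as optimal above, a generic ultralytics-style run might look as follows; the dataset configuration file, model choice and epoch count are assumptions, not the paper's released setup.

```python
# Sketch of an object-detection training run at 1280-pixel input size.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")               # pretrained detection weights (arbitrary choice)
model.train(
    data="layout_dataset.yaml",          # hypothetical dataset config (images + zone classes)
    imgsz=1280,                          # the input size the abstract identifies as optimal
    epochs=100,
)
metrics = model.val()                    # mAP on the validation split
```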
Matthieu Futeral, Cordelia Schmid, Benoît Sagot and Rachel Bawden. 2024. Towards Zero-Shot Multimodal Machine Translation. Preprint.

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e. models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
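A rough PyTorch sketch of the two-objective mixture described above (visually conditioned masked language modelling plus a KL term towards the frozen text-only model) could look as follows; the model interfaces and batch fields are placeholders rather than the released ZeroMMT code.

```python
# Sketch of a combined loss: visually conditioned MLM + KL to the frozen MT model.
import torch
import torch.nn.functional as F

def zeroshot_mmt_loss(adapted_model, frozen_model, batch, kl_weight=1.0):
    # (1) Visually conditioned MLM: predict masked source tokens given image features.
    mlm_logits = adapted_model.mlm_forward(batch["masked_src"], batch["image_feats"])
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        batch["mlm_labels"].view(-1),
        ignore_index=-100,
    )

    # (2) KL divergence between the adapted (multimodal) output distribution and the
    # frozen text-only model's distribution, to keep translations from drifting.
    with torch.no_grad():
        ref_logits = frozen_model(batch["src"], batch["tgt_in"])
    new_logits = adapted_model(batch["src"], batch["tgt_in"], batch["image_feats"])
    kl_loss = F.kl_div(
        F.log_softmax(new_logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    return mlm_loss + kl_weight * kl_loss
```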
Rasul Dent, Juliette Janes, Thibault Clérice, Pedro Ortiz Suarez and Benoît Sagot. 2024. Molyé: A Corpus-based Approach to Language Contact in Colonial France. Preprint.

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Armel Zebaze, Benoît Sagot and Rachel Bawden. 2024. In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation. Preprint.

The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples. However, no systematic studies have been published on how best to select examples, and mixed results have been reported on the usefulness of similarity-based selection over random selection. We provide a study covering multiple LLMs and multiple in-context example retrieval strategies, comparing multilingual sentence embeddings. We cover several language directions, representing different levels of language resourcedness (English into French, German, Swahili and Wolof). Contrary to previously published results, we find that sentence embedding similarity can improve MT, especially for low-resource language directions, and discuss the balance between selection pool diversity and quality. We also highlight potential problems with the evaluation of LLM-based MT and suggest a more appropriate evaluation protocol, adapting the COMET metric to the evaluation of LLMs. Code and outputs are freely available at https://github.com/ArmelRandy/ICL-MT.
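The following sketch shows similarity-based example selection with multilingual sentence embeddings, as described above; the encoder choice, example pool and prompt format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: pick in-context MT examples by sentence-embedding similarity to the test source.
from sentence_transformers import SentenceTransformer, util

pool_src = ["I am going to the market.", "She closed the door quietly.",
            "The harvest was poor this year.", "He repaired the old radio."]
pool_tgt = ["Je vais au marché.", "Elle a fermé la porte doucement.",
            "La récolte a été mauvaise cette année.", "Il a réparé la vieille radio."]
test_src = "They are going to the village market tomorrow."

encoder = SentenceTransformer("sentence-transformers/LaBSE")   # multilingual encoder (arbitrary choice)
pool_emb = encoder.encode(pool_src, convert_to_tensor=True)
test_emb = encoder.encode(test_src, convert_to_tensor=True)

hits = util.semantic_search(test_emb, pool_emb, top_k=2)[0]    # most similar pool sentences
prompt = "".join(f"English: {pool_src[h['corpus_id']]}\nFrench: {pool_tgt[h['corpus_id']]}\n\n"
                 for h in hits) + f"English: {test_src}\nFrench:"
print(prompt)
```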
Francis Kulumba, Wissam Antoun, Guillaume Vimont and Laurent Romary. 2024. Harvesting Textual and Structured Data from the HAL Publication Repository. Preprint.

HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents, spanning 56 languages across 13 identified domains, suitable for language model training, and yielding approximately 16.5 billion tokens (with 8 billion in French and 7 billion in English, the most represented languages). We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers, and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden and Benoît Sagot. 2024. mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus. Preprint.

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like data only, are medium-scale, or are fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2024. Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck. Preprint.

Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau. In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measure the effect of the softmax bottleneck in various settings and find that models with fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
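As a quick, hedged illustration of the kind of spectrum analysis this abstract is about, one can inspect the singular values of a small causal LM's output projection; the model choice and the cut-off threshold below are arbitrary, and this is not the paper's experimental protocol.

```python
# Sketch: look at the spectrum of a small LM's output head (softmax bottleneck intuition).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")    # 768 hidden dimensions
W = model.get_output_embeddings().weight.detach()       # shape (vocab_size, hidden_dim)
singular_values = torch.linalg.svdvals(W.float())       # sorted in descending order

# A fast-decaying spectrum means the head spans far fewer effective directions
# than the vocabulary size, i.e. a bottleneck at the softmax layer.
print(singular_values[:10])
print((singular_values > 1e-3 * singular_values[0]).sum().item(), "effective dimensions")
```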
Alix Chagué and Hugo Scheithauer. 2024. Do (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts. Preprint.

We present an experiment conducted on the augmentation of older grayscale datasets designed for automatic text recognition on contemporary handwriting (IAM-Database). The augmentation method relies on the addition of colored backgrounds taken from real-world historical blank pages and allows us to create an enhanced version of the IAM-Database. We train various transcription models, varying the composition of the train and validation sets using the original and enhanced IAM-Database. We test the resulting models against the original and enhanced test sets, as well as a test set composed of real-world historical documents. We find that, although the transcription engine proves robust to color changes, this technique could be used to bring older grayscale datasets up to speed and create transcription models that perform well on historical handwriting. Additionally, we consider the environmental costs of using enhanced data as opposed to the original dataset, and find that the impact is minor.
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot and Emmanuel Dupoux. 2024. SpiRit-LM: Interleaved Spoken and Written Language Model. Preprint.

We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
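A toy sketch of word-level interleaving between text tokens and aligned speech units, in the spirit of the training scheme described above, is given below; the alignment, unit names and switching rule are all invented for the example.

```python
# Toy illustration of interleaving text tokens and speech units at word boundaries.
def interleave(words, units_per_word, switch_every=2):
    """Alternate between spans of text words and spans of their aligned speech units."""
    tokens, use_text = [], True
    for i, word in enumerate(words):
        if i > 0 and i % switch_every == 0:
            use_text = not use_text          # change modality every `switch_every` words
        tokens.extend([word] if use_text else units_per_word[i])
    return tokens

words = ["the", "cat", "sat", "down"]
units = [["[Hu12]", "[Hu7]"], ["[Hu99]"], ["[Hu4]", "[Hu4]"], ["[Hu31]"]]
print(interleave(words, units))  # ['the', 'cat', '[Hu4]', '[Hu4]', '[Hu31]']
```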
Rachel Bawden, Hatim Bourfoune, Bertrand Cabot, Nathan Cassereau, Pierre Cornette, Marco Naguib, Aurélie Névéol and François Yvon. 2024. Les modèles Bloom pour le traitement automatique de la langue française. Preprint.

The development of very large language models, capable of performing a large range of automatic language processing tasks, simultaneously requires the development of the infrastructure needed to evaluate these models, ideally covering as many tasks as possible. Numerous benchmarks have already been compiled for the English language, making it possible to evaluate these large models from multiple angles. Several multilingual test sets are also available, with much more limited coverage, which are used to measure the ability of these models to handle multiple languages. In this paper, we present our efforts to assemble a multi-task evaluation set for French, which is then used to evaluate models from the BLOOM family. Our results confirm and complement the main evaluation results for BLOOM in English; they allow us to conclude that performance in French and English is very similar, and even better when the prompts used at inference are written in the same language as the texts to be analyzed.

2023

PhD theses and Habilitations

Lionel Tadonfouet Tadjou. 2023. Constitution de fils de discussion cohérents à partir de conversations issues d'outils professionnels de communication et de collaboration. PhD thesis. Sorbonne Université.

Constituting coherent threads of conversation from professional communication and collaboration tools is a process of transforming a written, asynchronous conversation into sub-conversations, each dealing with a specific topic while maintaining the order of arrival of the messages sent by interlocutors in the original conversation. These sub-conversations thus result in linear or tree-like conversation structures. This process can be applied to forum discussions but also to e-mail conversations, both examples being more generally representative of Computer Mediated Content (CMC). To build up these sub-threads of e-mail conversations, we need to rely on their metadata and content. In practice, however, these elements do not seem sufficient. An e-mail conversation is, in fact, a dialogue with a discursive structure that is potentially useful for tracking the evolution of the discussion. It should be noted, however, that this dialogue is asynchronous, which brings specificities of its own. In synchronous dialogues, very strong relationships often emerge between consecutive utterances, which in a long discussion can form clusters of sub-conversations. The constitution of conversation sub-threads from main conversations is based on this type of relationship between the sentences of successive emails in a conversation: this type of relationship is referred to as transverse. Unlike dialogues, where such relations can easily be identified, this is a very complex task in email conversations and constitutes the main sub-problem, called statement matching, for which we suggest several resolution methods. Conversations generally abound in linguistic and paralinguistic information, among which are dialogue acts. They very often help to better identify the content of a dialogue and could strongly contribute to constituting conversation sub-threads via a better identification of relations between utterances. This is the hypothesis we state in the context of solving the statement matching problem, based on an initial phase of classification of dialogue statements. This manuscript describes the work related to our core problem, as well as the sub-problems mentioned above. Around this main focus, we address various related but important, necessary or useful aspects. Thus, we take an in-depth look at CMC, discourse analysis and its historicity, as well as the available corpora for approaching such problems. Then we offer different resolution methods for our sub-problems, with well-detailed experiments and evaluations of said methods. Finally, our manuscript concludes with the following propositions: the application of the proposed methods to other types of CMC, such as forums, and other possibilities to be explored to solve the problem of constituting conversational sub-threads.
José Rosales Núñez. 2023. Machine Translation of User-Generated Contents : an Evaluation of Neural Translation Systems under Zero-shot Conditions. PhD thesis. Université Paris-Saclay.

The rapid advancements in telecommunications over the past few decades have revolutionized the way people exchange information. Thanks to these advancements, the average user can now communicate with others across the globe in real-time and with minimal delay. With approximately 60% of the global population having Internet access, billions of individuals interact by sharing user-generated content (UGC) in various forms. This UGC, which often includes reviews and opinions, provides a valuable source of information, offering a comprehensive view of global trends. Machine Translation (MT) plays a vital role in enabling smooth communication and facilitating the automatic processing of UGC for data mining purposes. However, translating UGC presents unique challenges compared to translating traditional text. UGC is highly productive and exhibits various phenomena such as repeated characters, typographical errors, contractions, jargon, and unconventional sentence structures. These specificities lead to a significant number of Out-of-Vocabulary tokens (OOVs) and rare sequences, which pose problems since they are not adequately represented in the standard parallel corpora used to train MT models. Additionally, conventional domain adaptation techniques like fine-tuning have limited success in addressing these challenges. They suffer from performance degradation when applied to in-domain data and are unable to keep up with the ever-evolving nature of UGC. In this study, we focus on the task of automatically translating UGC in the zero-shot scenario, where we refrain from using any UGC-specific training data. Our aim is to develop more generalized MT architectures that can handle the distributional drift inherent in UGC. In the initial phase of our research, we dedicated our efforts to identifying and quantifying the specificities of UGC that hinder translation performance. We have also created evaluation frameworks and data collections to aid in this endeavor. Using off-the-shelf models, we investigate the challenges faced by MT systems when translating UGC and link the errors to their underlying mechanisms. Subsequently, we delve into the study and proposal of different methods to address the challenges posed by UGC. These methods include exploring normalization pipelines, employing more granular tokenization techniques, and utilizing latent variable models to enhance the robustness of MT systems. For each of these approaches, we systematically evaluate the performance and robustness of the systems, conduct a detailed error analysis, and offer insights into promising avenues for tackling the automatic translation of UGC in the zero-shot setting.

Journal articles

Rute Costa, Ana Salgado, Margarida Ramos, Sara Carvalho, Fahad Khan, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2023. A crossroad between lexicography and terminology work: Knowledge organization and domain labelling. Digital Scholarship in the Humanities 38 pages i17–i29. Oxford University Press.

The MORDigital project aims to encode the selected editions of the Diccionario de Lingua Portugueza by António de Morais Silva, first published in 1789. Our ultimate goals are, on the one hand, to promote accessibility to cultural heritage while fostering reusability and, on the other hand, to contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards. The Morais dictionary represents a significant legacy, since it marks the beginning of Portuguese dictionaries, having served as a model for all subsequent lexicographic production. The team follows a new paradigm in lexicography, which results from the convergence between lexicography, terminology, computational linguistics, and ontologies as an integral part of digital humanities and linked (open) data. In the Portuguese context, this research fills a gap concerning searchable online retrodigitized dictionaries, built on current standards and methodologies which promote data sharing and harmonization, namely TEI Lex-0. The team will further ensure the connection to other existing systems and lexical resources, particularly in the Portuguese-speaking world.
Simon Gabay, Philippe Gambette, Rachel Bawden and Benoît Sagot. 2023. Ancien ou moderne ? Pistes computationnelles pour l'analyse graphématique des textes écrits au XVIIe siècle. Linx 85 Presses Universitaires de Paris Nanterre.

The use of contemporary spelling rather than old graphic systems in the vast majority of current editions of 17th century French texts has the unfortunate effect of masking their graphematic richness. Such valuable information has remained concealed and therefore under-exploited, despite the potential it holds in terms of analysis. By favouring a practical corpus-based approach, rather than a theoretical one, and by relying on a recategorisation of the various competing systems at that time in French scriptae, we propose the foundations of a scriptometric study of the classical language, focusing on the analysis of specific documents, both manuscripts and old prints.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2023. Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics 11 pages 250–266. The MIT Press.

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.
Thibault Clérice, Malamatenia Vlachou-Efstathiou and Alix Chagué. 2023. CREMMA Medii Aevi: Literary manuscript text recognition in Latin. Journal of Open Humanities Data 9 pages 1–19. Ubiquity Press.

This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin, from the 11th to the 16th century. It connects with the Medieval French dataset as well as earlier Latin datasets by enforcing common guidelines. We provide our own additions to Ariane Pinche's Old French guidelines to deal with Latin-specific cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviation marks, we offer new models that outperform the base Old French model on the Latin dataset, reaching readability levels on unknown manuscripts.
Thibault Clérice. 2023. You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine. Journal of Data Mining and Digital Humanities Historical Documents and... INRIA.

Layout Analysis (the identification of zones and their classification) is, alongside line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy outputs. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target for the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from pixel-classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
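As a minimal illustration of the move from polygons to isothetic rectangles, the following helper reduces a zone polygon to an axis-aligned box in YOLO's normalised format; it is a generic sketch, not part of the YALTAi package.

```python
# Sketch: reduce a zone polygon to an isothetic (axis-aligned) rectangle for object detection.
def polygon_to_yolo_box(points, img_width, img_height):
    """points: list of (x, y) pixel coordinates outlining a zone."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # YOLO expects centre coordinates and sizes, normalised to [0, 1].
    x_c = (x_min + x_max) / 2 / img_width
    y_c = (y_min + y_max) / 2 / img_height
    w = (x_max - x_min) / img_width
    h = (y_max - y_min) / img_height
    return x_c, y_c, w, h

print(polygon_to_yolo_box([(120, 80), (610, 85), (605, 890), (118, 885)], 800, 1200))
```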

Conference proceedings

Marc Hulcelle, Giovanna Varni, Nicolas Rollet and Chloé Clavel. 2023. Comparing a Mentalist and an Interactionist Approach for Trust Analysis in Human-Robot Interaction. In Proceedings of the 11th International Conference on Human-Agent Interaction. pages 273–280. ACM. Gothenburg, Sweden.

Yanzhu Guo, Guokan Shang, Virgile Rennard, Michalis Vazirgiannis and Chloé Clavel. 2023. Automatic Analysis of Substantiation in Scientific Peer Reviews. In Findings of the Association for Computational Linguistics: EMNLP 2023. pages 10198–10216. Association for Computational Linguistics. Singapore.

With the increasing amount of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation — one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence — and provide a solution automatizing this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights on peer reviewing quality in NLP conferences over recent years. The dataset is available at https://github.com/YanzhuGuo/SubstanReview.
Robin Algayres, Yossi Adi, Tu Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot and Emmanuel Dupoux. 2023. Generative Spoken Language Model based on continuous word-sized audio tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pages 3008–3028. Association for Computational Linguistics. Singapore.

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms- or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete-unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
Robin Algayres, Pablo Diego-Simon, Benoît Sagot and Emmanuel Dupoux. 2023. XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words. In Findings of the Association for Computational Linguistics: EMNLP 2023. pages 12103–12112. Association for Computational Linguistics. Singapore.

Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models, which have proved to adapt quickly to new tasks through fine-tuning, even in low-resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries that are themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels that are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state of the art that is, on average, 130% higher than the previous one as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
Simon Meoni, Eric De la Clergerie and Theo Ryffel. 2023. Large Language Models as Instructors: A Study on Multilingual Clinical Entity Extraction. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. pages 178–190. Association for Computational Linguistics. Toronto, Canada.

In clinical and other specialized domains, data are scarce due to their confidential nature. This lack of data is a major problem when fine-tuning language models. Nevertheless, very large language models (LLMs) are promising for the medical domain but cannot be used directly in healthcare facilities due to data confidentiality issues. We explore an approach of annotating training data with LLMs to train smaller models more adapted to our problem. We show that this method yields promising results for information extraction tasks.
José Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2023. Multi-way Variational NMT for UGC: Improving Robustness in Zero-shot Scenarios via Mixture Density Networks. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). pages 447–459. University of Tartu Library. Tórshavn, Faroe Islands.

This work presents a novel Variational Neural Machine Translation (VNMT) architecture with enhanced robustness properties, which we investigate through a detailed case-study addressing noisy French user-generated content (UGC) translation to English. We show that the proposed model, with results comparable or superior to state-of-the-art VNMT, improves performance over UGC translation in a zero-shot evaluation scenario while keeping optimal translation scores on in-domain test sets. We elaborate on such results by visualizing and explaining how neural learning representations behave when processing UGC noise. In addition, we show that VNMT enforces robustness to the learned embeddings, which can be later used for robust transfer learning approaches.
Rachel Bawden and Benoît Sagot. 2023. RoCS-MT: Robustness Challenge Set for Machine Translation. In Proceedings of the Eighth Conference on Machine Translation. pages 198–216. Association for Computational Linguistics. Singapore.

RoCS-MT, a Robust Challenge Set for Machine Translation (MT), is designed to test MT systems' ability to translate user-generated content (UGC) that displays non-standard characteristics, such as spelling errors, devowelling, acronymisation, etc. RoCS-MT is composed of English comments from Reddit, selected for their non-standard nature, which have been manually normalised and professionally translated into five languages: French, German, Czech, Ukrainian and Russian. In the context of the WMT23 test suite shared task, we analyse the models submitted to the general MT task for all from-English language pairs, offering some insights into the types of problems faced by state-of-the-art MT models when dealing with non-standard UGC texts. We compare automatic metrics for MT quality, including quality estimation, to see if the same conclusions can be drawn without references. In terms of robustness, we find that many of the systems struggle with non-standard variants of words (e.g. due to phonetically inspired spellings, contraction, truncations, etc.), but that this depends on the system and the amount of training data, with the best overall systems performing better across all phenomena. GPT4 is the clear frontrunner. However, we caution against drawing conclusions about generalisation capacity, as it and other systems could be trained on the source side of RoCS and also on similar data.
Mariana Neves, Antonio Jimeno Yepes, Aurélie Névéol, Rachel Bawden, Giorgio Maria Di Nunzio, Roland Roller, Philippe Thomas, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova, Dina Wiemann and Cristian Grozea. 2023. Findings of the WMT 2023 Biomedical Translation Shared Task: Evaluation of ChatGPT 3.5 as a Comparison System. In Proceedings of the Eighth Conference on Machine Translation. pages 43–54. Association for Computational Linguistics. Singapore.

We present an overview of the Biomedical Translation Task that was part of the Eighth Conference on Machine Translation (WMT23). The aim of the task was the automatic translation of biomedical abstracts from the PubMed database. It included twelve language directions, namely, French, Spanish, Portuguese, Italian, German, and Russian, from and into English. We received submissions from 18 systems and for all the test sets that we released. Our comparison system was based on ChatGPT 3.5 and performed very well in comparison to many of the submissions.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović and Mariya Shmatova. 2023. Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet. In Proceedings of the Eighth Conference on Machine Translation. pages 1–42. Association for Computational Linguistics. Singapore.

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (covering 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).
Valentin Taillandier, Dieuwke Hupkes, Benoît Sagot, Emmanuel Dupoux and Paul Michel. 2023. Neural Agents Struggle to Take Turns in Bidirectional Emergent Communication. In Proceedings of 11th International Conference on Learning Representation (ICLR 2023). Kigali, Rwanda.

The spontaneous exchange of turns is a central aspect of human communication. Although turn-taking conventions come to us naturally, artificial dialogue agents struggle to coordinate, and must rely on hard-coded rules to engage in interactive conversations with human interlocutors. In this paper, we investigate the conditions under which artificial agents may naturally develop turn-taking conventions in a simple language game. We describe a cooperative task where success is contingent on the exchange of information along a shared communication channel where talking over each other hinders communication. Despite these environmental constraints, neural-network based agents trained to solve this task with reinforcement learning do not systematically adopt turn-taking conventions. However, we find that agents that do agree on turn-taking protocols end up performing better. Moreover, agents that are forced to perform turn-taking can learn to solve the task more quickly. This suggests that turn-taking may help to generate conversations that are easier for speakers to interpret.
Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswami, Changhan Wang, Juan Pino, Benoît Sagot and Holger Schwenk. 2023. SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 16251–16269. Association for Computational Linguistics. Toronto, Canada.

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations (S2ST) mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on Europarl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pretraining and sparse scaling using Mixture-of-Experts bring large gains to translation performance. We are open-sourcing the mined data, speech encoders used for mining, multilingual HuBERT models in four language families for target unit generation, language-specific vocoders for speech synthesis from discrete units, and S2S models trained and presented in this work.
Paul-Ambroise Duquenne, Holger Schwenk and Benoît Sagot. 2023. Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer. In Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023). Dublin, Ireland.

Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice and Jade Norindr. 2023. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. In Proceedings of Computational Humanities Research 2023. pages 734–756. Paris, France.

In this paper, we test a famous conjecture in literary history put forward by Seignobos and de Rougemont, according to which the French central medieval period (12th-13th centuries) is characterized by an important increase in the cultural importance of love. To do so, we focus on the large and culturally important body of manuscripts containing medieval French long narrative fictions, in particular epics (chansons de geste, of the Matter of France) and romances (chiefly romans on the Matters of Britain and of Rome), both in verse and in prose, from the 12th to the 15th century. We introduce the largest available corpus of these texts, the Corpus of Medieval French Epics and Romances, composed of digitised manuscripts drawn from Gallica, and processed through layout analysis and handwritten text recognition. We then use semantic representations based on embeddings to monitor the place given to love and violence in this corpus through time. We observe that themes (such as the relation between love and death) and emblematic works well identified by literary history do indeed play a central part in the representation of love in the corpus, but our modelling also points to the characteristic nature of more overlooked works. Variation in time seems to show that there is indeed a phase of expansion of love in these fictions, in the 13th and early 14th century, followed by a period of contraction, which seems to correlate with the Crisis of the Late Middle Ages.
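A simplified sketch of monitoring a theme with embeddings, loosely in the spirit of the approach described above, is shown below; the passages, seed description and encoder are invented placeholders, not the authors' corpus or pipeline.

```python
# Sketch: track a theme over time by comparing passage embeddings with a seed description.
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

passages = [
    {"century": 12, "text": "The knight sighed for the love of his distant lady."},
    {"century": 13, "text": "She wept, for love had wounded her heart more than any blade."},
    {"century": 15, "text": "The armies met at dawn and many men were slain."},
]
seed = "romantic love, passion, courtship"

encoder = SentenceTransformer("distiluse-base-multilingual-cased-v2")  # arbitrary multilingual encoder
seed_emb = encoder.encode(seed, convert_to_tensor=True)

scores = defaultdict(list)
for p in passages:
    emb = encoder.encode(p["text"], convert_to_tensor=True)
    scores[p["century"]].append(util.cos_sim(seed_emb, emb).item())

for century, vals in sorted(scores.items()):
    print(century, sum(vals) / len(vals))   # mean "love" salience per century
```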
Arij Riabi, Menel Mahamdi and Djamé Seddah. 2023. Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). pages 266–278. Association for Computational Linguistics. Toronto, Canada.

In this paper we address the scarcity of annotated data for NArabizi, a Romanized form of North African Arabic used mostly on social media, which poses challenges for Natural Language Processing (NLP). We introduce an enriched version of NArabizi Treebank (Seddah et al., 2020) with three main contributions: the addition of two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensure annotation consistency. Our experimental results, using different tokenization schemes, showcase the value of our contributions and highlight the impact of working with non-gold tokenization for NER and dependency parsing. To facilitate future research, we make these annotations publicly available. Our enhanced NArabizi Treebank paves the way for creating sophisticated language models and NLP tools for this under-represented language.
Galo Castillo-lópez, Arij Riabi and Djamé Seddah. 2023. Analyzing Zero-Shot transfer Scenarios across Spanish variants for Hate Speech Detection. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). pages 1–13. Association for Computational Linguistics. Dubrovnik, Croatia.

Hate speech detection in online platforms has been widely studied in the past. Most of these works were conducted in English and a few resource-rich languages. Recent approaches tailored for low-resource languages have explored the interest of zero-shot cross-lingual transfer learning models in resource-scarce scenarios. However, language variation between geolects, such as American and British English or Latin-American and European Spanish, is still a problem for NLP models, which often rely on (latent) lexical information for their classification tasks. More importantly, the cultural aspect, crucial for hate speech detection, is often overlooked. In this work, we present the results of a thorough analysis of hate speech detection models' performance on different variants of Spanish, including a new Twitter data set of hate speech toward immigrants that we built to cover these variants. Using mBERT and Beto, a monolingual Spanish BERT-based language model, as the basis of our transfer learning architecture, our results indicate that hate speech detection models for a given Spanish variant are affected when other variations of that language are not considered. Hate speech expressions can vary from region to region where the same language is spoken.
Alafate Abulimiti, Chloé Clavel and Justine Cassell. 2023. When to generate hedges in peer-tutoring interactions. In SIGDIAL - 24th Meeting of the Special Interest Group on Discourse and Dialogue. Prague, Czech Republic.

This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviours. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models. Results show that embedding layers, which capture the semantic information of the previous turns, significantly improve the model's performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviours, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.
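
As a rough illustration of the feature-attribution step described above, the sketch below trains a toy classifier on synthetic "previous turn" features and inspects them with Shapley values. It assumes the shap package is available; the feature names, the gradient-boosted model and the synthetic labels are invented for illustration and are not the paper's setup.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["rapport_level", "tutor_gaze_at_tutee", "tutee_gaze_at_tutor",
                 "prev_turn_incorrect", "n_prior_hedges"]          # invented feature names
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # toy labels

clf = GradientBoostingClassifier().fit(X, y)

# Shapley values attribute each prediction to the features describing the previous turns.
shap_values = shap.TreeExplainer(clf).shap_values(X)
for name, score in sorted(zip(feature_names, np.abs(shap_values).mean(axis=0)),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```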
Thibault Clérice and Anthony Glaise. 2023. Twenty-One* Pseudo-Chrysostoms and more: authorship verification in the patristic world. In Proceedings of the Computational Humanities Research Conference 2023. Paris, France.

As the most prolific of the Church Fathers, John Chrysostom (344-407 CE) has a vast textual mass and theological importance that has led to a significant misattribution of texts, resulting in the existence of a second corpus known as the pseudo-Chrysostomian corpus. Like many Greek-language Church Fathers' works, this corpus comprises anonymous texts, which scholars have attempted to reattribute or group together based on factors such as the person's function, biography, ideology, style, etc. One survey conducted by Voicu in 1981 explored potential groupings of such texts and produced a critical list of 21 Pseudo-Chrysostom works identified by scholars, including Montfaucon (1655-1741), one of the first modern editors of Chrysostom's writings. In this paper, we present a novel approach to addressing pseudonymous work in the context of chrysostomian studies. We propose to employ siamese networks within an authorship verification framework, following the methodology commonly used in recent computational linguistic competitions. Our embedding model is trained using commonly used features in the digital humanities landscape, such as the most frequent words, affixes, and POS trigrams, utilizing a signal-to-noise ratio distance and pair mining. The results of our model show high AUCROC scores (0.855). Furthermore, the article concludes with an analysis of the pseudo-Chrysostoms proposed by Voicu. We validate a significant portion of the hypotheses found in Voicu's survey while also providing counter-arguments for two Pseudo-Chrysostoms. This research contributes to shedding light on the attribution of ancient texts and enriches the field of chrysostomian studies.
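
The verification setup can be sketched roughly as follows: a small embedding network over frequency features (most frequent words, affixes, POS trigrams), a signal-to-noise-ratio distance, and a simple contrastive loss over in-batch pairs (the paper additionally uses pair mining). All shapes, the random data and the margin are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, n_features=300, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def snr_distance(a, b):
    # Signal-to-noise-ratio distance: variance of the difference over variance of the anchor.
    return (a - b).var(dim=-1) / a.var(dim=-1).clamp_min(1e-8)

def contrastive_snr_loss(emb, labels, margin=1.0):
    # Pull same-author pairs together, push different-author pairs beyond the margin.
    d = snr_distance(emb.unsqueeze(1), emb.unsqueeze(0))            # all pairwise distances
    same = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()
    off_diag = 1 - torch.eye(len(emb))
    pos = (d * same * off_diag).sum() / (same * off_diag).sum().clamp_min(1)
    neg = ((margin - d).clamp_min(0) * (1 - same)).sum() / (1 - same).sum().clamp_min(1)
    return pos + neg

features = torch.randn(32, 300)       # frequency profiles (words, affixes, POS trigrams)
authors = torch.randint(0, 8, (32,))  # toy author labels for 32 text chunks
loss = contrastive_snr_loss(EmbeddingNet()(features), authors)
loss.backward()
print(float(loss))
```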
Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux and Yossi Adi. 2023. Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). pages 465–477. Association for Computational Linguistics. Toronto, Canada (in-person and online).

Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representations for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
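
A schematic, hypothetical sketch of the iterative pseudo-labelling idea: units assigned to the clean signal serve as targets for an augmented view of the same signal, so that both views are pushed onto the same discrete units. The toy augmentation, convolutional encoder and nearest-centroid quantizer below stand in for the paper's models; all sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(wave):
    # Toy, content-preserving perturbation (a crude stand-in for e.g. a time-stretch).
    return F.interpolate(wave.unsqueeze(1), scale_factor=1.1, mode="linear").squeeze(1)

class Encoder(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)

    def forward(self, wave):                                   # (batch, samples)
        return self.conv(wave.unsqueeze(1)).transpose(1, 2)    # (batch, frames, dim)

def unit_scores(feats, codebook):
    # Negative squared distance to each codebook entry: higher means closer.
    return -((feats.unsqueeze(2) - codebook) ** 2).sum(-1)     # (batch, frames, n_units)

n_units = 50
encoder = Encoder()
codebook = torch.randn(n_units, 32, requires_grad=True)
opt = torch.optim.Adam(list(encoder.parameters()) + [codebook], lr=1e-3)

wave = torch.randn(4, 4000)                                    # toy raw audio batch
for _ in range(3):                                             # pseudo-labelling rounds
    with torch.no_grad():                                      # units of the clean view
        targets = unit_scores(encoder(wave), codebook).argmax(-1)
    scores = unit_scores(encoder(augment(wave)), codebook)     # augmented view
    n = min(targets.shape[1], scores.shape[1])                 # align frame counts
    loss = F.cross_entropy(scores[:, :n].reshape(-1, n_units), targets[:, :n].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```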
Tu Anh Nguyen, Wei-Ning Hsu, Antony d'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi and Emmanuel Dupoux. 2023. Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis. In Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023). pages 4823–4827. ISCA. Dublin, Ireland.

Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce EXPRESSO, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics and baseline models are open sourced.
Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux and Abdelrahman Mohamed. 2023. Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). IEEE. Ixia-Ialyssos, Greece.

The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of learned units is studied intrinsically using zero speech metrics and on the downstream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
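
A toy illustration of the coarser-unit idea: byte-pair-encoding merges learned over sequences of cluster IDs turn frequent unit bigrams into single, longer-range symbols. The random unit sequences and merge budget below are placeholders, not the paper's data or settings.

```python
import random
from collections import Counter

def most_frequent_pair(sequences):
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_symbol):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)          # replace the bigram with one coarser unit
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, n_merges=10, first_new_id=1000):
    merges = []
    for step in range(n_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_id = first_new_id + step
        merges.append((pair, new_id))
        sequences = [merge_pair(s, pair, new_id) for s in sequences]
    return merges, sequences

random.seed(0)
units = [[random.randint(0, 9) for _ in range(50)] for _ in range(100)]   # toy cluster IDs
merges, coarser = learn_bpe(units, n_merges=5)
print(merges, len(units[0]), len(coarser[0]))
```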
Maud Bénard, Alexandra Mestivier, Natalie Kubler, Lichao Zhu, Rachel Bawden, Eric De La Clergerie, Laurent Romary, Mathilde Huguin, Jean-François Nominé, Ziqian Peng and François Yvon. 2023. MaTOS: Traduction automatique pour la science ouverte. In Actes de CORIA-TALN 2023. Actes de l'atelier «Analyse et Recherche de Textes Scientifiques»; (ARTS)@TALN 2023. pages 8–15. ATALA. Paris, France.

This contribution presents the MaTOS (Machine Translation for Open Science) project, which aims to develop new methods for the complete machine translation (MT) of scientific documents between English and French, as well as automatic metrics to evaluate translation quality. To this end, MaTOS is interested in (a) the collection of open resources for specialised MT; (b) the description of textual coherence markers for scientific articles; (c) the development of new multilingual processing methods for documents; and (d) metrics to measure progress in document-level machine translation.
Simon Meoni, Rian Touchent and Eric De La Clergerie. 2023. Passe ta pharma d'abord ! In Actes de CORIA-TALN 2023. Actes du Défi Fouille de Textes@TALN2023. pages 68–76. ATALA. Paris, France.

We present the three experiments carried out by the ALMAnaCH - Arkhn team and their results for the DÉfi Fouille de Textes (DEFT) 2023 shared task. The scores are encouraging but above all point to new elements that need to be taken into account to succeed in this challenge. We explored different approaches with models of varying sizes and modelled the task in different ways (multi-label classification, textual entailment, sequence-to-sequence). We did not observe significant performance gains. Our experiments suggest that external knowledge bases are needed to obtain good results on this type of task.
Lionel Tadonfouet Tadjou, Eric De La Clergerie, Fabrice Bourge and Tiphaine Marie. 2023. Constitution de sous-fils de conversations d'emails. In Actes de CORIA-TALN 2023. Actes de la 18e Conférence en Recherche d'Information et Applications (CORIA). pages 157–171. ATALA. Paris, France.

Email conversations in the workplace are sometimes difficult for collaborators to follow because they can deal with multiple topics and involve many interlocutors. To improve understanding of key messages, it is helpful to create subthreads within the conversation. In our study, we propose a two-stage pipeline to recognize dialogue acts in email text segments and link them to improve information accessibility. This pipeline creates pairs of text segments across the conversation, making it easier to understand the key messages. To our knowledge, this is the first time the issue of creating conversation threads has been addressed in email conversations. We annotated the BC3 corpus of emails with dialogue acts and linked conversation email text segments.
Lydia Nishimwe. 2023. Normalisation lexicale de contenus générés par les utilisateurs sur les réseaux sociaux. In Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL). pages 160–183. ATALA. Paris, France.

The boom of natural language processing (NLP) is taking place in a world where more and more content is produced online. On social networks especially, the textual content published by users is full of “non-standard” phenomena such as spelling mistakes, jargon, marks of expressiveness, etc. Thus, NLP models, which are largely trained on “standard” data, suffer a decline in performance when applied to user-generated content (UGC). One approach to mitigate this degradation is lexical normalisation, where non-standard words are replaced by their standard forms. In this paper, we review the state of the art of lexical normalisation of UGC and run a preliminary experimental study to show the advantages and difficulties of this task.
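
In its simplest form, lexical normalisation can be illustrated as a lookup that maps non-standard tokens to standard forms and leaves unknown tokens unchanged; real systems, including those surveyed in the paper, use far richer models. The lexicon entries below are invented for illustration.

```python
NORMALISATION_LEXICON = {   # illustrative entries, not a real resource
    "u": "you",
    "gr8": "great",
    "pls": "please",
    "thx": "thanks",
}

def normalise(tokens):
    # Replace each non-standard token by its standard form when known.
    return [NORMALISATION_LEXICON.get(tok.lower(), tok) for tok in tokens]

print(normalise("thx u r gr8".split()))   # ['thanks', 'you', 'r', 'great']
```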
Simon Meoni, Théo Ryffel and Eric De La Clergerie. 2023. Annotation d'entités cliniques en utilisant les Larges Modèles de Langue. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 190–203. ATALA. Paris, France.

In the clinical domain, as in other specialised domains, data are scarce because of their confidential nature. This lack of data is a major problem when fine-tuning language models. Moreover, very large language models (LLMs) show promising performance in the medical domain. Nevertheless, they cannot be used directly within healthcare institutions' infrastructures for data confidentiality reasons. We explore an approach in which training data are annotated with LLMs in order to train smaller models that are better suited to our problem. This method yields promising results for information extraction tasks.
You Zuo, Benoît Sagot, Kim Gerdes, Houda Mouzoun and Samir Ghamri Doudane. 2023. Exploring Data-Centric Strategies for French Patent Classification: A Baseline and Comparisons. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 349–365. ATALA. Paris, France.

This paper proposes a novel approach to French patent classification leveraging data-centric strategies. We compare different approaches for the two deepest levels of the IPC hierarchy: the IPC group and subgroups. Our experiments show that while simple ensemble strategies work for shallower levels, deeper levels require more sophisticated techniques such as data augmentation, clustering, and negative sampling. Our research highlights the importance of language-specific features and data-centric strategies for accurate and reliable French patent classification. It provides valuable insights and solutions for researchers and practitioners in the field of patent classification, advancing research in French patent classification.
Rian Touchent, Laurent Romary and Eric De La Clergerie. 2023. CamemBERT-bio : Un modèle de langue français savoureux et meilleur pour la santé. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 323–334. ATALA. Paris, France.

Clinical data in hospitals are increasingly accessible for research through health data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical reports. Transfer learning with BERT-like models such as CamemBERT has enabled major advances, in particular for named entity recognition. However, these models are trained on general-domain language and are less effective on biomedical data. This is why we propose a new public French biomedical dataset on which we continued the pre-training of CamemBERT. We thus present a first version of CamemBERT-bio, a public model specialised for the French biomedical domain, which shows an average gain of 2.54 points of F-measure on various biomedical named entity recognition evaluation sets.
Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot and Rachel Bawden. 2023. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 28–42. ATALA. Paris, France.

Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
Wissam Antoun, Virginie Mouilleron, Benoît Sagot and Djamé Seddah. 2023. Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect? In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux–articles longs. pages 14–27. ATALA. Paris, France.

Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into French and training a classifier on the translated data. Results show that the detectors can effectively detect ChatGPT-generated text, with a degree of robustness against basic attack techniques in in-domain settings. However, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. The study emphasizes caution when applying in-domain testing results to a wider variety of content. We provide our translated datasets and models as open-source resources.
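
A minimal sketch of the detector-training step, assuming a toy corpus of French texts labelled human vs. generated: the TF-IDF character n-gram features and logistic regression below stand in for the transformer-based classifiers evaluated in the paper, and the two example texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Je vous remercie de votre message, voici un résumé structuré des points clés.",
    "franchement jsp trop, on verra bien demain lol",
]
labels = [1, 0]   # 1 = generated, 0 = human (toy labels)

detector = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                         LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["merci beaucoup, voici les informations demandées."]))
```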
Francesca Frontini, Laurent Romary and Anas Fahad Khan. 2023. ISO LMF 24613-6: A Revised Syntax Semantics Module for the Lexical Markup Framework. In Proceedings of the 4th Conference on Language, Data and Knowledge. pages 316–321. NOVA CLUNL, Portugal. Vienna, Austria.

The Lexical Markup Framework (LMF) is a meta-model for representing data in monolingual and multilingual lexical databases with a view to its use in computer applications. The "new LMF" replaces the old LMF standard, ISO 24613:2008, and is being published as a multi-part standard. This short paper introduces one of these new parts, ISO 24613-6, namely the Syntax and Semantics (SynSem) module. The SynSem module allows for the description of syntactic and semantic properties of lexemes, as well as the complex interactions between them. While the new standard remains faithful to (and backwards compatible with) the syntax and semantics coverage of the previous model, the new standard clarifies and simplifies it in a few places, which will be illustrated.
Alix Chagué, Thibault Clérice, Jade Norindr, Maxime Humeau, Baudoin Davoury, Elsa Van Kote, Anaïs Mazoue, Margaux Faure and Soline Doat. 2023. Manu McFrench, from zero to hero: impact of using a generic handwriting recognition model for smaller datasets. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.

Long paper presentation for ADHO's annual conference on Digital Humanities (2023), discussing the importance of using generic transcription models for HTR and how to create them. We use the case of the CREMMA datasets and the Manu McFrench models as an example.
Thibault Clérice, Alix Chagué and Hugo Scheithauer. 2023. Workshop HTR-United: metadata, quality control and sharing process for HTR training data. In DH 2023 - Digital Humanities Conference: Collaboration as Opportunity. Graz, Austria.

Workshop for ADHO's 2023 conference on Digital Humanities, introducing HTR-United's main features and demonstrating how to use them, on top of presenting essential Continuous Integration principles.
Alix Chagué and Thibault Clérice. 2023. ''I'm here to fight for ground truth'': HTR-United, a solution towards a common for HTR training data. In Digital Humanities 2023: Collaboration as Opportunity. Graz, Austria.

Short paper presentation for ADHO's annual conference on the Digital Humanities (DH2023), introducing the HTR-United infrastructure and the stakes of sharing training datasets for HTR of historical documents.
Sonal Sannigrahi and Rachel Bawden. 2023. Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation. pages 181–192. European Association for Machine Translation. Tampere, Finland.

Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements and our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences even for relatively low-resource languages. Our code will be made publicly available.
Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation. pages 157–170. European Association for Machine Translation. Tampere, Finland.

The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from overgeneration and generating in the wrong language, but this is greatly improved in the few-shot setting, with very good results for a number of language pairs. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
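
Few-shot prompting for translation can be sketched as below. The small public BLOOM checkpoint, the one-shot prompt template and the example sentence are assumptions made for illustration, not the paper's exact evaluation setup (which covers several prompts, model sizes and test sets).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"   # small public variant; the paper evaluates larger ones
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "English: The cat sleeps on the sofa. French: Le chat dort sur le canapé.\n"
    "English: I would like a cup of coffee. French:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
new_tokens = output[0][inputs["input_ids"].shape[1]:]   # keep only the continuation
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```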
Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot and Rachel Bawden. 2023. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 5394–5413. Association for Computational Linguistics. Toronto, Canada.

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters, a novel guided self-attention mechanism and which is jointly trained on both visually-conditioned masking and MMT. We also introduce CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation set of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks and outperforms baselines and state-of-the-art MMT systems by a large margin on our contrastive test set. Our code and CoMMuTE are freely available.
Wissam Antoun, Benoît Sagot and Djamé Seddah. 2023. Data-Efficient French Language Modeling with CamemBERTa. In Findings of the Association for Computational Linguistics: ACL 2023. pages 5174–5185. Association for Computational Linguistics. Toronto, Canada.

Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective. We evaluate our model's performance on a variety of French downstream tasks and datasets, including question answering, part-of-speech tagging, dependency parsing, named entity recognition, and the FLUE benchmark, and compare against CamemBERT, the state-of-the-art monolingual model for French. Our results show that, given the same amount of training tokens, our model outperforms BERT-based models trained with MLM on most tasks. Furthermore, our new model reaches similar or superior performance on downstream tasks compared to CamemBERT, despite being trained on only 30% of its total number of input tokens. In addition to our experimental results, we also publicly release the weights and code implementation of CamemBERTa, making it the first publicly available DeBERTaV3 model outside of the original paper and the first openly available implementation of a DeBERTaV3 training objective. https://gitlab.inria.fr/almanach/CamemBERTa
El Haff Karim, Wissam Antoun, Florence Le Ber and Véronique Pitchon. 2023. Reconnaissance des entités nommées pour l'analyse des pharmacopées médiévales. In Proceedings of EGC 2023 - Extraction et Gestion des Connaissances. pages 329–336. Lyon, France.

Today, many projects focus on the application of linguistic technologies on modern medical corpora, especially in the field of Named Entity Recognition. Besides, ancient pharmacopoeias are being explored with manual data entry by specialists in history and biology in order to extract knowledge. These analyses are carried out without necessarily going through the automatic recognition of named entities which could accelerate the exploration of the manuscripts. Therefore, we propose here a link between the two practices by: (1) creating a named entity recognition dataset for English translations of medieval Arabic pharmacopoeias and (2) training and evaluating language models that are pre-trained on multiple domains.

Communications

Hugo Scheithauer, Sarah Bénière, Jean-Philippe Moreux and Laurent Romary. 2023. DataCatalogue : rétro-structuration automatique des catalogues de vente. In Webinaire Culture-Inria. Paris, France.

Hugo Scheithauer. 2023. DataCatalogue : Un projet pour la restructuration automatique de catalogues de vente. In Traitements automatiques pour les humanités numériques - corpus d'histoire de l'art, d'enseignement, d'urbanisme. Nanterre, France.

Chahan Vidal-Gorène, Jean-Baptiste Camps and Thibault Clérice. 2023. Synthetic lines from historical manuscripts: an experiment using GAN and style transfer. In Visual Processing of Digital Manuscripts: Workflows, Pipelines, Best Practices. ICIAP 2023 Workshops. ICIAP 2023. Udine, Italy.

Given enough data of sufficient quality, HTR systems can achieve high accuracy, regardless of language, script or medium. Despite growing pooling of datasets, the question of the required quantity of training material still remains crucial for the transfer of models to out-of-domain documents, or the recognition of new scripts and under-resourced character classes. We propose a new data augmentation strategy, using generative adversarial networks (GAN). Inspired by synthetic lines generation for printed documents, our objective is to generate handwritten lines in order to massively produce data for a given style or under-resourced character class. Our approach, based on a variant of ScrabbleGAN, demonstrates the feasibility for various scripts, either in the presence of a high number and variety of abbreviations (Latin) and spellings or letter forms (Old French), in a situation of data scarcity (Armenian), or in the instance of a very cursive script (Arabic Maghribi). We then study the impact of synthetic line generation on HTR, by evaluating the gain for out-of-domain documents and under-resourced classes.
Ana Salgado, Rute Costa, Sara Carvalho, Anas Fahad Khan, Bruno Almeida, Margarida Ramos, Raquel Silva, Mohamed Khemakhem, Laurent Romary and Toma Tasovac. 2023. Domain labelling in the Morais dictionary: bringing structure to unstructured lexicographic data. In 24th Biennial Dictionary Society of North America Conference (DSNA). Boulder, United States.

This article provides a detailed analysis of the use of domain labels, i.e., special markers identifying a specialised field of knowledge, in successive editions of the Morais dictionary. Morais is a historical Portuguese language dictionary, commonly known by and disseminated under the name of António de Morais Silva. This monolingual dictionary has relevance for the Portuguese lexicographic tradition as it inaugurates modern Portuguese lexicography and serves as a model for all subsequent lexicographic production throughout the 19th and 20th centuries. The domain labels were retrieved from the abbreviation lists of its various editions. This work is part of an ongoing Portuguese national linguistic project. It has two goals: 1) to encode the first three editions of the Morais dictionary to make them available online (as well as publishing them as lexical resources using two different standards for structured lexicographic datasets) and 2) to provide a description of the lexicographic components of these editions following a rigorous linguistic treatment. This project is not merely of a lexicographic nature, but it also explores the convergence between lexicography and other research domains, such as terminology, ontologies, linked data, and digital humanities. This article analyzes the domain labelling system in Morais from an evolutionary and diachronic perspective, in line with previous works that highlight the theoretical assumptions and methodological aspects of the lexicographical tradition around domain labelling. To organize lexicographic content, it is helpful to establish a hierarchical structure in general language dictionaries to systematize the included terminological information. Each table of abbreviations has two distinct columns: one with the abbreviation and the other with the complete domain designations. Given the importance of domain labels, we conducted a survey of all domain labels found. We identify and demonstrate the previous and newly added domains. After reviewing the flat domain list, we evaluated whether there was a discernible knowledge organizational approach that identified possible generic domains and subdomains. In the organization of domains, we propose three possible levels: superdomain, domain, and subdomain. The superdomain corresponds to the broadest taxonomic grouping followed by a domain, whereas the subdomain is part of a broader domain. To facilitate the analysis and to focus on interoperability issues, we generated a metalabel, a tag that identifies the English equivalent of the corresponding domain. The lists of domains included in general dictionaries' outside matter follow alphabetical ordering, without any concern for the relationships that can be established between those types of labels. This article describes both onomasiological and semasiological approaches to treating specialized lexicographic content. Following terminological principles and an onomasiological approach, we organize and conceptualize specialized knowledge using structured data formats, such as the Text Encoding Initiative, also considering future alignments between different lexicographic resources. The project will contribute towards a more significant presence of lexicographic digital content in Portuguese through open tools and standards.

Tech reports

Yannick Parmentier, Sylvain Pogodalla, Rachel Bawden, Matthieu Labeau and Iris Eshkol-Taravella. 2023. Procédure de diffusion des publications de l'ATALA sur les archives ouvertes. Technical report.

Other

Alix Chagué and Thibault Clérice. 2023. Deploying eScriptorium online: notes on CREMMA's server specifications.

Laurent Romary. 2023. Monitoring an APC policy - lessons learned and perspective after 7 years.

As part of its open science policy, articulated around a deposit mandate on the French publication repository HAL, Inria decided several years ago to provide internal supervision and support for article processing charges (APC). These charges, which for publishers provide a way of covering publication costs, are now part of an ethical debate surrounding open access. We introduced a policy for covering APCs based upon a central budget and forbidding the payment of APCs for hybrid venues. Each request for funding for a publication through APCs is analysed, focusing on raising awareness, providing support and making recommendations, targeting so-called 'ethical' journals. We will present the results of this policy over a period of several years and outline some of the further directions we want to follow in the future.

Preprints

Paul-Ambroise Duquenne, Kevin Heffernan, Alexandre Mourachko, Benoît Sagot and Holger Schwenk. 2023. SONAR EXPRESSIVE: Zero-shot Expressive Speech-to-Speech Translation. Preprint.

Massively multilingual and multimodal sentence representations like SONAR are usually trained to capture only the meaning of the encoded text or speech. We complement this semantic embedding by a generic speech characteristic embedding which captures the expressive properties of a speech signal. We describe an iterative training procedure which aims to disentangle the semantics and expressive speech properties, and which does not need labeled data. We show the effectiveness of our method on the FLEURS and mEXPRESSO benchmark test sets using multiple metrics which aim to measure the preservation of the meaning and prosody for zero-shot speech-to-speech translation from five languages into English.
Beatrice Biancardi, Mathieu Chollet and Chloé Clavel. 2023. Introducing the 3MT_French Dataset to Investigate the Timing of Public Speaking Judgements. Preprint.

In most public speaking datasets, judgements are given after watching the entire performance, or on thin slices randomly selected from the presentations, without focusing on the temporal location of these slices. This does not make it possible to investigate how people's judgements develop over time during presentations. This contrasts with primacy and recency theories, which suggest that some moments of the speech could be more salient than others and contribute disproportionately to the perception of the speaker's performance. To provide novel insights on this phenomenon, we present the 3MT_French dataset. It contains a set of public speaking annotations collected on a crowd-sourcing platform through a novel annotation scheme and protocol. Global evaluation, persuasiveness, perceived self-confidence of the speaker and audience engagement were annotated on different time windows (i.e., the beginning, middle or end of the presentation, or the full video). This new resource will be useful to researchers working on public speaking assessment and training. It will make it possible to fine-tune the analysis of presentations under a novel perspective relying on socio-cognitive theories rarely studied before in this context, such as first impressions and primacy and recency theories. An exploratory correlation analysis on the annotations provided in the dataset suggests that the early moments of a presentation have a stronger impact on the judgements.
Alix Chagué and Thibault Clérice. 2023. Données ouvertes, données propres, et autres vies : Testaments de Poilus et CREMMA. Preprint.

Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot and Rachel Bawden. 2023. A Simple Method for Unsupervised Bilingual Lexicon Induction for Data-Imbalanced, Closely Related Language Pairs. Preprint.

Existing approaches for unsupervised bilingual lexicon induction (BLI) often depend on good quality static or contextual embeddings trained on large monolingual corpora for both languages. In reality, however, unsupervised BLI is most likely to be useful for dialects and languages that do not have abundant amounts of monolingual data. We introduce a simple and fast method for unsupervised BLI for low-resource languages with a related mid-to-high resource language, only requiring inference on the higher-resource language monolingual BERT. We work with two low-resource languages (<5M monolingual tokens), Bhojpuri and Magahi, of the severely under-researched Indic dialect continuum, showing that state-of-the-art methods in the literature show near-zero performance in these settings, and that our simpler method gives much better results. We repeat our experiments on Marathi and Nepali, two higher-resource Indic languages, to compare approach performances by resource range. We release automatically created bilingual lexicons for the first time for five languages of the Indic dialect continuum.
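
The core idea, encoding words of both the low-resource and the related higher-resource language with a single pretrained encoder and pairing each source word with its nearest target word, can be sketched as follows. The multilingual checkpoint and the tiny word lists are stand-ins; the paper relies on a monolingual BERT of the related higher-resource language and much larger candidate lists.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"      # stand-in encoder for this sketch
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embed(words):
    batch = tokenizer(words, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, subwords, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)            # mean over subword tokens

source_words = ["पानी", "घर"]                               # toy low-resource word list
target_words = ["पानी", "घर", "किताब"]                       # toy higher-resource candidates
src, tgt = embed(source_words), embed(target_words)
sims = torch.nn.functional.cosine_similarity(src.unsqueeze(1), tgt.unsqueeze(0), dim=-1)
for word, idx in zip(source_words, sims.argmax(dim=1)):
    print(word, "->", target_words[int(idx)])
```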
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2023. Headless Language Models: Learning without Predicting with Contrastive Weight Tying. Preprint.

Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
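
A hedged sketch of the contrastive weight-tying objective: rather than a softmax over the vocabulary, each output state is trained to match the input embedding of its target token, contrasted against the other target embeddings in the batch. The tiny transformer, the temperature and the random data are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)

tokens = torch.randint(0, vocab, (8, 16))                 # (batch, seq) toy token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]
causal = nn.Transformer.generate_square_subsequent_mask(inputs.shape[1])
states = encoder(embed(inputs), mask=causal)              # (batch, seq-1, dim)

# Contrastive weight tying: each output state should be closest to the input embedding
# of its own target token among all target embeddings in the batch (in-batch negatives;
# duplicate targets, which would be extra positives, are ignored in this toy version).
pred = F.normalize(states.reshape(-1, dim), dim=-1)
gold = F.normalize(embed(targets).reshape(-1, dim), dim=-1)
logits = pred @ gold.t() / 0.07                           # temperature is an assumption
loss = F.cross_entropy(logits, torch.arange(len(pred)))
loss.backward()
print(float(loss))
```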
Paul-Ambroise Duquenne, Holger Schwenk and Benoît Sagot. 2023. SONAR: Sentence-Level Multimodal and Language-Agnostic Representations. Preprint.

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
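
The similarity-search evaluations mentioned above boil down to cosine retrieval in the shared fixed-size space; a minimal sketch with random stand-in embeddings (real SONAR embeddings would come from the released encoders):

```python
import numpy as np

rng = np.random.default_rng(0)
speech_embs = rng.normal(size=(5, 1024))    # stand-ins for speech-encoder outputs
text_embs = rng.normal(size=(8, 1024))      # stand-ins for text-encoder outputs

def normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

scores = normalise(speech_embs) @ normalise(text_embs).T    # cosine similarities
print(scores.argmax(axis=1))   # index of the closest text for each speech segment
```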
Nathan Godey, Eric Villemonte de La Clergerie and Benoît Sagot. 2023. Is Anisotropy Inherent to Transformers? Preprint.

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.
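
Anisotropy is commonly quantified as the average pairwise cosine similarity between hidden representations; the sketch below, on random stand-in vectors rather than real model states, shows how a shared offset direction drives this value up.

```python
import numpy as np

def mean_pairwise_cosine(hidden):
    normed = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(hidden)
    return (sims.sum() - n) / (n * (n - 1))    # average over off-diagonal pairs

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(1000, 768))                     # roughly isotropic cloud
anisotropic = isotropic + 4.0 * rng.normal(size=(1, 768))    # shared offset -> narrow cone
print(round(mean_pairwise_cosine(isotropic), 3),
      round(mean_pairwise_cosine(anisotropic), 3))
```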
Alix Chagué and Hippolyte Souvay. 2023. Image Acquisition and Layout Analysis. Preprint.

Presentation of key information and processes to work with images in the context of automatic text recognition pipelines and in particular for the detection of the layout, using the eScriptorium application as example.
Floriane Chiffoleau. 2023. TEI Publisher, a platform for sustainable digital editions. Preprint.

Alix Chagué and Floriane Chiffoleau. 2023. What can you do next? Choice of output and reuse of your transcription. Preprint.

Alix Chagué and Floriane Chiffoleau. 2023. ATR: What can eScriptorium do for you? Preprint.

C. Annemieke Romein, Tobias Hodel, Femke Gordijn, Joris Zundert, Alix Chagué, Milan Van Lange, Helle Strandgaard Jensen, Andy Stauder, Jake Purcell, Melissa Terras, Pauline van Den Heuvel, Carlijn Keijzer, Achim Rabus, Chantal Sitaram, Aakriti Bhatia, Katrien Depuydt, Mary Aderonke Afolabi-Adeolu, Anastasiia Anikina, Elisa Bastianello, Lukas Vincent Benzinger, Arno Bosse, David Brown, Ash Charlton, André Nilsson Dannevig, Klaas Van Gelder, Sabine C.P.J. Go, Marcus J.C. Goh, Silvia Gstrein, Sewa Hasan, Stefan von Der Heide, Maximilian Hindermann, Dorothee Huff, Ineke Huysman, Ali Idris, Liesbeth Keijzer, Simon Kemper, Sanne Koenders, Erika Kuijpers, Lisette Rønsig Larsen, Sven Lepa, Tommy Link, Annelies Van Nispen, Joe Nockels, Laura Noort, Joost Johannes Oosterhuis, Vivien Popken, María Estrella Puertollano, Joosep Puusaag, Ahmed Sheta, Lex Stoop, Ebba Strutzenbladh, Nicoline van Der Sijs, Jan Paul van Der Spek, Barry Benaissa Trouw, Geertrui van Synghel, Vladimir Vučković, Heleen Wilbrink, Sonia Weiss, David Joseph Wrisley and Riet Zweistra. 2023. Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done. Preprint.

This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and, subsequently, inform others through HTR-United. Furthermore, we want to suggest appropriate citation methods for ATR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of machine learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.
Tu Anh Nguyen, Maureen De Seyssel, Robin Algayres, Patricia Rozé, Ewan Dunbar and Emmanuel Dupoux. 2023. Are word boundaries useful for unsupervised language learning? Preprint.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina Mcmillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco de Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Hajihosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael Mckenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. 
Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel de Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-Aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada and Thomas Wolf. 2023. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Preprint.

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

2022

PhD theses and Habiliations

Benjamin Muller. 2022. How Can We Make Language Models Better at Handling the Diversity and Variability of Natural Languages ? PhD thesis. Sorbonne Université.

Deep Learning for NLP has led to impressive empirical progress in recent years. In essence, this progress is based on better contextualized representations that can be easily used for a wide variety of tasks. However, these models usually require substantial computing power and large amounts of raw textual data. This makes language's inherent diversity and variability a vivid challenge in NLP. We focus on the following question: how can we make language models better at handling the variability and diversity of natural languages? First, we explore the generalizability of language models by building and analyzing one of the first large-scale replications of a BERT model for a non-English language. Our results raise the question of using these language models on highly variable domains such as those found online. Focusing on lexical normalization, we show that this task can be approached with BERT-like models. However, we show that it only partially helps downstream performance. In consequence, we focus on adaptation techniques using what we refer to as representation transfer and explore challenging settings such as the zero-shot setting and low-resource languages. We show that multilingual language models can be adapted and used efficiently with low-resource languages, even ones unseen during pretraining, and that the script is a critical component in this adaptation.
Clémentine Fourrier. 2022. Neural Approaches to Historical Word Reconstruction. PhD thesis. École Pratique des Hautes Études (PSL).

In historical linguistics, cognates are words that descend in direct line from a common ancestor, called their proto-form, and therefore are representative of their respective languages' evolutions through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manners of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek to methodically study the applicability of machine-translation-inspired neural networks to historical word prediction, relying on the surface similarity of both tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allows us to vary task complexity and data size in a controlled environment, therefore identifying if and under which conditions neural networks were applicable. We then extend our work to real datasets (after having updated an etymological database to gather a sufficient amount of data), study the transferability of our conclusions to real data, then the applicability of a number of data augmentation techniques to the task, to try to mitigate low-resource situations. We finally investigate in more detail our best models, multilingual neural networks. We first confirm that, on the surface, they seem to capture language relatedness information and phonetic similarity, confirming prior work. We then discover, by probing them, that the information they store is actually more complex: our multilingual models actually encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well as or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in the previous works.
Pedro Ortiz Suarez. 2022. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. PhD thesis. Sorbonne Université.

In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of NLP tasks for contemporary, medieval and early modern French, as well as for six other contemporary languages. Furthermore, we determine not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but also that these three features are better predictors of the pre-trained models' performance on downstream tasks than the pre-training data size itself. In fact, we find that the importance of the pre-training dataset size has been largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.

Journal articles

Manuela Sanguinetti, Cristina Bosco, Lauren Cassidy, Özlem Çetinoğlu, Alessandra Teresa Cignarella, Teresa Lynn, Ines Rehbein, Josef Ruppenhofer, Djamé Seddah and Amir Zeldes. 2022. Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations. Language Resources and Evaluation 57 pages 493–544. Springer Verlag.

This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.
Alix Chagué. 2022. eScriptorium : une application libre pour la transcription automatique des manuscrits. Arabesques page 25. Agence bibliographique de l'enseignement supérieur (ABES).

Alix Chagué and Laurent Romary. 2022. L'intelligence artificielle, une ouverture du champ des possibles. Arabesques pages 4–5. Agence bibliographique de l'enseignement supérieur (ABES).

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot and Emmanuel Dupoux. 2022. DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics 10 pages 1051–1065. The MIT Press.

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Tu Anh Nguyen, Benoit Sagot and Emmanuel Dupoux. 2022. Are Discrete Units Necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing 16 pages 1415–1423.

Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1-Speech Only).
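Purely as an illustration of the discretization step described in this abstract, the sketch below clusters continuous speech features with k-means and collapses repeated cluster IDs into a pseudo-text sequence. The random features, the number of clusters and the deduplication step are assumptions standing in for the paper's actual HuBERT-based pipeline.

```python
# A minimal, illustrative sketch (not the authors' code) of the discretization
# step: continuous speech features are clustered with k-means and each frame is
# replaced by its cluster ID, yielding a "pseudo-text" sequence that a standard
# language model can then be trained on.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))   # stand-in for e.g. HuBERT frame features

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)          # one discrete unit per frame

# Collapse consecutive repeated units, as is common before language modelling.
deduped = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
pseudo_text = " ".join(map(str, deduped))
print(pseudo_text[:80])
```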
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl and Alexandra Birch. 2022. Survey of Low-Resource Machine Translation. Computational Linguistics 48 pages 673–732. The MIT Press.

We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal and Mofetoluwa Adeyemi. 2022. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10 pages 50–72. The MIT Press.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Jack Bowers, Axel Herold, Laurent Romary and Toma Tasovac. 2022. TEI Lex-0 Etym–towards terse recommendations for the encoding of etymological information. Journal of the Text Encoding Initiative TEI Consortium.

The present paper describes the etymological component of the TEI Lex-0 initiative which aims at defining a terser subset of the TEI guidelines for the representation of etymological features in dictionary entries. Going beyond the basic provision of etymological mechanisms in the TEI guidelines, TEI Lex-0 Etym proposes a systematic representation of etymological and cognate descriptions by means of embedded constructs based on the <etym> (for etymologies) and <cit> (for etymons and cognates) elements. In particular, given that all the potential contents of etymons are highly analogous to those of dictionary entries in general, the contents presented herein heavily re-use many of the corresponding features and constraints introduced in other components of TEI Lex-0 for the encoding of etymologies and etymons. The TEI Lex-0 Etym model is also closely aligned to ISO 24613-3 on modelling etymological data and the corresponding TEI serialisation available in ISO 24613-4.

Conference proceedings

Anna Chepaikina, Robert Bossy, Catherine Roussey and Stephan Bernard. 2022. Thesaurus Enrichment via Coordination Extraction. In 16th International Conference on Metadata and Semantics Research (MTSR 2022). 1789 pages 191–202. London, United Kingdom.

We advance a method of thesaurus enrichment, based on the extraction of coordinations in a domain-related corpus. Our hypothesis is that there is a semantic homogeneity between the conjuncts located in a coordination. We conducted an experiment that allowed us to evaluate the effectiveness of our method. This experiment aims to enrich the concept hierarchy of a French agricultural thesaurus named French Crop Usage (FCU), thanks to the texts of the Plant Health Bulletins (PHB). The FCU thesaurus is published on the Web using the SKOS model.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, Maja Popović and Mariya Shmatova. 2022. Findings of the 2022 Conference on Machine Translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 1–45. Abu Dhabi, United Arab Emirates.

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment (DA) and a combination of DA and scalar quality metrics (DA+SQM).
Mariana Neves, Antonio Jimeno Yepes, Amy Siu, Roland Roller, Philippe Thomas, Maika Vicente Navarro, Lana Yeganova, Dina Wiemann, Giorgio Maria Di Nunzio, Federica Vezzani, Christel Gérardin, Rachel Bawden, Darryl Johan Estrada, Salvador Lima-López, Eulàlia Farré-Maduell, Martin Krallinger, Cristian Grozea and Aurélie Névéol. 2022. Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports. In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 694–723. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

In the seventh edition of the WMT Biomedical Task, we addressed a total of seven language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian. This year's test sets covered three types of biomedical text genre. In addition to scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical case translations was given special attention by involving clinicians in the preparation of reference translations and manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, we had the participation of five teams.
Omer Goldman, Francesco Tinner, Hila Gonen, Benjamin Muller, Victoria Basmov, Shadrack Kirimi, Lydia Nishimwe, Benoît Sagot, Djamé Seddah, Reut Tsarfaty and Duygu Ataman. 2022. The MRL 2022 Shared Task on Multilingual Clause-level Morphology. In 1st Shared Task on Multilingual Clause-level Morphology. Abu Dhabi, United Arab Emirates.

The 2022 Multilingual Representation Learning (MRL) Shared Task was dedicated to clause-level morphology. As the first ever benchmark that defines and evaluates morphology outside its traditional lexical boundaries, the shared task on multilingual clause-level morphology sets the scene for competition across different approaches to morphological modeling, with 3 clause-level sub-tasks: morphological inflection, reinflection and analysis, where systems are required to generate, manipulate or analyze simple sentences centered around a single content lexeme and a set of morphological features characterizing its syntactic clause. This year's tasks covered eight typologically distinct languages: English, French, German, Hebrew, Russian, Spanish, Swahili and Turkish. The task received submissions of four systems from three teams, which were compared to two baselines implementing prominent multilingual learning methods. The results show that modern NLP models are effective in solving morphological tasks even at the clause level. However, there is still room for improvement, especially in the task of morphological analysis.
Nathan Godey, Roman Castagné, Éric de la Clergerie and Benoît Sagot. 2022. MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling. In Findings of the Association for Computational Linguistics: EMNLP 2022. pages 2859–2870. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pretrained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
Syrielle Montariol, Arij Riabi and Djamé Seddah. 2022. Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. pages 347–363. Association for Computational Linguistics. Online.

Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks' positive impact on bridging the hate speech linguistic and cultural gap between languages.
Syrielle Montariol, Étienne Simon, Arij Riabi and Djamé Seddah. 2022. Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance. In Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. pages 55–65. Association for Computational Linguistics. Dublin, Ireland.

We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT'22 workshop. The task aims at classifying entities in memes into classes such as "hero" and "villain". We use several pre-trained multi-modal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
Jesujoba Alabi, Lydia Nishimwe, Benjamin Muller, Camille Rey, Benoît Sagot and Rachel Bawden. 2022. Inria-ALMAnaCH at WMT 2022: Does Transcription Help Cross-Script Machine Translation? In Proceedings of the Seventh Conference on Machine Translation (WMT). pages 233–243. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates (Hybrid).

This paper describes the Inria ALMAnaCH team submission to the WMT 2022 general translation shared task. Participating in the language directions {cs,ru,uk}→en and cs↔uk, we experiment with the use of a dedicated Latin-script transcription convention aimed at representing all Slavic languages involved in a way that maximises character- and word-level correspondences between them as well as with the English language. Our hypothesis was that bringing the source and target language closer could have a positive impact on machine translation results. We provide multiple comparisons, including bilingual and multilingual baselines, with and without transcription. Initial results indicate that the transcription strategy was not successful, resulting in lower results than baselines. We nevertheless submitted our multilingual, transcribed models as our primary systems, and in this paper provide some indications as to why we got these negative results.
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot and Holger Schwenk. 2022. T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pages 5794–5806. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need of cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we outperform the state of the art for zero-shot speech translation on Must-C. We also introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
Louis Martin, Angela Fan, Éric Villemonte de la Clergerie, Antoine Bordes and Benoît Sagot. 2022. MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 1651–1664. European Language Resources Association. Marseille, France.

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We show that this paraphrase data can be mined in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.
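As a rough illustration of mining paraphrase pairs with sentence embeddings, the sketch below pairs sentences whose embedding similarity exceeds a threshold. The `embed` function, the toy corpus and the threshold value are placeholders, not the system described in the paper.

```python
# Illustrative sketch of embedding-based paraphrase mining, assuming a
# hypothetical `embed` function returning one fixed-size vector per sentence.
import numpy as np

def embed(sentences):
    # Placeholder: replace with a real multilingual sentence encoder.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

corpus = ["sentence one ...", "sentence two ...", "a third sentence ..."]
emb = embed(corpus)
sims = emb @ emb.T                      # cosine similarity (vectors are normalised)
np.fill_diagonal(sims, -1.0)

threshold = 0.75                        # assumed value, for illustration only
pairs = [(i, j) for i in range(len(corpus))
         for j in range(i + 1, len(corpus)) if sims[i, j] > threshold]
print(pairs)                            # candidate paraphrase pairs
```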
Robin Algayres, Adel Nabli, Benoît Sagot and Emmanuel Dupoux. 2022. Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association. pages 2123–2127. Incheon, South Korea.

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
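The following toy sketch illustrates the general idea of a contrastive objective whose positives come from a nearest-neighbour search (here simply k=1 within a batch). It is a simplified assumption-laden illustration, not the authors' implementation.

```python
# Toy sketch of a contrastive objective whose positives come from a k-NN search,
# in the spirit of the method above (not the authors' code).
import torch
import torch.nn.functional as F

def knn_contrastive_loss(z, temperature=0.1):
    """z: (N, D) batch of L2-normalised segment embeddings."""
    sim = z @ z.t()                               # cosine similarities
    sim.fill_diagonal_(-float("inf"))             # exclude self-similarity
    pos_idx = sim.argmax(dim=1)                   # nearest neighbour = positive (k=1)
    logits = sim / temperature
    return F.cross_entropy(logits, pos_idx)       # InfoNCE over the batch

z = F.normalize(torch.randn(32, 128), dim=1)
print(knn_contrastive_loss(z).item())
```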
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 5763–5776. Association for Computational Linguistics. Seattle, United States.

We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables where one latent variable infers keys while another infers values. We run experiments on latent representations and experiments on syntax/semantics transfer which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.
Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary and Benoit Crabbé. 2022. BERTrade: Using Contextual Embeddings to Parse Old French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 1104–1113. European Language Resources Association. Marseille, France.

The successes of contextual word embeddings learned by training large-scale language models, while remarkable, have mostly occurred for languages where significant amounts of raw texts are available and where annotated data in downstream tasks have a relatively regular spelling. Conversely, it is not yet completely clear if these models are also well suited for lesser-resourced and more irregular languages. We study the case of Old French, which is in the interesting position of having relatively limited amount of available raw text, but enough annotated resources to assess the relevance of contextual word embedding models for downstream NLP tasks. In particular, we use POS-tagging and dependency parsing to evaluate the quality of such models in a large array of configurations, including models trained from scratch from small amounts of raw text and models pre-trained on other languages but fine-tuned on Medieval French data.
Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette and Benoît Sagot. 2022. Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 154–165. ATALA. Avignon, France.

Despite their undoubted quality, the resources and tools available for the analysis of Ancien Régime French are no longer able to meet the challenges of research in linguistics and literature for this period. After having precisely defined the chronological framework, we present the corpora made available and the results obtained with them for several NLP tasks, fundamental to the study of language and literature.
Arij Riabi, Syrielle Montariol and Djamé Seddah. 2022. Tâches Auxiliaires Multilingues pour le Transfert de Modèles de Détection de Discours Haineux (Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 413–423. ATALA. Avignon, France.

The task of detecting hateful content is a difficult one, as it requires in-depth cultural and contextual knowledge; the knowledge required varies, among other things, with the language of the speaker or the target of the content. However, annotated data for specific domains and languages are often absent or limited. This is where data in other languages can be exploited, but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of learning multilingual auxiliary tasks (sentiment analysis, named entity recognition and tasks relying on morpho-syntactic information) on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2022. Quand être absent de mBERT n'est que le commencement : Gérer de nouvelles langues à l'aide de modèles de langues multilingues (When Being Unseen from mBERT is just the Beginning : Handling New Languages With Multilingual Language Models). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale. pages 450–451. ATALA. Avignon, France.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Simon Gabay, Rachel Bawden, Philippe Gambette, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2022. Le changement linguistique au XVIIe s. : nouvelles approches scriptométriques. In CMLF 2022 - 8e Congrès Mondial de Linguistique Française. 138 pages 02006.1–14. EDP Sciences. Orléans, France.

Linguistic change in 17th c. France: new scriptometric approaches. The end of the 17th c. remains a blind spot in research on the spelling system, despite its importance for French at this period, during which a strict norm, still (more or less) in place, was created and imposed. Focusing on a practical rather than a theoretical approach, we propose to lay the foundation for a computational scriptometric study of early modern French and analyse the evolution of the spelling system over the 17th c. To do so, we measure and evaluate the distance between the early modern and the contemporary versions of the language, thanks to two automatic normalisers: one rule-based and the other neural-based.
Thibault Charmet, Inès Cherichi, Matthieu Allain, Urszula Czerwinska, Amaury Fouret, Benoît Sagot and Rachel Bawden. 2022. Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France's Court of Cassation Rulings. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 4754–4766. European Language Resources Association. Marseille, France.

Detecting divergences in the applications of the law (where the same legal text is applied differently by two rulings) is an important task. It is the mission of the French Cour de Cassation. The first step in the detection of divergences is to detect similar cases, which is currently done manually by experts. They rely on summarised versions of the rulings (syntheses and keyword sequences), which are currently produced manually and are not available for all rulings. There is also a high degree of variability in the keyword choices and the level of granularity used. In this article, we therefore aim to provide automatic tools to facilitate the search for similar rulings. We do this by (i) providing automatic keyword sequence generation models, which can be used to improve the coverage of the analysis, and (ii) providing measures of similarity based on the available texts and augmented with predicted keyword sequences. Our experiments show that the predictions improve correlations of automatically obtained similarities against our specially collected human judgments of similarity.
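To make the idea of a combined similarity measure concrete, here is a hedged sketch that mixes a TF-IDF cosine similarity over ruling summaries with a Jaccard similarity over (possibly predicted) keyword sequences. The components, the toy data and the equal weighting are illustrative assumptions, not the paper's exact measure.

```python
# Hedged sketch of one way to combine text-based and keyword-based similarity
# when searching for comparable rulings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

texts = ["synthese du premier arret ...", "synthese du second arret ..."]
keywords = [["contrat", "responsabilite"], ["contrat", "prejudice"]]

tfidf = TfidfVectorizer().fit_transform(texts)
text_sim = cosine_similarity(tfidf)[0, 1]
kw_sim = jaccard(keywords[0], keywords[1])

combined = 0.5 * text_sim + 0.5 * kw_sim   # assumed equal weighting
print(round(combined, 3))
```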
Francesco De Toni, Christopher Akiki, Javier De La Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter and Daniel Van Strien. 2022. Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0. In Proceedings of BigScience Episode #5–Workshop on Challenges & Perspectives in Creating Large Language Models. pages 75–83. Association for Computational Linguistics. virtual+Dublin.

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
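A minimal sketch of the prompting strategy follows, with a hypothetical `generate` function standing in for a T0-style text-to-text model; the prompt wording is an assumption for illustration only.

```python
# Sketch of prompt construction for zero-shot NER with a T0-style model; the
# `generate` function is a hypothetical stand-in for a text-to-text model call.
def generate(prompt: str) -> str:
    # Placeholder: in practice this would query a T0-like model.
    return "Paris, Berlin"

def extract_entities(text: str, entity_type: str) -> list:
    prompt = (f"{text}\n\n"
              f"Which {entity_type} are mentioned in the text above? "
              f"Answer with a comma-separated list.")
    answer = generate(prompt)
    return [e.strip() for e in answer.split(",") if e.strip()]

article = "Le train de Paris arrive à Berlin demain matin."
print(extract_entities(article, "locations"))
```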
Clémentine Fourrier and Syrielle Montariol. 2022. Caveats of Measuring Semantic Change of Cognates and Borrowings using Multilingual Word Embeddings. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change. pages 97–112. Association for Computational Linguistics. Dublin, Ireland.

Cognates and borrowings carry different aspects of etymological evolution. In this work, we study semantic change of such items using multilingual word embeddings, both static and contextualised. We underline caveats identified while building and evaluating these embeddings. We release both said embeddings and a newly-built historical words lexicon, containing typed relations between words of varied Romance languages.
Clémentine Fourrier and Benoît Sagot. 2022. Probing Multilingual Cognate Prediction Models. In Findings of the Association for Computational Linguistics: ACL 2022. pages 3786–3801. Association for Computational Linguistics. Dublin, Ireland.

Character-based neural machine translation models have become the reference models for cognate prediction, a historical linguistics task. So far, all linguistic interpretations about latent information captured by such models have been based on external analysis (accuracy, raw results, errors). In this paper, we investigate what probing can tell us about both models and previous interpretations, and learn that though our models store linguistic and diachronic information, they do not achieve it in previously assumed ways.
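For readers unfamiliar with probing, the sketch below shows the general methodology under toy assumptions: a light linear classifier is trained on frozen hidden states to test whether a property of interest (here a synthetic binary label standing in for, e.g., language family) is linearly decodable from them.

```python
# Minimal illustration of the probing methodology; the data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 256))        # stand-in for frozen encoder states
labels = rng.integers(0, 2, size=500)              # stand-in linguistic property

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~0.5 here, since labels are random
```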
Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette and Benoît Sagot. 2022. From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 3367–3374. European Language Resources Association. Marseille, France.

Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 3354–3366. European Language Resources Association. Marseille, France.

Spelling normalisation is a useful step in the study and analysis of historical language texts, whether it is manual analysis by experts or automatic analysis using downstream natural language processing (NLP) tools. Not only does it help to homogenise the variable spelling that often exists in historical texts, but it also facilitates the use of off-the-shelf contemporary NLP tools, if contemporary spelling conventions are used for normalisation. We present FREEMnorm, a new benchmark for the normalisation of Early Modern French (from the 17th century) into contemporary French and provide a thorough comparison of three different normalisation methods: ABA, an alignment-based approach, and MT approaches (both statistical and neural), including extensive parameter searching, which is often missing in the normalisation literature.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael Mckenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf and Alexander M. Rush. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In Proceedings of the The Tenth International Conference on Learning Representations. Online.

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models’ pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pre-trained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero, and all prompts are available at https://github.com/bigscience-workshop/promptsource.
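As a minimal sketch of the prompted-training data format, a supervised example can be mapped to several input/output text pairs; the templates below are invented for illustration and are not those of the released promptsource collection.

```python
# Sketch of mapping one supervised NLI example into several prompted
# text-to-text forms, in the spirit of multitask prompted training.
example = {"premise": "A man is playing a guitar.",
           "hypothesis": "A person is making music.",
           "label": "entailment"}

templates = [
    ("Premise: {premise}\nHypothesis: {hypothesis}\n"
     "Does the premise entail the hypothesis?", "{label}"),
    ("Suppose \"{premise}\". Can we infer that \"{hypothesis}\"?", "{label}"),
]

prompted = [(inp.format(**example), out.format(**example)) for inp, out in templates]
for inp, out in prompted:
    print(inp, "->", out)
```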
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. pages 4344–4355. European Language Resources Association. Marseille, France.

The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant that extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models as well as hopefully other applications in Natural Language Processing and Digital Humanities.

Communications

Anas Fahad Khan, Ana Salgado, Rute Costa, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2022. Interlinking lexicographic data in the MORDigital project. In LLODREAM2022 - LLOD approaches for language data research and management. Mykolas Romeris University. Vilnius, Lithuania.

Rute Costa, Ana Salgado, Margarida Ramos, Fahad Khan, Sara Carvalho, Toma Tasovac, Bruno Almeida, Mohamed Khemakhem, Laurent Romary and Raquel Silva. 2022. Integrating Terminological and Ontological Principles into a Lexicographic Resource. In 1st International Conference on «Multilingual digital terminology today. Design, representation formats and management systems»; Vol-3161 CEUR-WS.org. Padova, Italy.

In this paper we present the research taking place at NOVA CLUNL, where an international team is working on the funded project MORDigital. MORDigital's goal is to encode the selected editions of the Diccionario de Lingua Portugueza by António de Morais Silva (MOR), first published in 1789.
Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard and Marcin Detyniecki. 2022. On the Granularity of Explanations in Model Agnostic NLP Interpretability. In XKDD 2022 - ECML PKDD 2022 International Workshop on eXplainable Knowledge Discovery in Data Mining. Grenoble, France.

Current methods for Black-Box NLP interpretability, like LIME or SHAP, are based on altering the text to interpret by removing words and modeling the Black-Box response. In this paper, we outline limitations of this approach when using complex BERT-based classifiers: The word-based sampling produces texts that are out-of-distribution for the classifier and further gives rise to a high-dimensional search space, which can't be sufficiently explored when time or computation power is limited. Both of these challenges can be addressed by using segments as elementary building blocks for NLP interpretability. As illustration, we show that the simple choice of sentences greatly improves on both of these challenges. As a consequence, the resulting explainer attains much better fidelity on a benchmark classification task.
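The toy sketch below illustrates segment-level perturbation with sentences as building blocks: sentences are masked at random, a (here hypothetical) black-box classifier is queried, and a linear surrogate assigns each sentence an importance weight, in the spirit of LIME-style explainers. The classifier and the texts are invented placeholders.

```python
# Toy sketch of sentence-level perturbation for black-box interpretability.
import numpy as np
from sklearn.linear_model import LinearRegression

def black_box(text: str) -> float:
    return 1.0 if "excellent" in text else 0.2     # placeholder classifier

sentences = ["The plot is thin.", "The acting is excellent.", "Too long overall."]
rng = np.random.default_rng(0)

masks = rng.integers(0, 2, size=(200, len(sentences)))     # which sentences to keep
scores = np.array([black_box(" ".join(s for s, keep in zip(sentences, m) if keep))
                   for m in masks])

surrogate = LinearRegression().fit(masks, scores)          # linear surrogate model
for sent, weight in zip(sentences, surrogate.coef_):
    print(f"{weight:+.2f}  {sent}")
```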
Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Javier Ortiz Suárez, Kelly Christensen, Simon Gabay, Ariane Pinche and Jean-Baptiste Camps. 2022. Gallic(orpor)a : Extraction, annotation et diffusion de l'information textuelle et visuelle en diachronie longue. In DataLab de la BnF : Restitution des travaux 2022. Paris, France.

Presentation of the work carried out in 2022 within the BnF DataLab project Gallic(orpor)a.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep : Un projet de recherche et développement pour la transcription automatique de répertoires de notaires. In La reconnaissance des écritures manuscrites et ses usages dans les archives. Pierrefitte-sur-Seine, France.

Chadi Helwe, Simon Coumes, Chloé Clavel and Fabian Suchanek. 2022. TINA: Textual Inference with Negation Augmentation. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Abu Dhabi, United Arab Emirates.

Transformer-based language models achieve state-of-the-art results on several natural language processing tasks. One of these is textual entailment, i.e., the task of determining whether a premise logically entails a hypothesis. However, the models perform poorly on this task when the examples contain negations. In this paper, we propose a new definition of textual entailment that also captures negation. This allows us to develop TINA (Textual Inference with Negation Augmentation), a principled technique for negated data augmentation that can be combined with the unlikelihood loss function. Our experiments with different transformer-based models show that our method can significantly improve the performance of the models on textual entailment datasets with negation, without sacrificing performance on datasets without negation.
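As a hedged illustration of the unlikelihood idea mentioned in this abstract, the sketch below penalises the probability assigned to a label that becomes invalid once an example is negated. It is a simplified stand-in, not the paper's exact loss.

```python
# Simplified unlikelihood-style loss: discourage the model from predicting a
# label that is no longer valid for a negation-augmented example.
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, invalid_labels, eps=1e-6):
    """logits: (N, C); invalid_labels: (N,) classes that should NOT be predicted."""
    probs = F.softmax(logits, dim=-1)
    p_invalid = probs.gather(1, invalid_labels.unsqueeze(1)).squeeze(1)
    return -torch.log((1.0 - p_invalid).clamp_min(eps)).mean()

logits = torch.randn(4, 3)
invalid = torch.tensor([0, 2, 1, 0])   # e.g. "entailment" becomes wrong once negated
print(unlikelihood_loss(logits, invalid).item())
```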
Chadi Helwe, Chloé Clavel and Fabian Suchanek. 2022. LogiTorch: A PyTorch-based library for logical reasoning on natural language. In The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Abu Dhabi, United Arab Emirates.

Logical reasoning on natural language is one of the most challenging tasks for deep learning models. There has been an increasing interest in developing new benchmarks to evaluate the reasoning capabilities of language models such as BERT. In parallel, new models based on transformers have emerged to achieve ever better performance on these datasets. However, there is currently no library for logical reasoning that includes such benchmarks and models. This paper introduces LogiTorch, a PyTorch-based library that includes different logical reasoning benchmarks, different models, as well as utility functions such as co-reference resolution. This makes it easy to directly use the preprocessed datasets, to run the models, or to finetune them with different hyperparameters. LogiTorch is open source and can be found on GitHub.
Simon Gabay, Rachel Bawden, Benoît Sagot and Philippe Gambette. 2022. Vers l'étude linguistique sur données artificielles. In Variation(s) en français. Nancy, France.

For decades now, several disciplines have been accustomed to working on so-called "synthetic" rather than "real" data, that is, data generated by a computational simulation reflecting the real world. Our presentation proposes to experiment with this method in diachronic linguistics through the generation of pseudo-historical corpora. We revisit this approach from both a methodological and a technical point of view, taking as a case study the spelling variation of French and its evolution during the Ancien Régime.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep (2018-2021) : Projet de lecture automatique de répertoires de notaires. In Segmenter et annoter les images : déconstruire pour reconstruire. Paris, France.

You Zuo, Houda Mouzoun, Samir Ghamri Doudane, Kim Gerdes and Benoît Sagot. 2022. Patent Classification using Extreme Multi-label Learning: A Case Study of French Patents. In SIGIR 2022 - PatentSemTech workshop - 3rd Workshop on Patent Text Mining and Semantic Technologies. Madrid, Spain.

Most previous patent classification methods have treated the task as a general text classification task, and others have tried to implement XML (extreme multi-label learning) methods designed to handle vast numbers of classes. However, they focus only on the IPC subclass level, which has fewer than 700 labels and is far from "extreme." This paper presents a French Patents corpus INPI-CLS extracted from the INPI internal database. It contains all parts of patent texts (title, abstract, claims, description) published from 2002 to 2021, with IPC labels at all levels. We test different XML methods and other classification models at the subclass and group levels of the INPI-CLS dataset with about 600 and 7k labels, respectively, demonstrating the XML approach's validity to patent classification.
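For readers who want to see the task setting concretely, here is a deliberately simple multi-label baseline (one-vs-rest logistic regression over TF-IDF features) with invented patent snippets and IPC-like codes; the extreme multi-label (XML) methods discussed in the paper rely on far more scalable label-tree or embedding techniques.

```python
# Simple multi-label baseline sketch, shown only to make the setting concrete.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

texts = ["a battery management circuit ...", "a method for brewing coffee ..."]
labels = [["H01M", "H02J"], ["A47J"]]               # IPC-like codes, for illustration

y = MultiLabelBinarizer().fit_transform(labels)
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, y)
print(clf.predict(texts))                            # binary indicator matrix
```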
You Zuo, Yixuan Li, Alma Parias García and Kim Gerdes. 2022. Technological taxonomies for hypernym and hyponym retrieval in patent texts. In ToTh 2022 - Terminology & Ontology: Theories and applications. Chambéry, France.

This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological terms for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
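A minimal sketch of the text-to-text formatting such a fine-tuning setup might use follows; the task prefixes and taxonomy edges are invented for illustration and do not come from the paper.

```python
# Sketch of turning taxonomy edges into text-to-text training pairs for a
# T5-style model that generates hypernyms and hyponyms.
taxonomy_edges = [
    ("electric vehicle", "vehicle"),        # (hyponym, hypernym)
    ("lithium-ion battery", "battery"),
]

train_pairs = []
for hyponym, hypernym in taxonomy_edges:
    train_pairs.append((f"generate hypernym: {hyponym}", hypernym))
    train_pairs.append((f"generate hyponym: {hypernym}", hyponym))

for source, target in train_pairs:
    print(source, "->", target)
```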
Laurent Romary and Hugo Scheithauer. 2022. DataCatalogue : enjeux et réalisations. In Un outil numérique pour interroger les catalogues de vente : le projet DataCatalogue. Paris, France.

Aurélia Rostaing and Hugo Scheithauer. 2022. Enrichir le patrimoine écrit archivistique grâce aux technologies numériques : Ingénierie du projet LectAuRep (Lecture automatique de répertoires). In DHNord 2022 - Travailler en Humanités Numériques : collaborations, complémentarités et tensions. Online, France.

Floriane Chiffoleau and Hugo Scheithauer. 2022. From a collection of documents to a published edition : how to use an end-to-end publication pipeline. In TEI 2022 - Text Encoding Initiative 2022 Conference. Newcastle, United Kingdom.

The goal of the workshop is to demonstrate how a corpus can be processed for publication with TEI Publisher. Participants will learn to experiment with a ready-to-use solution that enables quick and easy publication of a corpus, and will also get tips and shortcuts to help speed up the creation of a digital edition. By the end of the session, participants will have a visualisation of their respective corpora, with the transformed text and the original image displayed side by side, showing what can be achieved when working with TEI in the context of an end-to-end publication pipeline.
Ariane Pinche, Kelly Christensen and Simon Gabay. 2022. Between automatic and manual encoding. In TEI 2022 conference : Text as data. Newcastle, United Kingdom.

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source's information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images' content on a large scale. The TEI seems to provide the perfect format to capture both an image's formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
Alix Chagué, Hugo Scheithauer, Lucas Terriel, Floriane Chiffoleau and Yves Tadjo-Takianpi. 2022. Take a sip of TEI and relax: a proposition for an end-to-end workflow to enrich and publish data created with automatic text recognition. In Digital Humanities 2022 : Responding to Asian Diversity. Tokyo, Japan.

Alix Chagué and Thibault Clérice. 2022. Sharing HTR datasets with standardized metadata: the HTR-United initiative. In Documents anciens et reconnaissance automatique des écritures manuscrites. Paris, France.

Hugo Scheithauer. 2022. LectAuRep : Données d'archives en français des XIXe et XXe siècles. In Transkribus / eScriptorium : Transcrire, annoter et éditer numériquement des documents d'archives. Paris, France.

Alix Chagué. 2022. Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains. In 89e Congrès de l'Acfas, Section 310 - Le numérique dans les sciences humaines : édition et visualisation. Montréal, Canada.

A five-minute summary of the doctoral research project entitled "Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains", started in November 2021 and awarded the GREN's 2022 Excellence Grant. The talk placed the project in the context of the current availability of mainstream software for the automatic transcription of handwritten documents and the lack of conceptual and methodological resources needed to take full advantage of it. One of the main difficulties discussed was the convergence of practices towards interoperable models and data.
Florence Clavaud, Laurent Romary, Pauline Charbonnier, Lucas Terriel, Gaetano Piraino and Vincent Verdese. 2022. NER4Archives (named entity recognition for archives) : Conception et réalisation d'un outil de détection, de classification et de résolution des entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Atelier Culture-INRIA. Pierrefitte sur Seine, France.

Hugo Scheithauer, Laurent Romary, Frédérique Duyrat and Federico Nurra. 2022. DataCatalogue : présentation du projet. In Atelier Culture-Inria. Pierrefitte-sur-Seine, France.

Presentation on the DataCatalogue project, jointly led by Inria, the National Library of France (BnF) and the National Institute for Art History (INHA), at the "journée Atelier culture-Inria," held at the Archives nationales on 03/22/2022.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. In CtrlGen: Controllable Generative Modeling in Language and Vision. virtual, France.

Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects, ...) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles, and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on disturbing individual latent variables for the decoder. Our experiments on raw English text from the SNLI dataset show that i) disentanglement of syntactic roles can be induced without supervision, ii) ADVAE separates more syntactic roles than classical sequence VAEs, iii) realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.

Book chapters

Alix Chagué, Victoria Le Fourner, Manuela Martini and Eric Villemonte de La Clergerie. 2022. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les