ALMAnaCH organises regular seminars in NLP and digital humanities. Everyone is welcome!
Sign up to email@example.com to receive seminar announcements.
Data quality for low-resource MT
Abstract: In this talk I will present the findings of a collaborative audit of multilingual corpora, with special attention to low-resource languages. We will discuss the challenges that come with building such corpora, and the risks of using them without inspection. With a case study on a subset of African languages, I will illustrate the implications of building machine translation on low-quality parallel data.
Practical proposals for the digital editing of modern French texts
Abstract: Nearly a century ago, the literature of the Grand Siècle missed its encounter with Romance philology, which has not been without consequences for the quality of editions of texts that are nonetheless called "classics": it is crucial that this mistake not be repeated with computational philology. Extending the famous tradition of the Instructions pour la publication and other Règles pour l’édition, we wish to share some proposals for the digital editing of modern French texts. By presenting the processing pipeline we are developing, we will endeavour to give a practical dimension to our theoretical reflections on the ecdotic renewal we are calling for.
Modelling, synthesis and editable representation of sign languages
Abstract: Sign languages are fully fledged languages, gestural rather than vocal. The work presented here concerns their computational processing, a research field still in its infancy. Three strands will be presented, starting with the formal representation of sign languages. We present the construction of an approach and a model (AZee), which among other things enables their synthesis by a virtual signer (3D avatar); this is the subject of the second strand. By way of opening, a final part addresses the question of an editable form for the language, which has no written form. By observing spontaneous graphical productions by signers putting discourse in their language on paper, we were able to relate them to results obtained with AZee. We believe this offers a lead for defining an intuitive graphical representation system for sign language, and perhaps even for developing a writing system.
Because the study of sign languages is much more recent than that of their vocal or written counterparts, linguistic knowledge about them is more limited, and their automatic processing is only possible in an interdisciplinary manner, advancing jointly on the linguistic and computational fronts. Advances in formal representation thus have implications for, or contrasts with, linguistics, and we will highlight some of them.
Self-Supervised Representation Learning for Pre-training Speech Systems
Abstract: Self-supervised learning using huge amounts of unlabeled data has been successfully explored for image processing and natural language processing. Since 2019, several works have also investigated self-supervised representation learning from speech. They were notably successful in improving performance on downstream tasks such as speech recognition. These recent works suggest that it is possible to reduce dependence on labeled data for building speech systems through acoustic representation learning. In this talk I will present an overview of these recent approaches to self-supervised learning from speech and show my own investigations into using them in an end-to-end automatic speech translation (AST) task, for which the size of training data is generally limited.
LECTAUREP: Automatic Reading of Registers (Lecture Automatique des Répertoires)
Abstract: This talk gives a progress report on the LECTAUREP project, in which the ALMAnaCH team and the Archives Nationales have been collaborating since 2018. The project aims to facilitate access to the very large corpus of registers of deeds of Parisian notaries through automatic handwritten text recognition and text mining. Beyond data collection, this collaboration is an opportunity to explore the methodological and infrastructural implications of such projects. (joint work with Laurent Romary)
Natural Language Generation: Training, Inference & Evaluation
Abstract: Recent advances in the field of natural language generation are undoubtedly impressive. Yet little has changed, from training and inference to evaluation. Models are learned with Teacher Forcing, inferred via Beam Search, and evaluated with BLEU or ROUGE. However, these algorithms suffer from many well-known limitations. How can these limitations be overcome? In this talk, we will present recently proposed methods that could be part of the solution, paving the way for better NLG.
Joint work with Jacopo Staiano
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Abstract: Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
Joint work with Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum and Junichi Tsujii
Learning Sound Correspondences: What about Neural Networks?
Abstract: Cognate and proto-form prediction are key tasks in computational historical linguistics, which rely heavily on the identification of sound correspondences and could help low-resource translation. In the last two decades, a combination of sequence alignment, statistical models and clustering methods has emerged to try to solve these tasks. But where are the neural networks? In this talk, I will present my ongoing research investigating the learnability of sound correspondences between a proto-language and daughter languages by neural models.
I will introduce: (i) MEDeA, a Multiway Encoder Decoder Architecture inspired by NMT, (ii) EtymDB2.0, the etymological database that we updated to generate much needed data, (iii) our experiments on plausible artificial languages as well as on real languages.
Limits, open questions and current trends in Transfer Learning for NLP
Abstract: This talk is a subjective walk through my favorite papers and research directions in late 2019/early 2020. I’ll roughly cover the topics of model size and computational efficiency, model evaluation, fine-tuning, out-of-domain generalization, sample efficiency, common sense and inductive biases. The talk is adapted from the sessions I gave in early 2020 at the NLPL Winter School.
Can multilingual BERT transfer to an Out-of-Distribution dialect? A case study on North African Arabizi
Abstract: Building natural language processing systems for highly variable and low resource languages is a hard challenge. The recent success of large-scale multilingual pretrained language models provides us with new modeling tools to tackle it. In this talk, I will present my ongoing research in testing the ability of the multilingual version of BERT to model an unseen dialect. We take user-generated North African Arabic text as our case study. We show in different scenarios that multilingual language models are able to transfer to an unseen dialect, specifically in two extreme cases: across script (Arabic to Latin) and from Maltese, a distantly related language, unseen during pretraining.
Joint work with Benoît Sagot and Djamé Seddah.
Neural Semantic Role Labeling for French FrameNet With Deep Syntactic Information
Abstract: A recent graph-based neural architecture for semantic role labeling (SRL) developed by He et al. (2018) jointly predicts argument spans, predicates and the relations between them without using gold predicates as input features. Although it works well on PropBank-style data, this architecture makes some systematic mistakes when used on a more semantically oriented resource such as French FrameNet.
We adapt the system of He et al. (2018) to semantic role prediction for French FrameNet. In contrast to that system, we do not predict the full spans of the arguments directly, but implement a two-step pipeline: we first predict the syntactic heads of the argument spans, then reconstruct the full spans using surface and deep syntax. While the idea of reconstructing argument spans using syntactic information is not new, the novelty of our work lies in using deep syntactic dependency relations for the full span recovery. We obtain deep syntactic information using the symbolic conversion rules described in Michalon et al. (2016). We present the results of the ongoing semantic role labeling experiments for French FrameNet and discuss the advantages and challenges of our approach.
Neural and Symbolic Representations of Speech and Language
Abstract: As end-to-end architectures based on neural networks became the tool of choice for processing speech and language, there has been increased interest in techniques for analyzing and interpreting the representations emerging in these models. A large array of analytical techniques have been proposed and applied to diverse architectures. Given that the developments in this field have been so fast, it is perhaps inevitable that some of them also turn out to be loose.
In this talk I firstly focus on one pitfall not always successfully avoided in work on neural representation analysis: the role of learning. In many cases non-trivial representations can be found in the activation patterns of randomly initialized, untrained neural networks. In past studies this phenomenon has not always been properly accounted for, which means that the results reported in them need to be reconsidered. Here I revisit the issue of the representations of phonology in neural models of spoken language.
Secondly I present two methods based on Representational Similarity Analysis (RSA) and Tree Kernels (TK) which allow us to directly quantify how strongly the information encoded in neural activation patterns corresponds to information represented by symbolic structures such as syntax trees. I first validate the methods on the case of a simple synthetic language for arithmetic expressions with clearly defined syntax and semantics, and show that they exhibit the expected pattern of results. I then apply these methods to correlate neural representations of English sentences with their constituency parse trees.
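As a rough illustration of the RSA idea mentioned above, the sketch below compares two similarity matrices computed over the same items: one from activation vectors, one from symbolic structures. It is only a toy under stated assumptions, not the speaker's implementation: the function names are hypothetical, the data is invented, and a Jaccard overlap between sets of tree-fragment labels stands in for a real tree kernel.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def jaccard_similarity_matrix(feature_sets):
    """Pairwise Jaccard overlap between sets of symbolic features.

    Assumption: this is a crude stand-in for a tree kernel.
    """
    n = len(feature_sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = feature_sets[i] | feature_sets[j]
            shared = feature_sets[i] & feature_sets[j]
            sim[i, j] = len(shared) / len(union) if union else 1.0
    return sim

def rsa_score(sim_a, sim_b):
    """Pearson correlation between the upper triangles of two similarity matrices."""
    idx = np.triu_indices_from(sim_a, k=1)  # each pair once, diagonal excluded
    return float(np.corrcoef(sim_a[idx], sim_b[idx])[0, 1])

# Invented toy data: four "sentences" with 2-d activations and fragment sets.
activations = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
fragments = [{"NP", "VP"}, {"NP", "VP", "PP"}, {"ADJP"}, {"ADJP", "PP"}]

score = rsa_score(cosine_similarity_matrix(activations),
                  jaccard_similarity_matrix(fragments))
print(f"RSA score: {score:.3f}")
```

A score near 1 would mean that items close in activation space are also close in the symbolic space; as the abstract stresses, such a score should be compared against the same measurement on an untrained network.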
How to choose the test set size? Some observations on the evaluation of PoS taggers on the Universal Dependencies project
Abstract: This presentation questions the usual framework of statistical learning, in which test and train sets are fixed arbitrarily and independently of the model considered. Taking the evaluation of PoS taggers on the UD project as an example, we show that, in many cases, it is possible to use smaller test sets than those generally available without hurting evaluation quality, and that the examples 'saved' in this way can be added to the train set to improve system performance, especially in the context of domain adaptation.
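To make the trade-off concrete, here is a minimal sketch of how the uncertainty of a measured accuracy shrinks with test set size. It uses the textbook normal-approximation confidence interval and assumes independent test items, which is not necessarily the analysis used in the talk; the function names are hypothetical.

```python
import math

def accuracy_margin(accuracy, n_examples, z=1.96):
    """Half-width of the normal-approximation confidence interval for an
    accuracy measured on n_examples independent items (z=1.96 is about 95%)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

def examples_needed(accuracy, margin, z=1.96):
    """Smallest test set size whose margin does not exceed `margin`."""
    return math.ceil(z ** 2 * accuracy * (1.0 - accuracy) / margin ** 2)

# Margin for a tagger at 95% accuracy, for increasingly large test sets.
margins = {n: accuracy_margin(0.95, n) for n in (1_000, 10_000, 100_000)}
for n, margin in margins.items():
    print(f"{n:>7} test tokens: 95% +/- {100 * margin:.2f} points")

print(examples_needed(0.95, 0.005), "tokens suffice for a +/- 0.5 point margin")
```

Under these assumptions, the margin shrinks only with the square root of the test set size, so very large test sets add little precision and part of the data could be moved to the train set.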
Crowdsourcing: a magnifying mirror on manual annotation
Abstract: Manual corpus annotation is at the heart of today's natural language processing: it not only provides the examples used to train machine-learning tools, it also serves as the reference in evaluation campaigns. It is, in effect, the place where linguistics has taken refuge in the field. Yet it remains largely under-studied. Approaching the subject through the prism of gamified crowdsourcing means looking at its hardest points in a magnifying mirror. The essential questions of production quality, tool-induced biases and annotator expertise are indeed magnified by numbers and distance. This magnifying effect complicates the experiments, but it also pushes us to devise original solutions, which enrich thinking about traditional manual annotation and put the annotator back at the heart of the process.
Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution
Abstract: In this talk, I will describe our ongoing work on fused heads. We provide the first computational treatment of fused-head (FH) constructions, focusing on numeric fused heads (NFH). FH constructions are noun phrases (NPs) in which the head noun is missing and is said to be “fused” with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a challenge for computational models. We frame the handling of FHs as a two-stage process: identification of the FH construction and resolution of the missing head. We explore the NFH phenomenon in large corpora of English text and create (1) a dataset and a highly accurate method for NFH identification; (2) a crowd-sourced dataset of 10k examples (1M tokens) for NFH resolution; and (3) a neural baseline for the NFH resolution task.
Adapting an Existing French Metagrammar for Old and Middle French
Abstract: Although many texts in Old French (9th-13th c.) and Middle French (14th-15th c.) are now available, only a few of them are annotated with dependency syntax. Our goal is to extend the already existing data, the Old French treebank SRCMF “Syntactic Reference Corpus of Medieval French” (Prévost and Stein 2013) to obtain an annotated corpus of one million words also covering Middle French.
These stages of French are subject to strong variation (language evolution, dialects, forms and domains) and are characterised by free word order as well as null subjects. To deal with these difficulties, we have opted for the formalism of metagrammars (Candito 1999), which offers a modular, constraint-based representation of syntactic phenomena through classes. More precisely, we are adapting the French Metagrammar (FRMG) (Villemonte de la Clergerie 2005) to Old and Middle French, because there are enough similarities between these stages of French. In this talk, we will present the processing chain developed by the ALMAnaCH team and the choices we made to adapt the metagrammar to earlier stages of a language.
Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them
Abstract: Word embeddings are widely used in the NLP community for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is only hiding the bias, not removing it. The gender bias information is still reflected in the distances between "gender-neutralized" words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.
French MultiWord Expressions representation and parsing
Abstract: Many NLP tasks, such as natural language understanding, require a representation of syntax and semantics in texts. MultiWord Expressions (MWEs), which can be described as a set of (not necessarily contiguous) tokens that exhibit some idiosyncratic properties (Baldwin and Kim, 2010), are, to quote Sag et al. (2001), "a pain in the neck for NLP". MWEs are difficult to process because their syntactic behavior tends to be unpredictable: they can have an irregular internal syntax and a non-compositional meaning. MWE-aware NLP systems are also hard to evaluate because, until the recent PARSEME COST initiative (Savary et al., 2017), there were only a few corpora annotated with MWEs (Laporte et al., 2008). I will first present my previous work on named entity recognition (Dupont et al., 2017), showing how named entities are related to MWEs, before delving deeper into MWEs.
I will present in more detail how they are a challenge, and how we can represent them using metagrammars (Savary et al., 2018), more precisely within the FRMG framework of de la Clergerie (2010).
Joint work with Eric Villemonte de la Clergerie and Yannick Parmentier
Tackling Machine Translation of Noisy Text
Abstract: Despite their recent success, neural machine translation systems have proven to be brittle in the face of non-standard inputs that are far from their training domain. This is particularly salient for the kind of noisy, user-generated content ubiquitous on social media and the internet in general.
In this talk I will present MTNT, our first step to remedy this situation by proposing a testbed for Machine Translation of Noisy Text. MTNT consists of parallel Reddit comments in three languages (English, French, Japanese) exhibiting a large amount of typos, grammar errors, code switching and more. I will discuss the challenges of the collection process, preliminary MT experiments and outlook for future work (and a sneak peek of ongoing follow-up research).
A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations with Case Studies
Abstract: In this talk, we describe a multi-source trainable parser developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. We also investigate parsing low-resource languages with very small training corpora, using multilingual word embeddings and annotated corpora of larger languages. The study demonstrates that specific language combinations enable improved dependency parsing when compared to previous work, allowing for wider reuse of pre-existing resources when parsing low-resource languages. The study also explores the question of whether contemporary contact languages or genetically related languages would be the most fruitful starting point for multilingual parsing scenarios.
Predictive processing in lexical and syntactic acquisition
Abstract: There is a general consensus in the field of language acquisition that infants use syntactic context to bootstrap their learning of the meaning of words. This is known as the syntactic bootstrapping hypothesis. For example, toddlers use the distributional information that articles tend to be followed by nouns (e.g., "la balle"), and pronouns tend to be followed by verbs (e.g., "elle saute"), to infer whether a novel word is likely to refer to an object or an action (e.g., "la dase" is likely to refer to an object and not an action). Previous modeling studies show that the distribution of syntactic contexts in the input is indeed a reliable cue to class membership. Thus, models that rely on frequent contexts show good categorization of unfamiliar words into nouns and verbs. My talk will focus on the question of whether children and infants can keep track of changes in the distribution of structures in their input, and update their predictions accordingly. I will present experimental results from my own studies with children, and suggest how we could model such effects.
Cooking entities at low heat: a recipe for entity disambiguation in scientific publications
Abstract: Entity ambiguity is a frequently encountered problem in digital publication libraries. Author/organisation disambiguation is one of the best-known use cases, but there are others. We present “entity-cooking”: a generic, Machine Learning-based framework for entity matching/disambiguation. Developed with the help of Patrice Lopez, it is a tool offering a reusable entity disambiguation engine that requires only “minimal” adaptations and is independent of any specific domain. Lightweight by design, it provides a standardised REST API and supports XML-TEI or PDF (via Grobid) as input data. This project started in 2016; as of today we have implemented an author/organisation disambiguation solution, we have produced a manually annotated corpus (including affiliation references) and we are investigating the application to geographical location and toponym resolution (SemEval 2019, task 12).
The Web and its publics
Abstract: In this seminar, we will discuss the social and political consequences of the organization of digital media. We will consider the limits of a simplistic reading of the power-law distribution of online visibility and the hopes raised by the thematic clustering and the dynamism of the Web. We will also study the risks that these dynamics entail, exploring the causes of the recent proliferation of 'junk news'.
Building a Treebank for Naija, the English-based Creole of Nigeria.
Abstract: As an example of treebank development without pre-existing language-specific NLP tools, we will present the ongoing work of constructing a 750,000-word treebank for Naija. The annotation project, part of the NaijaSynCor ANR project, has a social dimension because the language, which is not fully recognized as such by the speakers themselves, is not yet institutionalized in any way. Yet Naija, spoken by close to 100 million speakers, could play an important role in the nation-building process of Nigeria. We will briefly present a few particularities of Naija such as serial verbs, reduplications, and emphatic adverbial particles. We used a bootstrapping process of manual annotation and parser training to enhance and speed up the annotation process. The annotation is done in the Syntactic Universal Dependencies scheme (SUD), which allows seamless transformation into Universal Dependencies (UD) by means of Grew (http://grew.fr/), a rule-based graph rewriting system. We will present the different tools involved in this process, and we will show a few preliminary quantitative measures on the annotated sentences.
New Resources and Ideas for Semantic Parsing
Abstract: In this talk, I will give an overview of research being done at the University of Stuttgart on semantic parser induction and natural language understanding. The main topic, semantic parser induction, relates to the problem of learning to map input text to full meaning representations from parallel datasets. Such resulting “semantic parsers” are often a core component in various downstream natural language understanding applications, including automated question-answering and generation systems. We look at learning within several novel domains and datasets being developed in Stuttgart (e.g., software documentation for text-to-code translation) and under various types of data supervision (e.g., learning from entailment, “polyglot” modeling, or learning from multiple datasets).
Bio: Kyle Richardson is a finishing PhD student at the University of Stuttgart (IMS), working on semantic parsing and various applications thereof. Prior to this, he was a researcher in the Intelligent Systems Lab at the Palo Alto Research Center (PARC), and holds a B.A. from the University of Rochester, USA. He’ll be joining the Allen Institute for AI in November.
Historical text normalization with neural networks
Abstract: With the increasing availability of digitized historical documents, interest in effective NLP tools for these documents is on the rise. The abundance of variant spellings, however, makes them challenging to work with for both humans and machines. For my PhD thesis, I worked on automatic normalization—mapping historical spellings to modern ones—as a possible approach to this problem. I looked at datasets of historical texts in eight different languages and evaluated normalization using rule-based, statistical, and neural approaches, with a particular focus on tuning a neural encoder–decoder model. In this talk, I will highlight what I learned from different perspectives: Why, what, and how to normalize? How do the different approaches compare and which one should I use? And what can we learn from this about neural networks that might be useful for other NLP tasks?
Text readability assessment for second language learners
Abstract: In this talk, I will present our work on readability assessment for texts aimed at second language (L2) learners. I will discuss the approaches to this task and the features that we use in the machine learning framework. One of the major challenges in this task is the lack of sufficiently large level-annotated data for L2 learners, as most models are aimed at native English speakers and trained on the large amounts of text available for them. I will give an overview of methods for adapting models trained on larger native corpora to estimate text readability for L2 learners. Once the readability level of the text is assessed, the text can be adapted (e.g., simplified) to the level of the reader. The first step in this process is the identification of words and phrases in need of simplification or adaptation. This task is called Complex Word Identification (CWI), and it has recently attracted much attention. In the second part of the talk, I will discuss the approaches to CWI and present our winning submission to the CWI Shared Task 2018.
Circadian dynamics of language: how massive data makes it possible to probe new scales
Abstract: Linguistics has studied language dynamics on scales ranging from a few decades to a few millennia. In recent years, some studies based on online media, notably forums, have looked at scales on the order of a year, or even a month. What about even smaller scales? Can we observe phenomena on the order of a day? Of an hour? While chronobiology has shown that our cognitive abilities vary according to circadian rhythms, little has been said about language. Using data from Twitter, we will show that linguistic dynamics can be observed at new scales, and we will highlight the existence of circadian rhythms in lexical usage.
Quality Evaluation of Machine Translation into Arabic
Abstract: In machine translation, automatically obtaining a reliable assessment of translation quality is a challenging problem. Several techniques for automatically assessing translation quality for different purposes have been proposed, but these are mostly limited to strict string comparisons between the generated translation and translations produced by humans. This approach is too simplistic and ineffective for languages with flexible word order and rich morphology such as Arabic, a language for which machine translation evaluation is still an under-studied problem, despite posing many challenges. In this talk, I will first introduce AL-BLEU, a metric for Arabic machine translation evaluation that uses a rich set of morphological, syntactic and lexical features to extend the evaluation beyond exact matching. We show that AL-BLEU has a stronger correlation with human judgments than the state-of-the-art classical metrics. Then, I will present a more advanced study in which we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve the evaluation of MT into Arabic. Our results show that using a neural-network model with different input representations clearly outperforms the state of the art for the evaluation of MT into Arabic, with an increase of roughly 75% in correlation with human judgments on a pairwise MT quality evaluation task.
Socioeconomic dependencies of linguistic patterns in Twitter: Correlation and learning
Abstract: Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from the national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to using more standard language than the northern one, while locally the variety or dialect used is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. In the second part of the talk we will discuss how linguistic information and the detected correlations can be used for the inference of socioeconomic status.