Software and Resources

Language models

CamemBERT
Neural BERT-like language model for French

PAGnol
Neural GPT-based language model for French

FrELMo
ELMo language model for French

MRELMo
ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)

CamemBERTa
A DeBERTa v3-based French language model

CamemBERT-bio
Neural BERT-like language model for the French biomedical domain

CharacterBERT-UGC
A CharacterBERT language model for North-African Arabizi and French user-generated content

D'AlemBERT
Neural BERT-like language model for Early Modern French

MANTa-LM
A robust T5-like model based on a neural tokenizer

Raw corpora

OSCAR
Huge multilingual web-based corpus

goclassy
Asynchronous concurrent pipeline for classifying Common Crawl

Ungoliant
Asynchronous concurrent pipeline for classifying Common Crawl

Speech corpora

Expresso ☕
A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

SpeechMatrix
Speech parallel corpus mined from VoxPopuli

HTR and OCR

KaMI-Lib
KaMI-lib is an HTR- and OCR-engine-agnostic Python package for evaluating transcription models

HTR-United
HTR-United is an open GitHub ecosystem designed to share training data for HTR and OCR tasks

WikiCremma
Dataset for HTR training on Contemporary French

CATMuS Medieval
Handwritten Text Recognition model for medieval manuscripts in Latin scripts

eScriptorium Documentation
Open and collaborative documentation for eScriptorium

HTRomance
Ground-truth for training HTR models

Machine translation

DiscEvalMT
Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation

PFSMB
FR-EN parallel corpus of noisy user-generated content

PMUMT
FR-EN annotated parallel corpus of noisy user-generated content

DiaBLa
Parallel dataset of English-French bilingual dialogues

CoMMuTE
A contrastive evaluation dataset for multimodal (text-image) machine translation

RoCS-MT
Robust Challenge Set for Machine Translation

SONAR
SONAR is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders

T-modules
Approach to cross-modal transfer between speech and text for translation tasks

VGAMT
A multimodal machine translation model

Text simplification

ACCESS
Controllable Text Simplification Model

ASSET
Text Simplification Evaluation Dataset

EASSE
Text Simplification Evaluation Library

tseval
Text Simplification Evaluation Library

Lexicons

WOLF
Free Wordnet for French

Alexina
Morphological (and sometimes syntactic) lexicons (including the Lefff)

EtymDB
Etymological database extracted from Wiktionary

OFrLex-modifier

UDLexicons
Multilingual collection of morphological lexicons

Standardisation

Standardization Survival Kit

SSK
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research

Treebanks

FSMB
French Social Media Bank

FQB
Multi-layered treebank made of questions for French

Sequoia corpus
French corpus with surface and deep syntactic annotations

Parsing

FRMG
A large-coverage meta-grammar for French

dyalog-sr
Transition-based parser built on top of DyALog

DyALog
Environment for building tabular parsers and programs

ELMoLex
Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task

Mgwiki
Linguistic Wiki for FRMG

SYNTAX
Lexical and syntactic parser generator

Shallow processing and tagging

GROBID
Library for extracting, parsing and re-structuring raw documents

GROBID-Dictionaries
GROBID module for structuring digitised lexical resources and entry-based documents

SxPipe
Shallow language pipeline

entity-fishing
Entity recognition and disambiguation

MElt
Statistical part-of-speech tagger

CCASS-sim
Similarity detection tool for legal texts from the Cour de Cassation

D'AlemBERT POS
POS tagger for Early Modern French

D'AlemBERT NER
NER model for Early Modern French

DESIR-CodeSprint-TrackA-TextMining
A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID

grobid-medical-report
GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents

ModFr-norm
Normalisation of Modern (17th c.) French

nerdKid
NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON).

Industrial software

Enqi

vera
Automatic analysis of answers to open-ended questions in employee surveys

feats2notes
Generation of comments from structured data

Other annotated corpora

VerDI project release

3MT_French Dataset
3 Minutes Thesis Corpus

FreEM-corpora
Corpora and NLP tools for Early Modern French (16th-18th c.)