ALMAnaCH lab
Inria project-team
Software and Resources
Language models
Raw corpora
Speech corpora
HTR and OCR
Machine translation
Text simplification
Lexicons
Standardisation
Treebanks
Parsing
Shallow processing and tagging
Industrial software
Other annotated corpora
Language models
CamemBERT
Neural BERT-like language model for French
PAGnol
Neural GPT-based language model for French
FrELMo
ELMo language model for French
MRELMo
ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)
CamemBERTa
A DeBERTa v3-based French language model
CamemBERT-bio
Neural BERT-like language model for the French biomedical domain
CharacterBERT-UGC
A CharacterBERT language model for North-African Arabizi and French user-generated content
D'AlemBERT
Neural BERT-like language model for Early Modern French
MANTa-LM
A robust T5-like model based on a neural tokenizer
Raw corpora
OSCAR
Huge multilingual web-based corpus
goclassy
Asynchronous concurrent pipeline for classifying Common Crawl
Ungoliant
Asynchronous concurrent pipeline for classifying Common Crawl
Speech corpora
Expresso ☕
A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
SpeechMatrix
Speech parallel corpus mined from VoxPopuli
HTR and OCR
KaMI-Lib
KaMI-lib is an engine-agnostic Python package for evaluating HTR and OCR transcription models
HTR-United
HTR-United is an open GitHub ecosystem designed to share training data for HTR and OCR tasks
WikiCremma
Dataset for HTR training on Contemporary French
CATMuS Medieval
Handwritten Text Recognition model for medieval manuscripts in Latin scripts
eScriptorium Documentation
Open and collaborative documentation for eScriptorium
HTRomance
Ground-truth for training HTR models
Machine translation
DiscEvalMT
Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation
PFSMB
FR-EN parallel corpus of noisy user-generated content
PMUMT
Annotated FR-EN parallel corpus of noisy user-generated content
DiaBLa
Parallel dataset of English-French bilingual dialogues
CoMMuTE
A contrastive evaluation dataset for multimodal (text-image) machine translation
RoCS-MT
Robust Challenge Set for Machine Translation
SONAR
SONAR is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders
T-modules
Approach to cross-modal transfer between speech and text for translation tasks
VGAMT
A multimodal machine translation model
Text simplification
ACCESS
Controllable Text Simplification Model
ASSET
Text Simplification Evaluation Dataset
EASSE
Text Simplification Evaluation Library
tseval
Text Simplification Evaluation Library
Lexicons
WOLF
Free Wordnet for French
Alexina
Morphological (and sometimes syntactic) lexicons (including the Lefff)
EtymDB
Etymological database extracted from Wiktionary
OFrLex-modifier
UDLexicons
Multilingual collection of morphological lexicons
Standardisation
Standardization Survival Kit
SSK
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
Treebanks
FSMB
French social media bank
FQB
Multi-layered treebank made of questions for French
Sequoia corpus
French corpus with surface and deep syntactic annotations
Parsing
FRMG
A large-coverage meta-grammar for French
dyalog-sr
Transition-based parser built on top of DyALog
DyALog
Environment for building tabular parsers and programs
ELMoLex
Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task
Mgwiki
Linguistic Wiki for FRMG
SYNTAX
Lexical and syntactic parser generator
Shallow processing and tagging
GROBID
Library for extracting, parsing and re-structuring raw documents
GROBID-Dictionaries
GROBID module for structuring digitised lexical resources and entry-based documents
SxPipe
Shallow language pipeline
entity-fishing
Entity recognition and disambiguation
MElt
Statistical part-of-speech tagger
CCASS-sim
Similarity detection tool for legal texts from the Cour de Cassation
D'AlemBERT POS
POS tagger for Early Modern French
D'AlemBERT NER
NER model for Early Modern French
DESIR-CodeSprint-TrackA-TextMining
A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID
grobid-medical-report
GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents
ModFr-norm
Normalisation of Modern (17th c.) French
nerdKid
NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON)
Industrial software
Enqi
vera
Automatic analysis of answers to open-ended questions in employee surveys
feats2notes
Generation of comments from structured data
Other annotated corpora
VerDI project release
3MT_French Dataset
3 Minutes Thesis Corpus
FreEM-corpora
Corpora and NLP tools for Early Modern French (16th-18th c.)