ALMAnaCH, Inria

Description

MElt is a freely available (LGPL) state-of-the-art sequence labeller designed to be trained on both an annotated corpus and an external lexicon. It was initially developed by Pascal Denis and Benoît Sagot; more recent development has been carried out by Benoît Sagot. MElt can use either multiclass Maximum-Entropy Markov models (MEMMs) or multiclass perceptrons (multitrons) as its underlying statistical model. Its output is in the Brown format: one sentence per line, each sentence being a space-separated sequence of annotated words in the word/tag format.
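To make the Brown format concrete, here is a minimal Python sketch that reads one line of tagged output into (word, tag) pairs. The function name `parse_brown_line` is hypothetical and not part of MElt. The sketch also assumes that the tag follows the last slash of each token, which is a guess about how words containing a slash are handled rather than documented behaviour.

```python
def parse_brown_line(line):
    """Parse one Brown-format line into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash, so that a word
        # which itself contains "/" (e.g. "1/2") keeps its full form.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    for word, tag in parse_brown_line("Le/DET chat/NC dort/V ./PONCT"):
        print(word, tag)
```

Each output line can be passed to this function independently, since the format stores exactly one sentence per line.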

MElt has been trained on various annotated corpora, using for instance Alexina lexicons as a source of lexical information.

MElt also includes a normalisation wrapper intended to help process noisy text, such as user-generated data retrieved from the web. This wrapper is only available for French and English. It has been used to parse web data in both languages: for English during the 2012 SANCL shared task (Google Web Treebank), and for French when developing the French Social Media Bank (Facebook, Twitter and blog data).

You can retrain MElt on your own data, provided you put it in the Brown format, using the MElt-train script. You have to provide an external lexicon file, but it can be an empty file if you don’t want to use external lexical information.
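As an illustration of preparing training data, the following Python sketch serialises tagged sentences into Brown-format lines. The helper names `to_brown_line` and `write_brown_corpus` are hypothetical, and the UTF-8 encoding is an assumption; neither comes from the MElt distribution, and the arguments expected by MElt-train are not shown here.

```python
def to_brown_line(sentence):
    """Turn a list of (word, tag) pairs into one Brown-format line."""
    parts = []
    for word, tag in sentence:
        # The format is space-separated, so neither field may be empty
        # or contain whitespace.
        if not word or not tag or any(c.isspace() for c in word + tag):
            raise ValueError(f"invalid token: {(word, tag)!r}")
        parts.append(f"{word}/{tag}")
    return " ".join(parts)


def write_brown_corpus(sentences, path):
    """Write one sentence per line (UTF-8 is an assumption)."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(to_brown_line(sentence) + "\n")


if __name__ == "__main__":
    print(to_brown_line([("Le", "DET"), ("chat", "NC"), ("dort", "V")]))
```

If no external lexical information is wanted, an empty lexicon file can be created with `open(path, "w").close()`, where `path` is whatever file name you choose.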

Download

The latest version of MElt can be downloaded from GitLab here.

MElt is distributed under a GNU LGPLv3.0 licence.

Publication(s)

If you use this work, please cite:

Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Pascal Denis and Benoît Sagot. 2012. Language Resources and Evaluation. 46(4). Springer Verlag. 721-736.
HAL PDF

@article{denis_Coupling-an-annotated-corpus_2012,
 author = {Denis, Pascal and Sagot, Benoît},
 doi = {10.1007/s10579-012-9193-0},
 title = {Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging},
 year = {2012},
 journal = {Language Resources and Evaluation},
 volume = {46},
 number = {4},
 publisher = {Springer Verlag},
 pages = {721-736},
 url = {https://hal.inria.fr/inria-00614819},
 pdf = {https://hal.inria.fr/inria-00614819/file/lre12-denis-sagot.pdf},
}

External Lexical Information for Multilingual Part-of-Speech Tagging

Benoît Sagot. 2016. Research Report. RR-8924. Inria Paris.
HAL PDF

@techreport{sagot_External-Lexical-Information-for_2016,
 author = {Sagot, Benoît},
 institution = {Inria Paris},
 title = {External Lexical Information for Multilingual Part-of-Speech Tagging},
 year = {2016},
 type = {Research Report},
 number = {RR-8924},
 url = {https://hal.inria.fr/hal-01330301},
 pdf = {https://hal.inria.fr/hal-01330301v3/file/RR-8924.pdf},
}

Tagset

The current tagset used by MElt is as follows (Crabbé & Candito, 2008):

Tag	Description
ADJ	adjective
ADJWH	interrogative adjective
ADV	adverb
ADVWH	interrogative adverb
CC	coordinating conjunction
CLO	object clitic pronoun
CLR	reflexive clitic pronoun
CLS	subject clitic pronoun
CS	subordinating conjunction
DET	determiner
DETWH	interrogative determiner
ET	foreign word
I	interjection
NC	common noun
NPP	proper noun
P	preposition
P+D	preposition+determiner amalgam
P+PRO	preposition+pronoun amalgam
PONCT	punctuation mark
PREF	prefix
PRO	full pronoun
PROREL	relative pronoun
PROWH	interrogative pronoun
V	indicative or conditional verb form
VIMP	imperative verb form
VINF	infinitive verb form
VPP	past participle
VPR	present participle
VS	subjunctive verb form

When using normalisation options, other tags may appear:

With -n, Y means "non-last token of a multi-token unit" and X means "multiword/multitag token".
With -N, Y means "non-last token of a multi-token unit", and multiword/multitag tokens are annotated with tags of the form T1+T2+...+Tn (e.g. chépa/CLS+V+ADV).
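Because the base tagset already contains tags with a plus sign (P+D and P+PRO), splitting a composite tag naively on "+" would break them apart. The Python sketch below shows one possible way to decompose such tags, by greedily matching the longest known tag at each position. This strategy and the function name `split_composite_tag` are assumptions made for illustration, not something MElt itself provides.

```python
# Tags copied from the tagset table above.
TAGSET = {
    "ADJ", "ADJWH", "ADV", "ADVWH", "CC", "CLO", "CLR", "CLS", "CS",
    "DET", "DETWH", "ET", "I", "NC", "NPP", "P", "P+D", "P+PRO",
    "PONCT", "PREF", "PRO", "PROREL", "PROWH", "V", "VIMP", "VINF",
    "VPP", "VPR", "VS",
}


def split_composite_tag(tag, tagset=TAGSET):
    """Split a T1+T2+...+Tn tag into its component tags."""
    parts = tag.split("+")
    result = []
    i = 0
    while i < len(parts):
        # Try the longest run of pieces first, so "P+D" stays whole.
        for j in range(len(parts), i, -1):
            candidate = "+".join(parts[i:j])
            if candidate in tagset:
                result.append(candidate)
                i = j
                break
        else:
            # Unknown piece (e.g. X or Y): keep it unchanged.
            result.append(parts[i])
            i += 1
    return result


if __name__ == "__main__":
    print(split_composite_tag("CLS+V+ADV"))
```

With this approach the example tag CLS+V+ADV yields three component tags, while a plain P+D is returned as a single tag.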

Contact

For more information, or if you have any questions, please contact Benoît Sagot:

Benoit.Sagot[at]inria.fr
