MElt is a freely available (LGPL) state-of-the-art sequence labeller designed to be trained on both an annotated corpus and an external lexicon. It was initially developed by Pascal Denis and Benoît Sagot; more recent development has been carried out by Benoît Sagot. MElt can use either multiclass Maximum-Entropy Markov models (MEMMs) or multiclass perceptrons (multitrons) as its underlying statistical model. Its output is in the Brown format: one sentence per line, each sentence being a space-separated sequence of annotated words in the word/tag format.
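To make the Brown format concrete, here is a minimal Python sketch that reads one line of word/tag output back into pairs. It is not part of MElt; the function name `parse_brown_line` and the sample sentence are illustrative assumptions. The sketch splits each token at its last slash, on the assumption that a word may itself contain a slash while a tag does not.

```python
def parse_brown_line(line):
    """Parse one Brown-format line into a list of (word, tag) pairs.

    Hypothetical helper, not part of MElt. Each token is split at its
    LAST slash, so a word such as "1/2" keeps its internal slash.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"not a word/tag token: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # Illustrative sentence, not actual MElt output.
    for word, tag in parse_brown_line("Le/DET chat/NC dort/V ./PONCT"):
        print(word, tag)
```

Tags such as P+D contain a plus sign but no slash, so splitting at the last slash handles them without special treatment.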
MElt has been trained on various annotated corpora, using, for instance, Alexina lexicons as a source of lexical information.
MElt also includes a normalisation wrapper aimed at helping to process noisy text, such as user-generated data retrieved from the web. This wrapper is only available for French and English. It was used for parsing web data in both languages: for English during the 2012 SANCL shared task (Google Web Bank), and for French when developing the French Social Media Bank (Facebook, Twitter and blog data).
You can retrain MElt on your own data with the MElt-train script, provided the data is in the Brown format. You must also provide an external lexicon file, which can be empty if you do not want to use external lexical information.
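As a rough illustration of preparing such input, the following Python sketch writes tagged sentences to a Brown-format file and creates an empty lexicon file. It does not call MElt-train, whose command-line options are not described here. The function names, the file paths, and the choice to reject tokens containing whitespace are assumptions of this sketch, not MElt requirements.

```python
from pathlib import Path


def to_brown_line(sentence):
    """Render a list of (word, tag) pairs as one Brown-format line."""
    tokens = []
    for word, tag in sentence:
        # Assumption of this sketch: whitespace inside a word or tag would
        # break the space-separated format, so such input is rejected.
        if not word or not tag or any(ch.isspace() for ch in word + tag):
            raise ValueError(f"cannot encode token: {(word, tag)!r}")
        tokens.append(f"{word}/{tag}")
    return " ".join(tokens)


def write_training_files(sentences, corpus_path, lexicon_path):
    """Write one sentence per line, and make sure a lexicon file exists.

    Hypothetical helper: the lexicon file is created empty if it does not
    exist yet; an existing lexicon is left untouched.
    """
    lines = [to_brown_line(sentence) + "\n" for sentence in sentences]
    Path(corpus_path).write_text("".join(lines), encoding="utf-8")
    Path(lexicon_path).touch()


if __name__ == "__main__":
    # Illustrative data and file names.
    data = [[("Le", "DET"), ("chat", "NC"), ("dort", "V"), (".", "PONCT")]]
    write_training_files(data, "train.brown", "empty.lex")
```

The resulting corpus and lexicon files can then be passed to the training script according to its own documentation.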
The latest version of MElt can be downloaded from its GitLab repository.
MElt is distributed under a GNU LGPLv3.0 licence.
@article{denis_Coupling-an-annotated-corpus_2012,
author = {Denis, Pascal and Sagot, Benoît},
doi = {10.1007/s10579-012-9193-0},
title = {Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging},
year = {2012},
journal = {Language Resources and Evaluation},
volume = {46},
number = {4},
publisher = {Springer Verlag},
pages = {721-736},
url = {https://hal.inria.fr/inria-00614819},
pdf = {https://hal.inria.fr/inria-00614819/file/lre12-denis-sagot.pdf},
}
@techreport{sagot_External-Lexical-Information-for_2016,
author = {Sagot, Benoît},
institution = {Inria Paris},
title = {External Lexical Information for Multilingual Part-of-Speech Tagging},
year = {2016},
type = {Research Report},
number = {RR-8924},
url = {https://hal.inria.fr/hal-01330301},
pdf = {https://hal.inria.fr/hal-01330301v3/file/RR-8924.pdf},
}
The current tagset used by MElt is as follows (Crabbé & Candito, 2008):
Tag | Description |
---|---|
ADJ | adjective |
ADJWH | interrogative adjective |
ADV | adverb |
ADVWH | interrogative adverb |
CC | coordinating conjunction |
CLO | object clitic pronoun |
CLR | reflexive clitic pronoun |
CLS | subject clitic pronoun |
CS | subordinating conjunction |
DET | determiner |
DETWH | interrogative determiner |
ET | foreign word |
I | interjection |
NC | common noun |
NPP | proper noun |
P | preposition |
P+D | preposition+determiner amalgam |
P+PRO | preposition+pronoun amalgam |
PONCT | punctuation mark |
PREF | prefix |
PRO | full pronoun |
PROREL | relative pronoun |
PROWH | interrogative pronoun |
V | indicative or conditional verb form |
VIMP | imperative verb form |
VINF | infinitive verb form |
VPP | past participle |
VPR | present participle |
VS | subjunctive verb form |
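The tagset above can be used to sanity-check tagged output. The Python sketch below is not part of MElt: it counts the tags in one Brown-format line and lists tokens whose tag is not in the table. The names `MELT_TAGS` and `tag_report` are hypothetical, and `XYZ` in the example is a made-up tag used only to trigger the check.

```python
from collections import Counter

# The 29 tags listed in the table above.
MELT_TAGS = frozenset(
    """ADJ ADJWH ADV ADVWH CC CLO CLR CLS CS DET DETWH ET I NC NPP
    P P+D P+PRO PONCT PREF PRO PROREL PROWH V VIMP VINF VPP VPR VS""".split()
)


def tag_report(line):
    """Return (tag counts, tokens whose tag is outside MELT_TAGS).

    Hypothetical helper; expects one Brown-format line of word/tag tokens.
    """
    counts = Counter()
    unknown = []
    for token in line.split():
        # Split at the last slash so words containing "/" stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"not a word/tag token: {token!r}")
        counts[tag] += 1
        if tag not in MELT_TAGS:
            unknown.append((word, tag))
    return counts, unknown


if __name__ == "__main__":
    # Illustrative line; XYZ is deliberately not a tag from the table.
    counts, unknown = tag_report("Je/CLS mange/V au/P+D resto/XYZ")
    print(counts)
    print(unknown)
```

A tag reported as unknown is not necessarily an error: it only means the tag is absent from the table above.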
When using normalisation options, other tags may appear: