
MElt

Statistical part-of-speech tagger

Description

MElt is a freely available (LGPL) state-of-the-art sequence labeller that is meant to be trained on both an annotated corpus and an external lexicon. It was initially developed by Pascal Denis and Benoît Sagot. Recent evolutions have been carried out by Benoît Sagot. MElt allows for the use of multiclass Maximum-Entropy Markov models (MEMMs) or multiclass perceptrons (multitrons) as underlying statistical devices. Its output is in the Brown format (one sentence per line, each sentence being a space-separated sequence of annotated words in the word/tag format).
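To illustrate the Brown format described above, here is a minimal Python sketch that reads one line of tagger output back into (word, tag) pairs. The function name `parse_brown_line` and the sample sentence are illustrative and not part of MElt. The sketch also assumes that the tag follows the last slash of each token, so that a word containing a slash stays intact.

```python
def parse_brown_line(line):
    """Split one Brown-format sentence into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash of the token, so a
        # word that itself contains a slash (e.g. "et/ou") is kept whole.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"not a word/tag token: {token!r}")
        pairs.append((word, tag))
    return pairs


# Illustrative sentence, tagged with labels from the tagset listed below.
sentence = "Le/DET chat/NC dort/V ./PONCT"
for word, tag in parse_brown_line(sentence):
    print(f"{word}\t{tag}")
```

Splitting on the last slash is a design choice for this sketch; check it against your own data if your tokens or tags can contain slashes.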

MElt has been trained on various annotated corpora, using for instance Alexina lexicons as a source of lexical information.

MElt also includes a normalisation wrapper that helps process noisy text, such as user-generated data retrieved from the web. This wrapper is only available for French and English. It was used to parse web data in both languages: for English, during the 2012 SANCL shared task (Google Web Bank), and for French, when developing the French Social Media Bank (Facebook, Twitter and blog data).

You can retrain MElt on your own data, provided you put it in the Brown format, using the MElt-train script. You have to provide an external lexicon file, but it can be an empty file if you don’t want to use external lexical information.
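As a rough sketch of the data preparation step, the Python snippet below serialises tagged sentences into Brown format and creates an empty lexicon file. The file names `train.brown` and `empty.lex` and both helper functions are hypothetical. The snippet does not call MElt-train itself, whose arguments are not described here.

```python
import tempfile
from pathlib import Path


def to_brown_line(tagged_sentence):
    """Serialise [(word, tag), ...] as one Brown-format line."""
    for word, tag in tagged_sentence:
        # Whitespace inside a word or tag would break the
        # space-separated token layout, so reject it early.
        if not word or not tag or any(c.isspace() for c in word + tag):
            raise ValueError(f"cannot serialise {word!r}/{tag!r}")
    return " ".join(f"{word}/{tag}" for word, tag in tagged_sentence)


def write_training_files(sentences, directory):
    """Write a Brown-format corpus and an empty lexicon file."""
    directory = Path(directory)
    corpus = directory / "train.brown"  # hypothetical file name
    lexicon = directory / "empty.lex"   # hypothetical file name
    lines = [to_brown_line(s) + "\n" for s in sentences]
    corpus.write_text("".join(lines), encoding="utf-8")
    lexicon.write_text("", encoding="utf-8")  # no external lexical information
    return corpus, lexicon


# Illustrative sentences, one list of (word, tag) pairs per sentence.
sentences = [
    [("Le", "DET"), ("chat", "NC"), ("dort", "V"), (".", "PONCT")],
    [("Il", "CLS"), ("pleut", "V"), (".", "PONCT")],
]
with tempfile.TemporaryDirectory() as tmp:
    corpus, lexicon = write_training_files(sentences, tmp)
    print(corpus.read_text(encoding="utf-8"), end="")
```

The two resulting files are the kind of corpus and lexicon inputs that the retraining step expects.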

Download

The latest version of MElt can be downloaded from its GitLab repository.

MElt is distributed under a GNU LGPLv3.0 licence.

Publication(s)

If you use this work, please cite:

Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Pascal Denis and Benoît Sagot. 2012. Language Resources and Evaluation. 46(4). Springer Verlag. 721-736.
@article{denis_Coupling-an-annotated-corpus_2012,
 author = {Denis, Pascal and Sagot, Benoît},
 doi = {10.1007/s10579-012-9193-0},
 title = {Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging},
 year = {2012},
 journal = {Language Resources and Evaluation},
 volume = {46},
 number = {4},
 publisher = {Springer Verlag},
 pages = {721-736},
 url = {https://hal.inria.fr/inria-00614819},
 pdf = {https://hal.inria.fr/inria-00614819/file/lre12-denis-sagot.pdf},
}

External Lexical Information for Multilingual Part-of-Speech Tagging

Benoît Sagot. 2016. Research Report. RR-8924. Inria Paris.
@techreport{sagot_External-Lexical-Information-for_2016,
 author = {Sagot, Benoît},
 institution = {Inria Paris},
 title = {External Lexical Information for Multilingual Part-of-Speech Tagging},
 year = {2016},
 type = {Research Report},
 number = {RR-8924},
 url = {https://hal.inria.fr/hal-01330301},
 pdf = {https://hal.inria.fr/hal-01330301v3/file/RR-8924.pdf},
}

Tagset

The current tagset used by MElt is as follows (Crabbé & Candito, 2008):

Tag Description
ADJ adjective
ADJWH interrogative adjective
ADV adverb
ADVWH interrogative adverb
CC coordinating conjunction
CLO object clitic pronoun
CLR reflexive clitic pronoun
CLS subject clitic pronoun
CS subordinating conjunction
DET determiner
DETWH interrogative determiner
ET foreign word
I interjection
NC common noun
NPP proper noun
P preposition
P+D preposition+determiner amalgam
P+PRO preposition+pronoun amalgam
PONCT punctuation mark
PREF prefix
PRO full pronoun
PROREL relative pronoun
PROWH interrogative pronoun
V indicative or conditional verb form
VIMP imperative verb form
VINF infinitive verb form
VPP past participle
VPR present participle
VS subjunctive verb form

When using normalisation options, other tags may appear:

  • when using -n, Y means "non-last token of a multi-token unit", X means "multiword/multitag token"
  • when using -N, Y means "non-last token of a multi-token unit", multiword/multitag tokens are annotated with tags of the form T1+T2+...+Tn (e.g. chépa/CLS+V+ADV)
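When post-processing output produced with -N, a composite tag such as CLS+V+ADV can be split back into its component tags. The Python sketch below shows one way to do this, assuming that the amalgam tags P+D and P+PRO from the table above must be kept whole. The helper name `split_composite_tag` is hypothetical and not part of MElt.

```python
# Tags from the tagset that legitimately contain "+" and must not be split.
AMALGAM_TAGS = {"P+D", "P+PRO"}


def split_composite_tag(tag):
    """Return the component tags of a T1+T2+...+Tn composite tag.

    Assumption: a tag that exactly matches an amalgam tag is a single
    tag. A composite that merely contains an amalgam (for instance
    "P+D+NC") is ambiguous and is split on every "+" by this sketch.
    """
    if tag in AMALGAM_TAGS:
        return [tag]
    return tag.split("+")


print(split_composite_tag("CLS+V+ADV"))  # ['CLS', 'V', 'ADV']
print(split_composite_tag("P+D"))        # ['P+D']
print(split_composite_tag("NC"))         # ['NC']
```

Handling amalgams inside longer composites would need a rule chosen for your own data.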

Contact

For more information or if you have any questions, please contact Benoît Sagot

Benoit.Sagot[at]inria.fr