Lemmatization of non-English words?

Question

I would like to apply lemmatization to reduce inflected word forms to their base forms. I know that WordNet provides this functionality for English, but I am also interested in lemmatizing Dutch, French, Spanish and Italian words. Is there a reliable, established way to do this? Thank you!

See also https://stackoverflow.com/questions/13131139/lemmatize-french-text?rq=1 — DNA, Mar 03 '14 at 10:39
The responses on the cited question discuss French stemmers but not lemmatizers — duhaime, Dec 09 '14 at 14:48

score 11 · Accepted Answer · edited Jun 08 '16 at 09:20

Try the Pattern library from CLIPS; it supports German, English, Spanish, French and Italian. Just what you needed: http://www.clips.ua.ac.be/pattern

Unfortunately, it only works with Python 2; Python 3 is not supported yet.

edited Jun 08 '16 at 09:20 · Toni Piza

answered Mar 03 '14 at 12:24 · alvas

Thanks, that's perfect! Just what I was looking for! – Crista23 Mar 05 '14 at 20:29
Is any library known for the Finnish language? – Sarang Manjrekar Jul 25 '18 at 12:50
Try https://github.com/flammie/omorfi and http://morfessor.readthedocs.io/en/latest/ – alvas Jul 25 '18 at 15:29

Gerardo Orellana · Answer 2 · 2017-12-13T16:50:13.010

The textacy library (http://textacy.readthedocs.io/en/latest/api_reference.html) provides tools for building a bag of words or a bag of terms, with lemmatization available as one of the options. I've tried it with Spanish and it works quite well.

doc.to_bag_of_terms(ngrams=2, named_entities=True, lemmatize=True, as_strings=True)

The library automatically detects the language of the text and lemmatizes accordingly. However, you can also specify the language explicitly with the lang argument:

import textacy
text = 'Los gatos y los perros juegan juntos en el patio de su casa'
doc = textacy.Doc(text, lang='es')
print(doc.to_bag_of_words(normalize='lemma', as_strings=True))

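You'll get output like the following: {'perro': 1, 'y': 1, 'gato': 1, 'jugar': 1, 'casar': 1, 'Los': 1, 'patio': 1}

The library recognizes some of the words well, but the lemmas are not perfect: in the output above, the noun 'casa' (house) was mapped to the verb 'casar' (to marry), and the capitalized 'Los' was left unlemmatized. Hope this helps.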
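To make clearer what a lemma-normalized bag of words is, here is a minimal standard-library sketch. It is not how textacy works internally: the LEMMAS table and the bag_of_lemmas function are hypothetical names invented for this illustration, and the table is hand-written for this one sentence. A real lemmatizer relies on a full lexicon and part-of-speech information, which is exactly where mistakes such as 'casa' becoming 'casar' can creep in.

```python
from collections import Counter

# Hypothetical, hand-written lemma table (illustration only).
# A real lemmatizer uses a large lexicon plus part-of-speech tags.
LEMMAS = {
    "los": "el",
    "gatos": "gato",
    "perros": "perro",
    "juegan": "jugar",
    "juntos": "junto",
}


def bag_of_lemmas(text, lemmas=LEMMAS):
    """Count lemmas in whitespace-tokenized text.

    Tokens are lowercased first; tokens missing from the table
    are counted under their lowercased surface form.
    """
    counts = Counter()
    for token in text.split():
        word = token.lower()
        counts[lemmas.get(word, word)] += 1
    return dict(counts)


text = "Los gatos y los perros juegan juntos en el patio de su casa"
bag = bag_of_lemmas(text)
print(bag)
```

Lowercasing before the lookup is what lets 'Los' and 'los' fall under the same entry, and falling back to the surface form keeps unknown words such as 'casa' unchanged instead of guessing a lemma for them.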

It would be useful if you explained a bit more how the library can be used for non-English languages and show some example output. — vpekar, Dec 13 '17 at 16:33
also, the link provided http://textacy.readthedocs.io/en/latest/api_reference.html doesn't give me access — Way Too Simple, Jul 29 '19 at 22:17
