29

I have some text in French that I need to process in some ways. For that, I need to:

  • First, tokenize the text into words
  • Then lemmatize those words to avoid processing the same root more than once

As far as I can see, NLTK's WordNet lemmatizer only works with English. I want something that returns "vouloir" when I give it "voudrais", and so on. I also cannot tokenize properly because of the apostrophes. Any pointers would be greatly appreciated. :)

yelsayed

5 Answers

25

The best solution I found is spaCy; it seems to do the job.

To install:

pip3 install spacy
python3 -m spacy download fr_core_news_md

To use:

import spacy
nlp = spacy.load('fr_core_news_md')

doc = nlp(u"voudrais non animaux yeux dors couvre.")
for token in doc:
    print(token, token.lemma_)

Result:

voudrais vouloir
non non
animaux animal
yeux oeil
dors dor
couvre couvrir

Check out the documentation for more details: https://spacy.io/models/fr and https://spacy.io/usage

cakraww
karimsaieh
19

Here's an old but relevant comment by an NLTK dev. It looks like the more advanced stemmers in NLTK are all English-specific:

The nltk.stem module currently contains 3 stemmers: the Porter stemmer, the Lancaster stemmer, and a Regular-Expression based stemmer. The Porter stemmer and Lancaster stemmer are both English-specific. The regular-expression based stemmer can be customized to use any regular expression you wish. So you should be able to write a simple stemmer for non-English languages using the regexp stemmer. For example, for french:

from nltk import stem
stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')

But you'd need to come up with the language-specific regular expression yourself. For a more advanced stemmer, it would probably be necessary to add a new module. (This might be a good student project.)

For more information on the regexp stemmer:

http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html

-Edward

Note: the link he gives is dead; see the current RegexpStemmer documentation instead.
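For reference, here is a minimal sketch of the same idea with the class name used in current NLTK releases, nltk.stem.RegexpStemmer. The suffix list is just the illustrative one from the quote (minus the "<etc>" placeholder), and the min=4 cutoff is my own assumption to keep very short words untouched; neither is a serious French rule set.

```python
from nltk.stem import RegexpStemmer

# Suffixes copied from the quoted example; purely illustrative, not a
# complete or linguistically sound list for French.
FRENCH_SUFFIXES = r"s$|es$|era$|erez$|ions$"

# min=4 (an assumption): words shorter than 4 characters are returned as-is.
stemmer = RegexpStemmer(FRENCH_SUFFIXES, min=4)

words = ["chats", "tables", "parlera", "mangerez", "parlions", "les"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

This simply deletes whichever suffix matches at the end of the word, so it is only as good as the regular expression you write.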

The more recently added snowball stemmer appears to be able to stem French though. Let's put it to the test:

>>> from nltk.stem.snowball import FrenchStemmer
>>> stemmer = FrenchStemmer()
>>> stemmer.stem('voudrais')
u'voudr'
>>> stemmer.stem('animaux')
u'animal'
>>> stemmer.stem('yeux')
u'yeux'
>>> stemmer.stem('dors')
u'dor'
>>> stemmer.stem('couvre')
u'couvr'

As you can see, some results are a bit dubious.

Not quite what you were hoping for, but I guess it's a start.

Junuxx
  • yea it's disappointing there's no stemmer for non-english languages. what I ended up doing actually is that I tokenized the words on punctuation, then I removed all residual one-letter articles (such as the remaining l in "l'ensemble" for example). I then used a listing of words and corresponding lemmata, specifically the one hosted at http://www.limsi.fr/Individu/anne/OLDlexique.txt, which was referenced by several posts online, it did the trick. The snowball stemmer looks like it's working too, thanks Junuxx. :) – yelsayed Nov 06 '12 at 00:35
  • My French is not too good, but I'm unclear what you're expecting here. It is stemming the words, afaict; it is not lemmatizing them, which is a different task. That is, it's returning the stem, not the form that you would find in a dictionary (hence the lack of infinitival suffixes on verbs). That is by design; that is what a stemmer does. – Mike Maxwell Jan 24 '20 at 22:28
  • Yes that's a good distinction to make. Nevertheless I'd say animaux -> anima, voudrais -> vou might make more sense than the above output. Completely irregular plurals like yeux are tricky; I guess one has to grudgingly accept it as a stem. – Junuxx Jan 24 '20 at 23:14
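The approach described in the first comment above (split on punctuation, drop leftover one-letter pieces such as the "l" in "l'ensemble", then look each word up in a list of forms and lemmata) could be sketched roughly as follows. The tiny LEXICON dictionary and the function names are hypothetical stand-ins; a real version would load a full word list, whose file format is not shown here.

```python
import re

# Hypothetical miniature lexicon (inflected form -> lemma). A real one would
# be loaded from a full word list such as the one mentioned in the comment.
LEXICON = {
    "voudrais": "vouloir",
    "animaux": "animal",
    "yeux": "oeil",
}

# \w+ keeps runs of letters and digits, so apostrophes and other punctuation
# act as separators: "l'ensemble" becomes "l" and "ensemble".
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split on punctuation and drop residual one-letter tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    # Caveat: this also drops real one-letter words such as "a" or "y".
    return [tok for tok in tokens if len(tok) > 1]


def lemmatize(tokens, lexicon=LEXICON):
    """Replace each token by its lemma when known, else keep it unchanged."""
    return [lexicon.get(tok, tok) for tok in tokens]


tokens = tokenize("Je voudrais l'ensemble des animaux.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

Unknown words fall through unchanged, so the quality depends entirely on the coverage of the word list.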
2

Maybe TreeTagger? I haven't tried it, but this tool can work with French.

http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
http://txm.sourceforge.net/installtreetagger_fr.html

Marek Grzenkowicz
Klemm
  • gosh, treetaggers give unsupervised lemmas, i would advise to stay away from it if possible. – alvas Mar 03 '14 at 12:22
  • Can I please know how do you use treetagger for stemming the words? from what I understood with treetagger we can just pos tag words. – sel Sep 22 '15 at 15:07
1

If you are applying machine learning algorithms to your text, you can use character n-grams instead of word tokens. This is not strictly lemmatization, but it picks up sequences of n letters shared between words, and it is surprisingly powerful at grouping words with the same meaning.

I use sklearn's CountVectorizer(analyzer='char_wb'), and for some specific texts it works much better than a bag of words.
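As a rough illustration of why this helps, the sketch below counts character trigrams for three short made-up phrases. The ngram_range=(3, 3) setting and the sample phrases are my own assumptions for the demo, not something the answer prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample phrases: two forms of "vouloir" and one unrelated phrase.
docs = ["je voudrais", "nous voulons", "les animaux"]

# char_wb builds character n-grams inside word boundaries (words are padded
# with a space on each side); ngram_range=(3, 3) is an arbitrary choice here.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index in X.
vou_col = vectorizer.vocabulary_["vou"]
vou_counts = [int(X[row, vou_col]) for row in range(len(docs))]

# The trigram "vou" occurs in the first two phrases but not in the third,
# so "voudrais" and "voulons" share a feature without any lemmatization.
print(vou_counts)
```

The shared n-grams are what let a downstream model treat related word forms as similar.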

O'Neil
Brice
1

If you are doing a text mining project in a French bank, I recommend the package cltk.

pip install cltk
from cltk.lemmatize.french.lemma import LemmaReplacer

More details are in the cltk documentation.

Z.LI
  • The cltk appears to be for French up until the 14th century, no? Am I misreading its documentation? Surely the spelling of French has changed (some) since then, no? And there are new words? Like the infamous 'weekend'... – Mike Maxwell Jan 24 '20 at 22:49
  • Exactly, it is written french, but it's for Old French. (Contributor of CLTK here). – clemsciences Apr 17 '20 at 20:21