
For a keyword extractor, I need to map all related words to a common root word.

How can I convert words to the same root using the Python NLTK lemmatizer?

  • E.g.:
    1. generalized, generalization -> general
    2. optimal, optimized -> optimize (maybe)
    3. configure, configuration, configured -> configure

The Python NLTK lemmatizer gives 'generalize' for 'generalized' and 'generalizing' when the part-of-speech (pos) tag parameter is used, but it does not do so for 'generalization'.

Is there a way to do this?

Shanika Ediriweera

1 Answer


Use SnowballStemmer:

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("generalized"))
general
>>> print(stemmer.stem("generalization"))
general
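For the keyword-extractor use case in the question, one possible way to apply this is to group words by their Snowball stem. This is only a minimal sketch: the helper name `group_by_stem` and the sample keyword list are made up for illustration, and the stem is used purely as a grouping key (it need not be a real word).

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer


def group_by_stem(words, language="english"):
    """Group words that share the same Snowball stem (hypothetical helper)."""
    stemmer = SnowballStemmer(language)
    groups = defaultdict(list)
    for word in words:
        # The stem acts as the shared key; original words are kept as values.
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


keywords = ["generalized", "generalization",
            "configure", "configuration", "configured"]
groups = group_by_stem(keywords)

# 'generalized' and 'generalization' both stem to 'general', as shown above.
print(groups["general"])
```

Because the grouping key is a stem rather than a dictionary word, a readable label for each group (for example its shortest member) would have to be chosen separately.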

Note: Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

A general issue I have seen with lemmatizers is that they often return longer, derived words as lemmas instead of reducing them to a shorter root.

Example, using the WordNet lemmatizer (checked in NLTK):

  • Generalized => Generalize
  • Generalization => Generalization
  • Generalizations => Generalization

No POS tag was given as input in the cases above, so each word was treated as a noun (the default).

Ani Menon
  • Stemming isn't really lemmatization, though ;P Take a look at http://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers – alvas Sep 03 '16 at 06:44
  • @ani-menon I previously used the Porter stemmer. What is the difference between the Porter stemmer and the Snowball stemmer? – Shanika Ediriweera Sep 03 '16 at 07:08
  • @Ani I was going to move to a lemmatizer to get better results than the Porter stemmer, so that all similar keywords map to the same (meaningful) root word. Isn't that possible with the lemmatizer? – Shanika Ediriweera Sep 03 '16 at 07:14
  • @ShanikaEdiriweera Porter is an old algorithm (1980) that was modified later; Snowball is newer. – Ani Menon Sep 03 '16 at 15:33
  • @ShanikaEdiriweera By definition, yes, lemmatizing is supposed to give you the best results (that is, when the POS tag is also provided as input), but most lemmatizers are not good enough, so you may have to write one for your use case. – Ani Menon Sep 03 '16 at 15:50
  • @Ani Thanks. How do I write a lemmatizer? Any tips or guides? – Shanika Ediriweera Sep 03 '16 at 17:56
  • Use or create a dictionary of the words you need. Write code that uses stemming to identify the stems of the given inputs, then use the dictionary to find the appropriate *lemma* words (i.e. the smallest form of the word) for them (POS is important). Also, for example, "Lawyers" => "Lawyer" is correct, but if you just try to find the smallest word you would get "Law". So find the smallest word with the correct meaning. I have just given you a brief idea of how it should work; there are a lot more things you may need to consider. – Ani Menon Sep 03 '16 at 18:28
  • But the problem is that the stemmer converts bullying to bulli, whereas the lemmatizer converts bullying to bully. On the other hand, the lemmatizer converts prevention to prevention, with no change at all, while the stemmer gets it right, i.e. it converts prevention to prevent. How can we get the word root perfectly? Any idea? I've used the lemmatizer with pos='v' as well. – InsParbo May 25 '18 at 06:10
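
The dictionary-plus-stemming idea from the comments might be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_stem` is a deliberately tiny stand-in stemmer (a real one such as NLTK's `SnowballStemmer("english").stem` could be passed in instead), and `build_root_map`, `find_root`, and the vocabulary are hypothetical. POS handling, which the comments call important, is left out.

```python
def toy_stem(word):
    """Stand-in stemmer (hypothetical): strips a few suffixes, longest first."""
    word = word.lower()
    for suffix in ("izations", "ization", "ized", "izing", "ize", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_root_map(vocabulary, stem=toy_stem):
    """Map each stem to the shortest dictionary word that produces it."""
    roots = {}
    for word in vocabulary:
        candidate = word.lower()
        key = stem(candidate)
        best = roots.get(key)
        # Prefer the shortest word; break ties alphabetically.
        if best is None or (len(candidate), candidate) < (len(best), best):
            roots[key] = candidate
    return roots


def find_root(word, roots, stem=toy_stem):
    """Return the dictionary root for a word, or the word itself if unknown."""
    return roots.get(stem(word), word.lower())


# A hand-made dictionary of accepted words (made up for this example).
vocabulary = ["general", "generalize", "generalization", "lawyer", "law"]
roots = build_root_map(vocabulary)

print(find_root("Generalizations", roots))  # general
print(find_root("generalized", roots))      # general
print(find_root("lawyers", roots))          # lawyer
```

Because the lookup goes through the stem rather than searching for the shortest matching prefix, "lawyers" resolves to "lawyer" and not "law", which is the pitfall described in the comment above.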