
I tried all the NLTK stemming methods, but they give me weird results with some words.

Examples

They often cut the ends of words when they shouldn't:

  • poodle => poodl
  • article => articl

or don't stem very well:

  • easily and easy are not stemmed to the same word
  • leaves, grows, fairly are not stemmed

Do you know of other stemming libraries in Python, or a good dictionary?

Thank you

PeYoTlL
  • These results are not weird, since `stemming` is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form—generally a written word form. For more details, check [here](http://en.wikipedia.org/wiki/Stemming) – eliasah Jul 09 '14 at 07:53
  • btw NLTK is the best platform for building Python programs to work with human language data. – eliasah Jul 09 '14 at 07:56
  • You probably ask for a stemmer for English language only, right? – dzieciou Oct 04 '19 at 09:28

7 Answers


The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.

Here are some examples using the PorterStemmer:

>>> import nltk
>>> ps = nltk.stem.PorterStemmer()
>>> ps.stem('grows')
'grow'
>>> ps.stem('leaves')
'leav'
>>> ps.stem('fairly')
'fairli'

The results are 'grow', 'leav' and 'fairli' which, even if they aren't what you wanted, are stemmed versions of the original words.

If we switch to the Snowball stemmer, we have to provide the language as a parameter.

>>> import nltk
>>> sno = nltk.stem.SnowballStemmer('english')
>>> sno.stem('grows')
'grow'
>>> sno.stem('leaves')
'leav'
>>> sno.stem('fairly')
'fair'

The results are as before for 'grows' and 'leaves', but 'fairly' is stemmed to 'fair'.

So in both cases (and there are more than two stemmers available in nltk), words that you say are not stemmed, in fact, are. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
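The Lancaster behaviour is easy to check the same way (a quick sketch, assuming nltk is installed):

```python
# LancasterStemmer is the most aggressive of the common NLTK stemmers.
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
# The answer above reports 'easy' for both of these inputs.
print(lancaster.stem('easily'))
print(lancaster.stem('easy'))
```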

Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.

>>> import nltk
>>> lemma = nltk.stem.WordNetLemmatizer()
>>> lemma.lemmatize('article')
'article'
>>> lemma.lemmatize('leaves')
'leaf'
Spaceghost
  • This should be the selected answer. – o-90 May 12 '17 at 18:31
  • 6
    Difference b/w lemmantizer and stemmer: https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming – Vipul Jain Nov 16 '17 at 11:51
  • 5
    One thing to add: The lemmatizer produces better results when paired with a POS tagger; the default POS it tries to match for is nouns (try the lemmatizer with the word "ate"). – Michael Mar 28 '18 at 18:22
  • lemmatizer.lemmatize('randomly') # output 'randomly', stemmer.stem('randomly') # output 'random'. You can't win. ( nltk lemmatizer, stem from stemmer package) – jason Feb 12 '21 at 08:05
  • 2
    For the first example, why is it called `stemmer`? It didn't work for me but `stem` did. – Shayan Nov 18 '21 at 13:15

All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results, such as:

In [3]: from nltk.stem.porter import *

In [4]: stemmer = PorterStemmer()

In [5]: stemmer.stem('identified')
Out[5]: u'identifi'

In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'
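To see why, here is a toy illustration of the idea behind algorithmic stemmers (a deliberately naive sketch, not Porter's actual rules): suffixes are stripped by pattern, with no dictionary check on the result.

```python
# Naive suffix-stripping "stemmer": it chops recognised endings blindly,
# so it happily produces non-words, just as Porter does above.
SUFFIXES = ('ical', 'ied', 'ed', 'ing', 's')

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when enough of the word would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(naive_stem('identified'))   # 'identif' -- not a dictionary word
print(naive_stem('nonsensical'))  # 'nonsens' -- not a dictionary word
```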

To get the root words correctly, you need a dictionary-based stemmer, such as the Hunspell stemmer. A Python binding for it is available (the `hunspell` package used below); example code:

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
0xF

Stemmers vary in their aggressiveness. Porter is one of the most aggressive stemmers for English; I find it usually hurts more than it helps. On the lighter side, you can either use a lemmatizer instead, as already suggested, or a lighter algorithmic stemmer. The limitation of lemmatizers is that they cannot handle unknown words.

Personally I like the Krovetz stemmer, which is a hybrid solution combining a dictionary lemmatizer with a lightweight stemmer for out-of-vocabulary words. Krovetz is also available as the kstem or light_stemmer option in Elasticsearch. There is a Python implementation on PyPI (https://pypi.org/project/KrovetzStemmer/), though that is not the one I have used.

Another option is the lemmatizer in spaCy. After processing with spaCy, every token has a lemma_ attribute (note the underscore: lemma without it holds a numerical identifier of the lemma) - https://spacy.io/api/token

Here are some papers comparing various stemming algorithms:

Daniel Mahler

Stemming is all about removing suffixes (usually only suffixes; as far as I have tried, none of the NLTK stemmers could remove a prefix, let alone infixes). So we can call stemming a dumb, not-so-intelligent program. It doesn't check whether a word has a meaning before or after stemming. For example, if you try to stem "xqaing", which is not a word, it will still remove "-ing" and give you "xqa".

So, in order to use a smarter system, one can use a lemmatizer. Lemmatizers use well-formed lemmas (words), in the form of WordNet and dictionaries, so they always take and return a proper word. However, they are slow, because they have to go through all the words to find the relevant one.
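The dictionary-lookup idea can be sketched in a few lines (a toy illustration with a hand-made table, not how WordNet actually stores lemmas):

```python
# Toy dictionary-based lemmatizer: only words present in the table
# are mapped; everything else passes through unchanged.
LEMMA_TABLE = {
    'leaves': 'leaf',
    'grows': 'grow',
    'easily': 'easy',
    'ate': 'eat',
}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(toy_lemmatize('leaves'))  # 'leaf' -- a real word, unlike 'leav'
print(toy_lemmatize('xqaing'))  # 'xqaing' -- unknown words are untouched
```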

Ritveak

There are already very good answers to this question, but I wanted to add some information that I think might be useful. In my research I found a page which gives great details about stemming and lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.

To give a short summary, here are some insights from that page:

Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is -> be
car, cars, car's, cars' -> car

The result of this mapping of text will be something like:
the boy's cars are different colors -> the boy car be differ color

Also, the nltk package has been updated, and you can import WordNetLemmatizer with from nltk.stem import WordNetLemmatizer. The lemmatizer requires a corpus to be downloaded before use; the command below works well with version 3.6.1.

import nltk

nltk.download("wordnet")
abdullahselek

In my chatbot project I have used PorterStemmer; however, LancasterStemmer also serves the purpose. The ultimate objective is to stem each word to its root so that we can search and compare it with the search-word inputs.

For example:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

def SrchpattrnStmmed(self):
    # Tokenize the input, drop stop words, and stem what remains.
    # stop_words is assumed to be defined elsewhere in the class/module.
    KeyWords = []
    SrchpattrnTkn = word_tokenize(self.input)
    for token in SrchpattrnTkn:
        if token not in stop_words:
            KeyWords.append(ps.stem(token))
    return KeyWords

Hope this helps.


Python implementations of the Porter, Porter2, Paice-Husk, and Lovins stemming algorithms for English are available in the stemming package.

Stephen Lin
  • It seems that they use different algorithms. I will try it, thanks! – PeYoTlL Jul 09 '14 at 08:05
  • 2
    Note that stemming is a pure Python implementation and will not be as quick as PyStemmer which is a wrapper around a c library and also available in PyPi. – Spaceghost Jul 09 '14 at 20:17