Need a python module for stemming of text documents

Question

I need a good python module for stemming text documents in the pre-processing stage.

I found this one

http://pypi.python.org/pypi/PyStemmer/1.0.1

but I cannot find the documentation at the link provided.

If anyone knows where to find the documentation, or knows of any other good stemming algorithm, please help.

score 33 · Accepted Answer · edited Aug 21 '17 at 10:57

You may want to try NLTK:

>>> from nltk import PorterStemmer
>>> PorterStemmer().stem('complications')

edited Aug 21 '17 at 10:57

umeshksingla

answered Apr 29 '12 at 03:15

ditkin

Wasn't the PorterStemmer developed in the 1980s? Surely there is a more advanced option? – kalu Feb 15 '14 at 21:19

You are correct that there are other stemmers. In the preview of the [Natural Language Processing with Python section on stemmers](http://www.nltk.org/book3/ch03.html#stemmers), the authors do a simple comparison of Lancaster to Porter and then state: "Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words." – ditkin Feb 15 '14 at 22:23
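
One way to act on that advice is to run the candidate stemmers over a sample of your own vocabulary and look at where they disagree. The sketch below is library-agnostic: `compare_stemmers` and `disagreements` are hypothetical helper names, and the two lambdas are toy stand-ins, not real stemming algorithms.

```python
def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word and stemmer.

    `stemmers` maps a display name to any callable str -> str.
    """
    return {w: {name: stem(w) for name, stem in stemmers.items()} for w in words}


def disagreements(table):
    """Keep only the words on which the stemmers produce different stems."""
    return {w: row for w, row in table.items() if len(set(row.values())) > 1}


# Toy stand-ins (assumptions for illustration only, not Porter or Lancaster).
toy_stemmers = {
    "light": lambda w: w[:-1] if w.endswith("s") else w,
    "aggressive": lambda w: w[:4],
}

table = compare_stemmers(["cats", "running", "dog"], toy_stemmers)
for word, row in disagreements(table).items():
    print(word, row)
```

With NLTK installed, passing `PorterStemmer().stem` and `LancasterStemmer().stem` in place of the toy lambdas should give the real comparison.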

score 8 · Answer 2 · answered Sep 02 '15 at 16:58

All the stemmers discussed here are algorithmic stemmers, so they can always produce unexpected results, such as:

In [3]: from nltk.stem.porter import *

In [4]: stemmer = PorterStemmer()

In [5]: stemmer.stem('identified')
Out[5]: u'identifi'

In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'

To get the correct root words, one needs a dictionary-based stemmer such as the Hunspell stemmer. A Python implementation of it exists (the `hunspell` module used below). Example code:

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']

-1: The object of stemmers is not to find the root word (or lemmatization, which nltk also has a module for), but rather find a shortened version of the word that other inflections would also shorten to. It doesn't matter if the stemmer does not find the root word; as long as `stem('nonsense') == stem('nonsensical') != stem('bananas')`, it's fine. — umop aplsdn, Jul 07 '16 at 00:18
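
The property described in this comment can be checked directly by grouping words by their stem: inflections of one word should land in the same group, and unrelated words in different groups. A minimal sketch, where `group_by_stem` and `toy_stem` are hypothetical names and `toy_stem` is a crude suffix stripper standing in for a real stemmer:

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ical", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words, stem):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["nonsense", "nonsensical", "bananas"], toy_stem)
print(groups)
```

Neither group key needs to be a dictionary word for the grouping to be useful. Swapping NLTK's `PorterStemmer().stem` in for `toy_stem` runs the same check against a real stemmer.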

score 7 · Answer 3 · answered Apr 29 '12 at 06:50

The Python stemming module has implementations of various stemming algorithms such as Porter, Porter2, Paice-Husk, and Lovins: http://pypi.python.org/pypi/stemming/1.0

    >>> from stemming.porter2 import stem
    >>> stem("factionally")
    'faction'

answered Apr 29 '12 at 06:50

shiva

Be aware that this is a pure-Python implementation and is slower at scale than packages such as PyStemmer, which wrap a fast C implementation. – Varun Balupuri Jan 10 '18 at 17:31
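
If a pure-Python stemmer is the only option, one common mitigation is to memoize it, since natural-language text repeats a fairly small set of words very often. This is a sketch of that idea rather than a benchmark; `slow_stem` is a hypothetical placeholder for whichever stemming function is in use.

```python
from functools import lru_cache


def slow_stem(word):
    """Placeholder for a pure-Python stemmer (hypothetical; strips a final 's')."""
    return word[:-1] if word.endswith("s") else word


# Cache results so each distinct word is stemmed only once.
cached_stem = lru_cache(maxsize=None)(slow_stem)

tokens = ["cats", "dogs", "cats", "cats", "dogs"]
stems = [cached_stem(t) for t in tokens]
print(stems)
print(cached_stem.cache_info())
```

How much this helps depends on how repetitive the corpus vocabulary is, so it is worth measuring on real data before relying on it.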

score 3 · Answer 4 · answered Aug 18 '17 at 09:34

The gensim package for topic modelling comes with an implementation of the Porter stemming algorithm:

>>> from gensim import parsing
>>> parsing.stem_text("trying writing nonsense")
'try write nonsens'

The PorterStemmer is the only stemming option implemented in gensim.

As a side note: I can imagine (without further references) that most text-mining modules have their own implementations of simple pre-processing procedures such as Porter stemming, white-space removal, and stop-word removal.
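
The three steps named here (white-space normalization, stop-word removal, stemming) fit in a few lines of plain Python. The sketch below assumes a tiny hand-written stop-word set and takes the stemmer as a parameter; `preprocess`, `STOP_WORDS`, and the default identity stemmer are illustrative choices, not part of gensim or any other package.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(text, stem=lambda w: w):
    """Lowercase, collapse white space, drop stop words, then stem each token."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    tokens = re.findall(r"[a-z]+", normalized)
    return [stem(t) for t in tokens if t not in STOP_WORDS]


result = preprocess("  The  stemming of\ttext   documents ")
print(result)
```

Passing a real stemmer, for example `stem=PorterStemmer().stem` from NLTK, replaces the identity default with actual stemming.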

score 1 · Answer 5 · answered Feb 11 '17 at 18:42

PyStemmer is a Python interface to the Snowball stemming library.

Documentation can be found here:
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt
https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart_python3.txt

answered Feb 11 '17 at 18:42

Brice M. Dempsey
