nltk stemmer: string index out of range

Question

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.

However, when I stem the documents inside the Django view, PorterStemmer().stem() raises IndexError: string index out of range for the string 'oed'. For example, this view:

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

raises the mentioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside Django (be it in a separate Python file or an interactive Python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

Are you using Python 2? It might just be a character set difference, though I'm only guessing. – alexis, Jan 07 '17 at 08:42
What version of NLTK are you using? You can check it with `nltk.__version__` once you have imported it. Maybe you are using two different versions for Django and for the external Python. Could you also check which Python version you use in Django and which one runs the external script? I suppose it's `2.7` in both cases, given the `print` statement. – Kurt Bourbaki, Jan 07 '17 at 10:28
Almost unrelated to the issue: `s = PorterStemmer()` should be defined at module level, alongside your other global variables. Putting it in the view means creating a new `PorterStemmer` object for every page load that hits this view function. – alvas, Jan 07 '17 at 11:27
Within `get_results`, can you set `x = 'oed'`, then `print x`, and see what you get on the console where you run `python manage.py runserver`? I suspect it's Django swallowing words. – alvas, Jan 07 '17 at 11:30
Also, try adding `# coding: utf-8` as the first line of your `views.py`, followed by `from __future__ import unicode_literals`. The Django and NLTK versions should also be reported in the OP as well as in the GitHub issue. – alvas, Jan 07 '17 at 11:33
Somehow this also seems to be the case in http://stackoverflow.com/questions/41503127/python-wordnet-nltk-keyerror, where Django gobbles up some `str` or `char` =( – alvas, Jan 07 '17 at 11:35
@KurtBourbaki It turns out I was using two different versions of nltk. I was using version 3.2.2 in my Django project's virtual environment `//anaconda/envs/xkcd/bin/`, but I had been running test.py using ipython, not python as stated above. The ipython installation was defined in my root environment `//anaconda/bin/ipython`, which must have given it access to the nltk version installed in my root environment (version 3.2.0). I downgraded my virtual environment's nltk to version 3.2.0 and ran the code successfully in the Django app. Does this mean it is an issue with nltk 3.2.2? – jkarimi, Jan 07 '17 at 18:32
@KurtBourbaki Also, any ideas as to why I was able to access the ipython installation from my root environment despite having activated a project environment that did not have ipython? – jkarimi, Jan 07 '17 at 18:33
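
The version mismatch worked out in the comments above can be checked from inside each environment. The following is only a sketch: `describe_module` is a hypothetical helper name, not part of NLTK or Django. It reports which interpreter is running and which copy of a module that interpreter imports. Running it once from the Django view and once from the console where the script works should show whether the two use different NLTK installations.

```python
from __future__ import print_function

import importlib
import sys


def describe_module(module_name):
    """Return (interpreter path, module version, module file) for module_name.

    Hypothetical helper for illustration. Version and file are None when the
    module cannot be imported or does not expose the attribute.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return sys.executable, None, None
    return (
        sys.executable,
        getattr(module, "__version__", None),
        getattr(module, "__file__", None),
    )


if __name__ == "__main__":
    executable, version, path = describe_module("nltk")
    print("interpreter:", executable)
    print("nltk version:", version)
    print("nltk location:", path)
```

If the interpreter paths or the nltk locations differ between the two runs, the two environments are not using the same installation.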

Mark Amery · Accepted Answer · 2017-09-04T19:08:47.033

31

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running

pip install -U nltk

edited Sep 04 '17 at 19:08

answered Jan 07 '17 at 20:45


As it stands, this answer is at +20 and so I've effectively received 200 Stack Overflow rep as a reward for breaking an open source library. I feel rather guilty. – Mark Amery May 10 '17 at 11:11
Don't be guilty, this is one way to incentivize OSS =) – alvas May 13 '17 at 03:13

Kurt Bourbaki · Answer 2 · 2017-01-07T19:41:27.283

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):      
        return False        
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I just proposed this change in the related NLTK issue.
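
The failing comparison and the effect of the length guard can be reproduced without NLTK. This is a minimal sketch, not NLTK's code: the two function names are made up for illustration, and the consonant check is deliberately left out so that only the indexing behaviour is shown.

```python
from __future__ import print_function


def ends_double_unguarded(word):
    # Only the comparison shown in the traceback; the consonant check is omitted.
    return word[-1] == word[-2]


def ends_double_guarded(word):
    # Same comparison, preceded by the length check from the proposed fix.
    if len(word) < 2:
        return False
    return word[-1] == word[-2]


if __name__ == "__main__":
    try:
        ends_double_unguarded(u"o")
    except IndexError as exc:
        print("unguarded:", exc)
    print("guarded:", ends_double_guarded(u"o"))
```

For a one-character string such as u'o', `word[-2]` has nothing to index, which is why the unguarded version raises the IndexError and the guarded one returns False.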
