Stemming unstructured text in NLTK

Question

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

import nltk
from nltk.book import *
f = open('tupac_original.txt', 'rU')
text = f.read()
text1 = text.split()
tup = nltk.Text(text1)
lowtup = [w.lower() for w in tup if w.isalpha()]
import sys, re
tupclean = [w for w in lowtup if not w in nltk.corpus.stopwords.words('english')]
from nltk import stem
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]

The result of the above is:

['like', 'ed', 'young', 'black', 'like'...]

I'm trying to clean up .txt files (all lowercase, remove stopwords, etc), normalize multiple spellings of a word into one and do a frequency dist/count. I know how to do FreqDist, but any suggestions as to where I'm going wrong with the stemming?

Isn't stemming the normalization you are looking for? You say you are having trouble.. what have you tried? — Spaceghost, Sep 26 '13 at 20:22
What is your expected output? depending on what's your task, you might need a lemmatizer instead of a stemmer, see http://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers — alvas, Sep 27 '13 at 07:24

score 12 · Answer 1 · edited Jun 07 '15 at 20:33

NLTK ships with several well-known stemmers (see http://nltk.org/api/nltk.stem.html). The example below compares three of them:

>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens = ['player', 'playa', 'playas', 'pleyaz']
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
[u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is some sort of regex stemmer:

>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
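
A minimal sketch of how the regex stemmer could be used to keep only the "play" variants and count them, rather than printing the stem of every token. The token list, the `target` stem, and the variable names are illustrative assumptions, not taken from the asker's file; in practice `tokens` would be the cleaned list (`tupclean` in the question).

```python
from nltk import FreqDist
from nltk.stem import RegexpStemmer

# Same suffix pattern as the regex stemmer example above.
rxstem = RegexpStemmer('er$|a$|as$|az$')

# Hypothetical cleaned tokens standing in for the question's `tupclean`.
tokens = ['like', 'player', 'young', 'playa', 'black', 'playas', 'play']
target = 'play'

# Keep only the tokens whose stem is exactly the target stem.
variants = [t for t in tokens if rxstem.stem(t) == target]

# Count every stem, then look up the one of interest.
stem_counts = FreqDist(rxstem.stem(t) for t in tokens)

print(variants)
print(stem_counts[target])
```

With this pattern a spelling such as 'pleyaz' stems to 'pley', as the output above shows, so it would not be counted under "play" without an additional rule.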

edited Jun 07 '15 at 20:33

Alexey Grigorev

answered Sep 27 '13 at 07:23

alvas

I edited my question. I tried your regexStem and got multiple tokens back. Not sure where I'm going wrong. – user2221429 Sep 27 '13 at 19:46
Change your last line to `[tupstem.stem(i) for i in tupclean if "pl" in i and "y" in tupstem.stem(i)]`. In linguistics, vowel shifts occur; assuming the diphthong and the onset both remain, the consonant cluster "pl" will also be present in the orthography. – alvas Sep 28 '13 at 04:08
tried this but it didn't really do what i was hoping it would do. thanks anyway! – user2221429 Sep 30 '13 at 16:43
I have nltk installed and can use it in other cases, but I'm getting module import errors on all the above: `>>> from nltk import stem`, `>>> snowball = stem.snowball.EnglishStemmer()`, `>>> [snowball(i) for i in ['Playing', "swimming", "dancing"]]` gives `Traceback (most recent call last): File "", line 1, in TypeError: 'EnglishStemmer' object is not callable` – Mittenchops Nov 25 '13 at 16:28
have you downloaded all the packages when you do `>>> import nltk` and then `>>> nltk.download()`? – alvas Nov 25 '13 at 16:46
nice choice of examples that show interesting corner cases for the nltk stemmers – hobs Feb 08 '14 at 01:58
