
I've been working with NLTK on database classification. I'm having a problem with stop word removal. When I print the list of stop words, every word is listed with "u'" before it, for example: [u'all', u'just', u'being', u'over', u'both', u'through']. I'm not sure if this is normal or part of the issue.

When I print feats_1, I get a list of words, some of which are the stop words listed in the corpus.

import os
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords

stopset = list(set(stopwords.words('english')))
morewords = 'delivery', 'shipment', 'only', 'copy', 'attach', 'material'
stopset.append(morewords)

def word_feats(words):
    return dict([(word, True) for word in words.split() if word not in stopset])

ids_1 = {}
ids_2 = {}
ids_3 = {}
ids_4 = {}
ids_5 = {}
ids_6 = {}
ids_7 = {}
ids_8 = {}
ids_9 = {}

path1 = "/Users/myname/Documents/Data Classifier Files/1/"
for name in os.listdir(path1):
    if name[-4:] == '.txt':
        f = open(path1 + "/" + name, "r")
        ids_1[name] = f.read()
        f.close()    

path2 = "/Users/myname/Documents/Data Classifier Files/2/"
for name in os.listdir(path2):
    if name[-4:] == '.txt':
        f = open(path2 + "/" + name, "r")
        ids_2[name] = f.read()
        f.close()    

path3 = "/Users/myname/Documents/Data Classifier Files/3/"
for name in os.listdir(path3):
    if name[-4:] == '.txt':
        f = open(path3 + "/" + name, "r")
        ids_3[name] = f.read()
        f.close()    

path4 = "/Users/myname/Documents/Data Classifier Files/4/"
for name in os.listdir(path4):
    if name[-4:] == '.txt':
        f = open(path4 + "/" + name, "r")
        ids_4[name] = f.read()
        f.close()   

path5 = "/Users/myname/Documents/Data Classifier Files/5/"
for name in os.listdir(path5):
    if name[-4:] == '.txt':
        f = open(path5 + "/" + name, "r")
        ids_5[name] = f.read()
        f.close()     

path6 = "/Users/myname/Documents/Data Classifier Files/6/"
for name in os.listdir(path6):
    if name[-4:] == '.txt':
        f = open(path6 + "/" + name, "r")
        ids_6[name] = f.read()
        f.close()    

path7 = "/Users/myname/Documents/Data Classifier Files/7/"
for name in os.listdir(path7):
    if name[-4:] == '.txt':
        f = open(path7 + "/" + name, "r")
        ids_7[name] = f.read()
        f.close()    

path8 = "/Users/myname/Documents/Data Classifier Files/8/"
for name in os.listdir(path8):
    if name[-4:] == '.txt':
        f = open(path8 + "/" + name, "r")
        ids_8[name] = f.read()
        f.close()   

path9 = "/Users/myname/Documents/Data Classifier Files/9/"
for name in os.listdir(path9):
    if name[-4:] == '.txt':
        f = open(path9 + "/" + name, "r")
        ids_9[name] = f.read()
        f.close()         

feats_1 = [(word_feats(ids_1[f]), '1') for f in ids_1 ]
feats_2 = [(word_feats(ids_2[f]), "2") for f in ids_2 ]
feats_3 = [(word_feats(ids_3[f]), '3') for f in ids_3 ]
feats_4 = [(word_feats(ids_4[f]), '4') for f in ids_4 ]
feats_5 = [(word_feats(ids_5[f]), '5') for f in ids_5 ]
feats_6 = [(word_feats(ids_6[f]), '6') for f in ids_6 ]
feats_7 = [(word_feats(ids_7[f]), '7') for f in ids_7 ]
feats_8 = [(word_feats(ids_8[f]), '8') for f in ids_8 ]
feats_9 = [(word_feats(ids_9[f]), '9') for f in ids_9 ]



trainfeats = feats_1 + feats_2 + feats_3 + feats_4 + feats_5 + feats_6 + feats_7 + feats_8 + feats_9
classifier = NaiveBayesClassifier.train(trainfeats)
A Gross
    The `u'word'` just indicates it is using Unicode encoding for the string (which is normal). – user812786 Sep 13 '16 at 15:31
  • I'm not sure how you're running that code, since variable names starting with a number aren't legal in Python. – L3viathan Sep 13 '16 at 15:32
  • `list(set(stopwords.words('english')))` why the additional `list`, which makes the later lookup `O(n)` instead of `O(1)`? – dhke Sep 13 '16 at 15:33
  • I changed the variable names just to hide the data that I'm working on. The actual names do not start with numbers. And I added the additional list to try to add more words to the stopset. When I remove the list, the append call doesn't work anymore. – A Gross Sep 13 '16 at 15:51
  • @AGross: ok but now your code is not executable, hence a total pain to reproduce. You could change `#_feats` -> `feats_#` and `#_ids` -> `ids_#`. Better is to make `path[]`, `feats[]`, `ids[]` each arrays of length 10, i.e. vectorize the code. Your boilerplate file-reading code can also be vectorized, as sketched below (it also makes it way shorter). – smci Sep 13 '16 at 20:57
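
For illustration, a minimal sketch of that vectorised version (paths taken from the question; word_feats as defined above):

import os

base_dir = "/Users/myname/Documents/Data Classifier Files"
ids = {}  # label -> {filename: file contents}, replacing ids_1 .. ids_9
for label in range(1, 10):
    ids[label] = {}
    folder = os.path.join(base_dir, str(label))
    for name in os.listdir(folder):
        if name.endswith('.txt'):
            with open(os.path.join(folder, name)) as f:
                ids[label][name] = f.read()

# the nine feats_# lists collapse into one comprehension
trainfeats = [(word_feats(text), str(label))
              for label, files in ids.items()
              for text in files.values()]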

1 Answer


After executing these three lines,

stopset = list(set(stopwords.words('english')))
morewords = 'delivery', 'shipment', 'only', 'copy', 'attach', 'material'
stopset.append(morewords)

have a look at stopset (output shortened):

>>> stopset
[u'all',
 u'just',
 u'being',
 ...
 u'having',
 u'once',
 ('delivery', 'shipment', 'only', 'copy', 'attach', 'material')]

The additional entries from morewords aren't on the same level as the previous words: instead, the whole tuple of words is seen as a single stop word, which makes no sense.

The reason for that is simple: list.append() adds one element, list.extend() adds many.

So, change stopset.append(morewords) to stopset.extend(morewords).
Or even better, keep the stop words as a set, for faster lookup. The right method to add multiple elements is set.update():

stopset = set(stopwords.words('english'))
morewords = ['delivery', 'shipment', 'only', 'copy', 'attach', 'material']
stopset.update(morewords)
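
With stopset now a flat set, the word_feats function from the question filters as intended; for example, both the added morewords and NLTK's English stop words are dropped:

>>> word_feats('please attach the delivery copy')
{'please': True}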
lenz
  • Btw., you should definitely use a tokeniser. Don't do `words.split()` with natural language text. – lenz Sep 13 '16 at 20:12
  • That works much better, thank you! Is tokenizing as simple as importing word_tokenize and changing words.split() to word_tokenize(words)? – A Gross Sep 13 '16 at 20:34
  • If you're happy with the default tokeniser (and if your texts are written in English), then it is indeed that simple. – lenz Sep 13 '16 at 21:36
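
For reference, here is what that change looks like, a minimal sketch assuming NLTK's punkt tokeniser data is installed (nltk.download('punkt')):

from nltk.tokenize import word_tokenize

def word_feats(words):
    # word_tokenize splits off punctuation and contractions that
    # str.split() would leave attached to the surrounding words
    return dict([(word, True) for word in word_tokenize(words)
                 if word not in stopset])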