
I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought would have been removed by this process. Any ideas as to what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words
saph_top
  • Possible duplicate of [Stopword removal with NLTK](http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) – Salvador Dali Apr 02 '17 at 07:02

1 Answer


A few things of note.

  • If you are going to be checking membership over and over, I would use a set instead of a list; set lookups take constant time on average, while a list has to be scanned from the start each time.

  • stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason (see the quick check after this list).

  • You aren't reading the file properly: you are iterating over the file object, which yields whole lines, rather than over a list of the words split on whitespace.
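
To see the second point concretely, here is a quick check you can run in the interpreter (assuming the stop word corpus has already been fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

print("the" in stops)           # True  - every entry in the corpus is lowercase
print("The" in stops)           # False - so a capitalised token slips through the filter
print("The".lower() in stops)   # True  - lowercase the token before checking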

Putting it all together:

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
stops = set(stopwords.words('english'))  # set of lowercase stop words for fast membership checks

for line in word_list:
    for w in line.split():               # split each line into individual words
        if w.lower() not in stops:       # lowercase the word before comparing
            print w
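
If you would rather collect the filtered words into a list, as the filtered_words variable in the question intended, a minimal sketch along the same lines:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

filtered_words = []
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        # keep the words on each line that are not stop words
        filtered_words.extend(w for w in line.split() if w.lower() not in stops)

print(filtered_words)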
Hooked
    Note that you still aren't filtering for punctuation, you'll want to remove things like `';"{}[]/?.,!` for example. – Hooked Mar 31 '14 at 14:08
  • brilliant that worked, must have been reading over the file incorrectly, thanks. – saph_top Mar 31 '14 at 14:16
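
Following up on the punctuation point in the comments, one way to handle it (a sketch, not part of the original answer) is to strip punctuation from each token with the standard library's string.punctuation before the stop word check:

import string
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        for w in line.split():
            # strip leading/trailing punctuation, then lowercase before comparing
            token = w.strip(string.punctuation).lower()
            if token and token not in stops:
                print(token)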