
I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought would have been removed by this process. Any ideas as to what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words
saph_top
  • Possible duplicate of [Stopword removal with NLTK](http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) – Salvador Dali Apr 02 '17 at 07:02

1 Answer


A few things of note.

  • If you are going to be checking membership over and over, I would use a set instead of a list; set lookups take constant time on average, while a list has to be scanned from the start each time.

  • stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason (see the quick check after this list).

  • You aren't reading the file properly: you are iterating over the file object, which yields whole lines, rather than over a list of the words split on whitespace.
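
To see the second point concretely, here is a quick check you can run in the interpreter (assuming the stop word corpus has already been fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

print("the" in stops)           # True  - every entry in the corpus is lowercase
print("The" in stops)           # False - so a capitalised token slips through the filter
print("The".lower() in stops)   # True  - lowercase the token before checking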

Putting it all together:

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
stops = set(stopwords.words('english'))  # set of lowercase stop words for fast membership checks

for line in word_list:
    for w in line.split():               # split each line into individual words
        if w.lower() not in stops:       # lowercase the word before comparing
            print w
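
If you would rather collect the filtered words into a list, as the filtered_words variable in the question intended, a minimal sketch along the same lines:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

filtered_words = []
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        # keep the words on each line that are not stop words
        filtered_words.extend(w for w in line.split() if w.lower() not in stops)

print(filtered_words)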
Hooked
    Note that you still aren't filtering for punctuation, you'll want to remove things like `';"{}[]/?.,!` for example. – Hooked Mar 31 '14 at 14:08
  • brilliant that worked, must have been reading over the file incorrectly, thanks. – saph_top Mar 31 '14 at 14:16
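
Following up on the punctuation point in the comments, one way to handle it (a sketch, not part of the original answer) is to strip punctuation from each token with the standard library's string.punctuation before the stop word check:

import string
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        for w in line.split():
            # strip leading/trailing punctuation, then lowercase before comparing
            token = w.strip(string.punctuation).lower()
            if token and token not in stops:
                print(token)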