My remove_stopwords function removes stop words from inside every word

Question

I'm trying to remove all the stop words from a text file. The problem is that my function strips stop words out of the inside of every word instead of removing only whole words.

def remove_stopwords(input):
    stop_words  = set(stopwords.words('english'))
    filtered_words = [word for word in input if not word in stop_words]
    return filtered_words

Sample Input: Damage from Typhoon Lando soars to P6B
Output: Dge fr Tphn Ln r  P6B

If `input` is a string... you need to break it up into words... eg: `[word for word in input.split() if word not in stop_words]`, then do what you want with the resultant list... otherwise you're iterating over each character and removing that where that character exists in the stop words. — Jon Clements, Sep 30 '17 at 09:36
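A minimal sketch of the difference described in the comment above. `STOP_WORDS` is a small hand-picked stand-in for `set(stopwords.words('english'))`, and both function names are hypothetical, chosen only for illustration:

```python
# Assumption: a tiny stand-in for NLTK's English stop word set, which
# also contains one-letter entries such as 'a', 'o', 's' and 't'.
STOP_WORDS = {"a", "d", "from", "i", "m", "o", "s", "t", "to", "y"}


def remove_stopwords_by_char(text):
    # Iterating over a str yields single characters, so every character
    # that happens to be a one-letter stop word is dropped.
    return [ch for ch in text if ch not in STOP_WORDS]


def remove_stopwords_by_word(text):
    # Splitting first means whole words are compared against the set.
    return [word for word in text.split() if word not in STOP_WORDS]


sample = "Damage from Typhoon Lando soars to P6B"
by_char = "".join(remove_stopwords_by_char(sample))
by_word = remove_stopwords_by_word(sample)

print(by_char)  # Dge fr Tphn Ln r  P6B
print(by_word)  # ['Damage', 'Typhoon', 'Lando', 'soars', 'P6B']
```

The character-wise version reproduces the output from the question, while the word-wise version drops only `from` and `to`.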

alvas · Accepted Answer · 2017-10-02T02:21:05.863

2

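Tokenize your `str` input before removing stop words. Iterating over a string directly yields single characters, so one-letter stop words such as `a` or `o` get removed from inside every word.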

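from nltk.corpus import stopwords
from nltk import word_tokenize

# Build the set once; this needs the NLTK 'stopwords' corpus to be downloaded.
stoplist = set(stopwords.words('english'))

def remove_stopwords(text):
    # word_tokenize splits the string into tokens, so whole words are compared.
    return [word for word in word_tokenize(text) if not word in stoplist]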

edited Oct 02 '17 at 02:21

answered Sep 30 '17 at 11:23

alvas


1

Just to note `not word in` vs `word not in`... https://stackoverflow.com/questions/8738388/order-of-syntax-for-using-not-and-in-keywords (`not in` is generally considered to be clearer as it's closer to English - although there's no real technical difference) – Jon Clements Sep 30 '17 at 11:51
I've changed the `not word in` -> `word not in`. Then again, I thought maybe this would be great to point people to https://stackoverflow.com/questions/8738388/order-of-syntax-for-using-not-and-in-keywords and so I changed it back to `not word in`. Thanks @JonClements for pointing this out ;P – alvas Oct 02 '17 at 02:24
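For readers following the `not word in` vs `word not in` exchange, here is a short sketch showing that both spellings filter a list identically. The sample data and variable names are made up for illustration:

```python
# Assumption: a two-word stand-in for the real stop word set.
stop_words = {"from", "to"}
words = ["Damage", "from", "Typhoon", "to"]

# `not w in stop_words` is parsed as `not (w in stop_words)`.
old_style = [w for w in words if not w in stop_words]

# `not in` is a single operator and reads closer to English.
new_style = [w for w in words if w not in stop_words]

print(old_style == new_style)  # True
```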
