So I'm trying to remove all the stop words from a text file. The problem is that it removes the stop words from inside each and every word.

def remove_stopwords(input):
    stop_words  = set(stopwords.words('english'))
    filtered_words = [word for word in input if not word in stop_words]
    return filtered_words

Sample Input: Damage from Typhoon Lando soars to P6B
Output: Dge fr Tphn Ln r  P6B
Jack-Jack
  • If `input` is a string, you need to break it up into words first, e.g. `[word for word in input.split() if word not in stop_words]`, then do what you want with the resulting list. Otherwise you're iterating over each character and removing every character that appears in the stop words. – Jon Clements Sep 30 '17 at 09:36
  • @JonClements thank you sir! – Jack-Jack Sep 30 '17 at 09:46
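
To make the comment above concrete, here is a minimal sketch of the two behaviours. It assumes a small hand-written `stop_words` set as a stand-in for `set(stopwords.words('english'))` (the real NLTK list is much longer and also contains single letters such as `a` and `o`), so the names and contents here are illustrative only.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"from", "to", "a", "m", "o", "s", "t", "d", "y"}

text = "Damage from Typhoon Lando soars to P6B"

# Iterating over a str yields characters, so every character that happens
# to be a single-letter stop word is stripped out of every word.
by_char = "".join(ch for ch in text if ch not in stop_words)

# Splitting first yields whole words, so only whole stop words are dropped.
by_word = [word for word in text.split() if word not in stop_words]

print(by_char)  # Dge fr Tphn Ln r  P6B
print(by_word)  # ['Damage', 'Typhoon', 'Lando', 'soars', 'P6B']
```

With this stand-in set, the character-level result reproduces the broken output from the question, while the word-level result keeps the words intact.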

1 Answer

Tokenize your string input before removing stop words; iterating over a `str` directly yields single characters, not words.

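# Requires the NLTK 'stopwords' corpus and tokenizer data, e.g.
# nltk.download('stopwords') and nltk.download('punkt').
from nltk.corpus import stopwords
from nltk import word_tokenize

# Build the set once so each membership test is a fast set lookup.
stoplist = set(stopwords.words('english'))

def remove_stopwords(text):
    # word_tokenize splits the string into word tokens before filtering.
    return [word for word in word_tokenize(text) if not word in stoplist]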
alvas
  • Just a note on `not word in` vs `word not in`: see https://stackoverflow.com/questions/8738388/order-of-syntax-for-using-not-and-in-keywords (`not in` is generally considered clearer because it reads closer to English, although there's no real technical difference). – Jon Clements Sep 30 '17 at 11:51
  • I've changed the `not word in` -> `word not in`. Then again, I thought maybe this would be great to point people to https://stackoverflow.com/questions/8738388/order-of-syntax-for-using-not-and-in-keywords and so I changed it back to `not word in`. Thanks @JonClements for pointing this out ;P – alvas Oct 02 '17 at 02:24
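
As a quick check of the point made in these comments, the sketch below filters the same word list with both spellings and compares the results. The `stop_words` set and `words` list are hypothetical and only serve the illustration.

```python
stop_words = {"from", "to"}  # hypothetical mini stop list

words = ["Damage", "from", "Typhoon", "Lando", "soars", "to", "P6B"]

# `not w in s` parses as `not (w in s)`, which selects the same items
# as the dedicated `w not in s` operator.
with_not_x_in = [w for w in words if not w in stop_words]
with_x_not_in = [w for w in words if w not in stop_words]

print(with_not_x_in == with_x_not_in)  # True
```

Both comprehensions keep the same words, so the choice between them is a matter of readability.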