I am reading in a CSV file and then getting the words in it. I am trying to remove stop words. Here is my code.

import pandas as pd
from nltk.corpus import stopwords as sw

def loadCsv(fileName):
    df = pd.read_csv(fileName, error_bad_lines=False)
    df.dropna(inplace = True)
    return df

def getWords(dataframe):
    words = []
    for tweet in dataframe['SentimentText'].tolist():
        for word in tweet.split():
            word = word.lower()

        words.append(word)

    return set(words) #Create a set from the words list

def removeStopWords(words):
    for word in words: # iterate over word_list
        if word in sw.words('english'): 
            words.remove(word) # remove word from filtered_word_list if it is a stopword

    return set(words)

df = loadCsv("train.csv")
words = getWords(df)
words = removeStopWords(words)

On this line

if word in sw.words('english'):

I get the following error.

exception: no description

Further down the line I am going to try to remove punctuation, so any pointers for that would be great too. Any help is much appreciated.

EDIT

def removeStopWords(words):
    filtered_word_list = words #make a copy of the words
    for word in words: # iterate over words
        if word in sw.words('english'): 
            filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

    return set(filtered_word_list)
Dan Murphy
  • There is a problem in the `removeStopWords` as you are modifying a list that you are iterating over. Not sure if that's what is causing your problem, but could you replace the body of that function with just: `return set([w for w in words if not w in sw.words('english')])` ? – sal Oct 23 '18 at 04:44
  • @sal I tried just that line as the method and I still get the same error. See I read in all words into a list and then want to modify this list by removing the stop words, is this possible or am I going about this wrong? – Dan Murphy Oct 23 '18 at 04:48
  • Removing items from a list that you are iterating over is a no-go (in general). The best way is to generate a new list from the old by using a list comprehension or other way. I am puzzled, because I tried with a simple sentence, and it works. – sal Oct 23 '18 at 04:53
  • @sal Well, this is really strange then. I edited my method because of your suggestion (it's in the post), but I'm still getting the same error on the if statement. Is there something I'm missing, like an import? – Dan Murphy Oct 23 '18 at 05:01
  • `filtered_word_list = words` doesn't make a copy: those are pointing to the same list, and hence not resolving the problem I pointed out. So you're saying that `def removeStopWords(words): return set([w for w in words if not w in sw.words('english')])` doesn't resolve the problem? (fix the indentation) – sal Oct 23 '18 at 05:03
  • @sal Ah right, I didn't know it worked like that. And yes, that gives me the same "Exception: no description" error. What does this error even mean? – Dan Murphy Oct 23 '18 at 05:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182312/discussion-between-sal-and-dan-murphy). – sal Oct 23 '18 at 05:11
  • Duplicate of https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python – alvas Oct 24 '18 at 03:18
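
To illustrate the two points raised in the comments above, here is a minimal sketch. It assumes a small hypothetical `STOP_WORDS` set standing in for `sw.words('english')` so that it runs without the NLTK corpus, and the function names are made up for the illustration. It shows that plain assignment only creates a second name for the same object, and that removing items from a set while iterating over it raises a `RuntimeError` in CPython, whereas building a new set does not.

```python
# Hypothetical stand-in for sw.words('english'), so the sketch runs without NLTK data.
STOP_WORDS = {"this", "is", "a"}

def remove_in_place(words):
    """Mimics the approach in the question: remove from the set being iterated."""
    for word in words:
        if word in STOP_WORDS:
            words.remove(word)
    return words

def remove_with_new_set(words):
    """Builds a new set and leaves the input untouched."""
    return {w for w in words if w not in STOP_WORDS}

words = {"this", "is", "a", "sentence"}

alias = words              # a second name for the same set, not a copy
independent = set(words)   # an actual copy
print(alias is words)        # True
print(independent is words)  # False

try:
    remove_in_place(set(words))
    failed = False
except RuntimeError as exc:
    failed = True
    print("RuntimeError:", exc)

filtered = remove_with_new_set(words)
print(filtered)
```

Whether this is the cause of the "exception: no description" message is not confirmed by the thread; some IDEs show a terse message for exceptions like this one.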

3 Answers

0

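Replace the removeStopWords function with the following, and call it in place of removeStopWords:

from nltk.corpus import stopwords as sw

def getFilteredStopWords(words):
    # Build the stop word set once; set membership tests are fast
    stop_words = set(sw.words('english'))
    # Keep only the words that are not stop words
    filtered_words = [w for w in words if w not in stop_words]
    return filtered_words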
myhaspldeep
0

Here is a simplified version of the problem, without pandas. I believe the issue with the original code is that it modifies the set `words` while iterating over it. With a conditional list comprehension, we can test each word and build a new list, then convert that list into a set, as in the original code.

from nltk.corpus import stopwords as sw

def removeStopWords(words):
    stop_words = set(sw.words('english'))  # load the stop word list once
    return set([w for w in words if w not in stop_words])

sentence = 'this is a very common english sentence with a finite set of words from my imagination'
words = set(sentence.split())
print(removeStopWords(words))
sal
0
from nltk.corpus import stopwords

def remove_stopwords(sentence):
    # Load the English stop words once, as a set for fast lookups
    stop_words = set(stopwords.words('english'))
    words = sentence.split(' ')
    filtered_words = [w for w in words if w not in stop_words]
    # Rejoin the remaining words into a single string
    return ' '.join(filtered_words)
  • While this code snippet may be the solution, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Neo Anderson Sep 03 '20 at 17:55
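
The question also asks for pointers on removing punctuation, which none of the answers above address. One possible approach, sketched here with only the standard library, is to delete every character in `string.punctuation` using `str.translate`. The function name `strip_punctuation` is hypothetical, and the sketch assumes ASCII punctuation is all that needs removing.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(words):
    """Return a new set of words with punctuation removed, dropping empty results."""
    cleaned = (w.translate(_PUNCT_TABLE) for w in words)
    return {w for w in cleaned if w}

result = strip_punctuation({"hello,", "world!", "...", "don't"})
print(result)
```

Note that apostrophes are removed as well, so "don't" becomes "dont". That may matter if the stop word list contains contractions, in which case filtering stop words before stripping punctuation is probably the safer order.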