
I have been trying to perform sentiment analysis on a movie reviews dataset, and I am stuck at a point where I am unable to remove English stopwords from the data. What am I doing wrong?

from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
– ykombinator

4 Answers


I think your code should mostly work as given. My assumption is that the data has an extra space after each comma, so tokens like ' am' never match the stopword 'am'; calling .strip() on each item fixes that. Below is the test I ran (hope it helps!):

import pandas as pd
from nltk.corpus import stopwords

stop = stopwords.words('english')

# two-row toy frame standing in for the movie-review data
dataset = pd.DataFrame([{'Content': 'i, am, the, computer, machine'},
                        {'Content': 'i, play, game'}])
print(dataset)
list_ = []
for file_ in dataset:  # iterating a DataFrame yields column names, so this runs once here
    # .strip() drops the space left after splitting on ',' so items match the stopword list
    dataset['Content'] = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

print(dataset)

Input with stopwords:

                          Content
0   i, am, the, computer, machine
1                   i, play, game

Output:

               Content
0  [computer, machine]
1         [play, game]
– niraj

Well, from your comment I think you don't need to loop over dataset at all (maybe dataset contains only the single column named Content).

You can simply do:

 dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item for item in x if item not in stop])

You are looping over dataset, but appending the whole frame each time and never using file_. Try:

from nltk.corpus import stopwords
stop = stopwords.words("english")  # note: the corpus fileid is lowercase "english"
dataset['Cleaned'] = dataset['Content'].apply(lambda x: ','.join(item.strip() for item in x.split(',') if item.strip() not in stop))

That returns a Series of comma-separated strings. If you would rather keep a list of words per row, drop the ','.join(...), and to flatten those per-row lists into a single list of words:

word_lists = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
flat_list = [item for sublist in word_lists for item in sublist]

With a hat tip to Making a flat list out of list of lists in Python

– tvashtar
  • I get a `TypeError: string indices must be integers` for this code as well. `dataset` is type `DataFrame` btw. – ykombinator Jun 26 '17 at 03:00
  • Ah ok, that wasn't clear, and what was the form of result you wanted? A single list of words, or a list per row? – tvashtar Jun 26 '17 at 03:15
  • I updated my answer to give you both options. I'm assuming dataset['Content'] elements contain a comma-separated list of words; if not, please give an example dataset. – tvashtar Jun 26 '17 at 03:21
  • And to clarify, you were getting those errors in both examples because iterating over a DataFrame actually iterates over the columns, not the rows (see the short sketch after these comments). For rows you can use iterrows, but in this case you can just use apply as shown, since iterrows returns tuples. You could also iterate over the index of dataset if you really wanted to keep something like your original code. – tvashtar Jun 26 '17 at 03:25
  • Yes, the dataset is a comma-separated dataframe of movie reviews; punctuation has been removed from each row. Expected output: if 3 rows have, say, ~50 words each and contain 2, 5, and 7 stopwords respectively, the output should be a comma-separated dataframe of 48, 45, and 43 words. – ykombinator Jun 26 '17 at 03:35
  • Got it, ok modified the answer above to recombine the list of words to a comma separated list in each row – tvashtar Jun 26 '17 at 11:23
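
A minimal sketch of the iteration behaviour described in the comments above (the toy frame and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Content': ['i, play, game'], 'Label': [1]})

for col in df:                  # plain iteration yields the column names
    print(col)                  # -> Content, Label

for idx, row in df.iterrows():  # iterrows yields (index, row-Series) tuples
    print(idx, row['Content'])  # -> 0 i, play, game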

Try earthy:

>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)

See https://github.com/alvations/earthy/blob/master/FAQ.md#what-else-can-earthy-do

– alvas