
I am having difficulties removing some stopwords (the default stopwords plus other words added manually) from a plot. This question is related to two other questions I asked earlier.

Raw data:

    Date                   Sentences
0   02/06/2020   That's the word some researcher us...
1   02/06/2020   A top official with the World Wide...
2   02/06/2020   Asymptomatic spread is the trans...
3   02/07/2020   "I don't want anyone to get con...
4   02/07/2020   And, separately, how many of th...
... ... ...
65  02/09/2020  its 'very rare' comment on asymp...
66  02/09/2020  The rapid spread of the virus t...

This is an exercise in Text Mining and Analytics. What I have been trying to do is collect the most frequent words for each date. To do this I tokenised the sentences, saving the result in a new column called 'Clean'. I used two functions, one for removing stopwords and one for cleaning the text.

Code:

import nltk
from nltk.corpus import stopwords

def remove_stopwords(text):
    # extra_stops are words that may not be useful for the analysis,
    # e.g. 'spread' in the example above, so they can be removed as well
    stop_words = stopwords.words('english') + extra_stops
    c_text = []

    for i in text.lower().split():
        if i not in stop_words:
            c_text.append(i)

    return ' '.join(c_text)
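
Here `extra_stops` is simply a list of additional, domain-specific words I want excluded; for example:

extra_stops = ['spread']  # example: extra words that are not useful for the analysis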

import string

def clean_text(file):

    # remove punctuation, but keep apostrophes
    punct = string.punctuation.replace("'", '')
    punc = r'[{}]'.format(punct)

    remove_words = list(stopwords.words('english')) + list(my_stop) + list(extra_stops)

    # clean text
    file.Clean = file.Clean.str.replace(r'\d+', '', regex=True)  # remove all numbers
    file.Clean = file.Clean.str.replace(punc, ' ', regex=True)   # remove punctuation
    file.Clean = file.Clean.str.strip()
    file.Clean = file.Clean.str.lower().str.split()

    file.dropna(inplace=True)
    file.Clean = file.Clean.apply(lambda x: [word for word in x if word not in remove_words])

    return file.Clean

where Clean is defined by:

df4['Sentences'] = df4['Sentences'].astype(str)
df4['Clean'] = df4['Sentences']

After cleaning the text, I tried to group words by Date, selecting the top ones (the dataset is huge, so I only kept the top 4).

df4_ex = df4.explode('Clean')
df4_ex.dropna(inplace=True)
# sort by count first so that head(4) keeps the 4 most frequent words per date
df4_ex = (df4_ex.groupby(['Date', 'Clean']).size().to_frame('count')
                .sort_values('count', ascending=False)
                .groupby('Date').head(4))

Then I applied the following code for plotting stacked bars reporting the most frequent words (I found the code on Stack Overflow; since it was not built from scratch by me, it is possible that I missed some parts before plotting):

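I believe the step that actually builds `df_gb` went missing when I adapted the code; my best guess at it is something like this:

import matplotlib.pyplot as plt

# guess at the missing step: unstack so there is one column per word
# (the two-level columns are what x[1] picks apart below); NaN -> 0
df_gb = df4_ex.unstack().fillna(0)
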
# create list of words of appropriate length; all words repeat for each date
cols = [x[1] for x in df_gb.columns for _ in range(len(df_gb))]

# plot df_gb
ax = df_gb.plot.bar(stacked=True)

# annotate the bars
for i, rect in enumerate(ax.patches):
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()

    # The height of the bar is the count value and can be used as the label
    label_text = f'{height:.0f}: {cols[i]}'

    label_x = x + width / 2
    label_y = y + height / 2

    # don't include label if it's effectively 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)

# rename xtick labels; remove time
ticks = ax.get_xticks()
labels = [label.get_text()[:10] for label in ax.get_xticklabels()]
plt.xticks(ticks=ticks, labels=labels)

ax.get_legend().remove()
plt.show()

However, even after adding new words to exclude, I still see the same words on the plot, which means they were not correctly removed.

Since I cannot figure out where the error is, I hope you can help me. Thank you in advance for all the help and time you will spend helping me.

  • Can you post the specific error? That will go a long way toward being able to help. Rather than trying to understand all of the code, someone can just spot the error, make an adjustment and go from there. – David Erickson Jun 13 '20 at 01:34
  • I do not get any error. The plot shows stopwords and words that I wanted to exclude (for example with `extra_stops='spread'`) – still_learning Jun 13 '20 at 01:35
  • Aaah, I see. I cannot reproduce it if there is code missing, so if possible try to include the full code, or at least an intermediate dataframe with the necessary input to test your last block of code. I'm receiving this error: `KeyError: 'Clean'` – David Erickson Jun 13 '20 at 01:48
  • My fault. It is `df4['Sentences'] = df4['Sentences'].astype(str); df4['Clean'] = df4['Sentences']`. Now I think it should work – still_learning Jun 13 '20 at 01:51
  • `NameError: name 'df_gb' is not defined` :) – David Erickson Jun 13 '20 at 01:53
  • The code in the question never calls either function. – Trenton McKinney Jun 13 '20 at 02:00
  • Remove unnecessary code please. The problem is with stopword removal, so plot code is irrelevant. – taha Jun 13 '20 at 02:17

1 Answer


This might help:

import pandas, string, collections
from nltk.corpus import stopwords

extra = ['der', 'die', 'das']
STOPWORDS = {token.lower() for token in stopwords.words('english') + extra}
PUNCTUATION = string.punctuation

df = pandas.DataFrame({
    'Date': ['02/06/2020', '02/06/2020', '03/06/2020', '03/06/2020'],
    'Sentences': ["That's the word some tor researcher", 'A top official with the World Wide', 'The rapid spread of the virus', 'Asymptomatic spread is the transmition']
})

#### ----------- Preprocessing --------------
def remove_punctuation(input_string):
    for char in PUNCTUATION:
        input_string = input_string.replace(char, ' ')
    return input_string

def remove_stopwords(input_string):
    return ' '.join([word for word in input_string.lower().split() if word not in STOPWORDS])

def preprocess(input_string):
    no_punctuation = remove_punctuation(input_string)
    no_stopwords = remove_stopwords(no_punctuation)

    return no_stopwords

df['clean'] = df['Sentences'].apply(preprocess)

### ------------- Token Count -----------------
# count token frequencies within each date's group of sentences
group_counters = dict()
for date, group in df.groupby('Date'):
    group_counters[date] = group['clean'].apply(lambda x: pandas.value_counts(x.split())).sum(axis=0)

counter_df = pandas.concat(group_counters)

Output:

02/06/2020  researcher      1.0
            word            1.0
            tor             1.0
            world           1.0
            wide            1.0
            official        1.0
            top             1.0
03/06/2020  spread          2.0
            rapid           1.0
            virus           1.0
            transmition     1.0
            asymptomatic    1.0
dtype: float64
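
From here, if you want only the most frequent tokens per date (the top 4, as in your question) and a stacked bar plot, a possible continuation is the following sketch, using `nlargest` on each date's group:

import matplotlib.pyplot as plt

# sketch: keep the 4 most frequent tokens per date, then plot stacked bars
top_words = counter_df.groupby(level=0, group_keys=False).nlargest(4)
top_words.unstack().fillna(0).plot.bar(stacked=True)
plt.show()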