I am having difficulty removing some stopwords (the default stopwords plus other words added manually) from a plot. This question is related to two other questions:
- for stopword removal, the reference is Remove stopwords from words frequency;
- for plotting, the reference is How to annotate a stacked bar chart with word count and column name?
Raw data:
Date Sentences
0 02/06/2020 That's the word some researcher us...
1 02/06/2020 A top official with the World Wide...
2 02/06/2020 Asymptomatic spread is the trans...
3 02/07/2020 "I don't want anyone to get con...
4 02/07/2020 And, separately, how many of th...
... ... ...
65 02/09/2020 its 'very rare' comment on asymp...
66 02/09/2020 The rapid spread of the virus t...
This is an exercise in text mining and analytics. What I have been trying to do is collect the most frequent words for each date. To do this, I tokenised the sentences and saved the result in a new column called 'Clean'. I used two functions, one for removing stopwords and one for cleaning the text.
Code:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stopwords corpus is not installed

def remove_stopwords(text):
    # extra_stops are words that may not be useful for the analysis, e.g. 'spread' in the example above
    stop_words = stopwords.words('english') + extra_stops
    c_text = []
    for i in text.lower().split():
        if i not in stop_words:
            c_text.append(i)
    return ' '.join(c_text)
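For completeness, extra_stops is just a plain Python list of additional words I want to drop; a minimal sketch of how it is defined and used (the actual words here are only examples, my real list is longer):

extra_stops = ['spread', 'asymptomatic', 'virus']   # example words only

sample = "Asymptomatic spread is the transmission of the virus"
print(remove_stopwords(sample))   # -> 'transmission'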
import string

def clean_text(file):
    # remove punctuation (but keep apostrophes)
    punct = string.punctuation.replace("'", '')
    punc = r'[{}]'.format(punct)
    remove_words = list(stopwords.words('english')) + list(my_stop) + list(extra_stops)
    # clean text
    file.Clean = file.Clean.str.replace(r'\d+', '', regex=True)  # remove all numbers
    file.Clean = file.Clean.str.replace(punc, ' ', regex=True)   # remove punctuation
    file.Clean = file.Clean.str.strip()
    file.Clean = file.Clean.str.lower().str.split()
    file.dropna(inplace=True)
    file.Clean = file.Clean.apply(lambda x: [word for word in x if word not in remove_words])
    return file.Clean
where Clean is defined by:
df4['Sentences'] = df4['Sentences'].astype(str)
df4['Clean'] = df4['Sentences']
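The cleaning function is then applied to fill the Clean column (a minimal sketch of the call; my_stop and extra_stops are plain lists of words defined earlier in my notebook):

# my_stop and extra_stops are lists of extra words to drop, defined earlier
df4['Clean'] = clean_text(df4)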
After cleaning the text, I tried to group the words by Date and select the top ones (the dataset is huge, so I only kept the top 4 per date).
df4_ex = df4.explode('Clean')
df4_ex.dropna(inplace=True)
df4_ex = df4_ex.groupby(['Date', 'Clean']).agg({'Clean': 'count'}).groupby('Date').head(4)
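The plotting code below expects a wide table called df_gb, with one row per date and one (MultiIndex) column per word; this is roughly how I reshape the grouped counts into that form (a sketch of the unstack step):

# reshape the grouped counts: rows = dates, columns = ('Clean', word)
df_gb = df4_ex.unstack(level='Clean')
df_gb = df_gb.fillna(0)   # a word that does not appear on a given date counts as 0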
Then I applied the code for plotting stacked bars with the most frequent words, as follows (I found the code on Stack Overflow; since it was not built from scratch by me, it is possible that I missed some parts before plotting):
import matplotlib.pyplot as plt

# create list of words of appropriate length; all words repeat for each date
cols = [x[1] for x in df_gb.columns for _ in range(len(df_gb))]

# plot df_gb
ax = df_gb.plot.bar(stacked=True)

# annotate the bars
for i, rect in enumerate(ax.patches):
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()

    # The height of the bar is the count value and can be used as the label
    label_text = f'{height:.0f}: {cols[i]}'

    label_x = x + width / 2
    label_y = y + height / 2

    # don't include label if it's equivalently 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)

# rename xtick labels; remove time
ticks = ax.get_xticks()
labels = [label.get_text()[:10] for label in ax.get_xticklabels()]
plt.xticks(ticks=ticks, labels=labels)

ax.get_legend().remove()
plt.show()
However, even after adding new words to the exclusion list, the same word still shows up on the plot, which means it was not correctly removed.
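For reference, this is the kind of quick check I would expect to pass after cleaning (a minimal sketch, assuming 'spread' is one of the words in extra_stops):

# the word should no longer appear in any tokenised sentence; expected output: False
print(df4['Clean'].apply(lambda words: 'spread' in words).any())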
Since I cannot figure out where the error is, I hope you can help me. Thank you in advance for your help and time.