
So I've been stuck on this problem for days and I would appreciate some help. I have a dataframe with the following columns:

 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----   
0   PhraseId    93636 non-null  int64   
1   SentenceId  93636 non-null  int64   
2   Phrase      93636 non-null  object  
3   Sentiment   93636 non-null  int64 

The sentiment is from 0 to 4, which basically rates the phrase from good to bad. I added two columns which might be of help: the number of words in each phrase, and each phrase split into a list of the words it contains.
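For context, this is roughly how I added those two columns (a minimal sketch with made-up rows, since I can't paste the whole dataframe):

```python
import pandas as pd

# Hypothetical sample standing in for train_data
train_data = pd.DataFrame({
    'PhraseId': [1, 2],
    'SentenceId': [1, 1],
    'Phrase': ['Build some robots', 'A great film'],
    'Sentiment': [0, 4],
})

# Split each phrase into its words, then count them
train_data['SplitPhrase'] = train_data['Phrase'].str.split()
train_data['NumOfWords'] = train_data['SplitPhrase'].str.len()

print(train_data[['SplitPhrase', 'NumOfWords']])
```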

What I want to do is create 4 bar graphs (one for each sentiment) showing the top 15 most repeated words for that sentiment. The x axis would be the top 15 words repeated in that sentiment.

Below is code I wrote that counts how many times each word is repeated for each sentiment. That will probably be needed for the bar graphs.

Sample data:

       PhraseId SentenceId  Phrase                Sentiment SplitPhrase  NumOfWords
44723   75358   3866        Build some robots...    0   [Build, some, robots...] 52

To count how many times a word is repeated for each sentiment:

from collections import Counter

# one Counter per sentiment value; each maps word -> frequency
counters = {}
for sentiment in train_data['Sentiment'].unique():
    counters[sentiment] = Counter()
    mask = (train_data['Sentiment'] == sentiment)
    for phrase in train_data.loc[mask, 'SplitPhrase']:
        counters[sentiment].update(phrase)

print(counters)

Sample output:

{2: Counter({'the': 28041, ',': 25046, 'a': 19962, 'of': 19376, 'and': 19052, 'to': 13470, '.': 10505, "'s": 10290, 'in': 8108, 'is': 8012, 'that': 7276, 'it': 6176, 'as': 5027, 'with': 4474, 'for': 4362, 'its': 4159, 'film': 3933......}),
 3: Counter({'the': 28041, ',': 25046, 'a': 19962,.....
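I also noticed that `Counter` has a `most_common` method, which should give the top entries for a sentiment directly (a quick sketch with made-up counts, not my real data):

```python
from collections import Counter

# Hypothetical counter for one sentiment
c = Counter({'the': 28041, ',': 25046, 'a': 19962, 'of': 19376})

top = c.most_common(3)        # list of (word, count) pairs, highest count first
words, counts = zip(*top)
print(words, counts)
```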
soup
    Your explanation makes sense; however, please include sample data, not just the output of `df.info()`. Please see this link on how to ask a good `pandas` question: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson May 12 '21 at 21:11
  • Okay, thank you, I attached an image of the sample data – soup May 12 '21 at 21:23
  • no images! Please `read` the link I shared :) – David Erickson May 12 '21 at 21:24
  • I edited again, hopefully this is better. I also altered my question a little because I found a way to count how many times a word is repeated for each sentiment. I now need to create a bar graph based on that. – soup May 12 '21 at 21:45

1 Answer


You could use pandas `groupby` to split the dataframe by sentiment. Then, for each group, join the `Phrase` column text, split it into words, and count the occurrences of each unique word in that sentiment group. Sort the resulting `(word, count)` list by frequency (`key=lambda i: i[1]`) and slice to keep the top 15 words. For the bar graphs you can use Matplotlib's `plt.bar`, passing the lists of words and frequencies.

Sample from dataframe.csv

                                               Phrase  PhraseId  SentenceId  Sentiment
0   Live as if you were to die tomorrow. Learn as ...     15795        2568          3
1       That which does not kill us makes us stronger       860       62592          3
2   Be who you are and say what you feel, because ...     76820       67563          0
..                                                ...       ...         ...        ...
97  Others can stop you temporarily – you are the ...     61228       73530          2
98  Life has no limitations, except the ones you make     48984       93557          3
99    Peace comes from within. Do not seek it without     40774       61087          3
[100 rows x 4 columns]

import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('dataframe.csv')
print(df)

def remove_stop_words(txt):
    # lowercase and strip punctuation
    txt = re.sub(r'[,.;!?-]', '', txt.lower())
    # remove a small, hand-picked stop-word list (whole words only)
    stop_words = ['the', 'at', 'in', 'of', 'a', 'is', 'to', 'by']
    stop_boundary = r'\b' + r'\b|\b'.join(stop_words) + r'\b'
    return re.sub(stop_boundary, '', txt)

MAX_WORDS = 15
SENTIMENT = ['Bad', 'Poor', 'Good', 'Excellent']

for n, g in df.groupby('Sentiment'):
    all_text = ' '.join(g['Phrase'].values)

    # optionally, clean txt and remove stop words
    clean_text = remove_stop_words(all_text)

    # find most frequent words
    split_txt = clean_text.split()
    word_count = [(word, split_txt.count(word)) for word in np.unique(split_txt)]
    word_count = sorted(word_count, key=lambda i: i[1], reverse=True)[:MAX_WORDS]
    x, y = zip(*word_count)

    # plot graph
    plt.subplot(2,2,n+1)
    plt.bar(x,y)
    plt.title(SENTIMENT[n])
    plt.ylabel('Count')
    plt.xticks(rotation=45)

plt.show()
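As a side note, the list-comprehension count above recounts the whole word list for every unique word; `collections.Counter` does the same job in a single pass and already returns the pairs sorted by frequency (a sketch over a made-up token list standing in for `split_txt`):

```python
from collections import Counter

# Hypothetical token list standing in for split_txt above
split_txt = ['film', 'good', 'film', 'plot', 'good', 'film']

# Counter tallies every word in one pass; most_common sorts by frequency
word_count = Counter(split_txt).most_common(2)
print(word_count)  # -> [('film', 3), ('good', 2)]
```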

[image: bar graphs of the top 15 words for each sentiment]

n1colas.m