This is a sample of my pandas DataFrame, which contains 30,000 rows (excluding column headers). The Expression column has two classes, Sad and Happy.

Expression              Description
Sad                     "people are sad because they got no money."
Happy                   "people are happy because ..."
Sad                     "people are miserable because they broke up"
Happy                   "They got good money"

Based on the example above, I would like to count word frequencies: the number of times each word occurs in the descriptions of the "Sad" and "Happy" expressions, stored in a dictionary, e.g. {sad:{people:2}, happy:{happy:1}}

This is my code:

def calculate_word_frequency(lst, classes):
    #variable
    wordlist = []
    dict_output = {}
    count = 0
    term = ""

    data = [lst.columns.values.tolist()] + lst.values.tolist() #to convert into a list

    for i in range(1,len(data)):
        if data[i][0] == classes[0]:
            wordlist = data[i][1].lower().split(" ")

            for words in wordlist:
                wordlist.append(words)

                for word in wordlist:
                    if word in dict_output:
                        dict_output[wordlist] += 1
                    else:
                        dict_output[wordlist] == 1
                        print(dict_output)

The expected output is the number of times each word appears in the descriptions of each Expression.

#Test case:
words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
#output: 538212

print(freqs_per_expression["sad"]["people"]) #output: 203

Because of the size of the dataset, my VS frequently hangs and lags, so I am unable to retrieve any results. Is there a better technique I can use to get the desired {word:count} data?

Thank you!

yan

2 Answers

We can achieve the desired result in a few steps. If you are using pandas >= 0.25, you can use the new explode function; otherwise, the solution below will achieve what you want.

from collections import defaultdict

import pandas as pd

exploded = df.set_index('Expression') \
             .stack() \
             .str.split(' ', expand=True) \
             .stack() \
             .reset_index() \
             .drop(['level_1', 'level_2'], axis=1) \
             .rename(columns={0: 'Word'})

print(exploded)

   Expression       Word
0         Sad     people
1         Sad        are
2         Sad        sad
3         Sad    because
4         Sad       they
...

# Count each (Expression, Word) pair; size() with reset_index(name=...) avoids
# depending on how value_counts() names its result column
counts = exploded.groupby(['Expression', 'Word']).size() \
                 .reset_index(name='Count').to_dict('records')

d = defaultdict(dict)

# Fold the flat records into {expression: {word: count}}
for rec in counts:
    d[rec['Expression']][rec['Word']] = rec['Count']

print(d)

defaultdict(dict,
            {'Happy': {'...': 1,
              'They': 1,
              'are': 1,
              'because': 1,
              'good': 1,
              'got': 1,
              'happy': 1,
              'money': 1,
              'people': 1},
             'Sad': {'are': 2,
              'because': 2,
              'broke': 1,
              'got': 1,
              'miserable': 1,
              'money.': 1,
              'no': 1,
              'people': 2,
              'sad': 1,
              'they': 2,
              'up': 1}})
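
For pandas >= 0.25, the explode route mentioned at the top could look roughly like the sketch below. It is only an illustration on the four sample rows from the question; the names `exploded`, `counts` and `freqs` are placeholders of mine, and the words are not lower-cased or stripped of punctuation.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

# One row per (Expression, Word): split each description into a list of
# words, then explode the lists into separate rows
exploded = (
    df.assign(Word=df["Description"].str.split(" "))
      .explode("Word")[["Expression", "Word"]]
)

# Count each (Expression, Word) pair, then fold the counts into a nested dict
counts = exploded.groupby(["Expression", "Word"]).size()
freqs = {}
for (expression, word), n in counts.items():
    freqs.setdefault(expression, {})[word] = int(n)

print(freqs["Sad"]["people"])  # 2
```

Because the splitting and counting happen inside pandas, this should scale to 30,000 rows far better than nested Python loops over a list of rows.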
gold_cy

Here is an example; maybe it helps you complete your code:

from collections import Counter
from io import StringIO
import pandas as pd

data = """
Expression,Description
Sad,"people are sad because they got no money."
Happy,"people are happy because ..really."
Sad,"people are miserable because they broke up"
Happy,"They got good money"
"""
# Read the CSV text into a DataFrame
df = pd.read_csv(StringIO(data), sep=',')
# Select only the descriptions whose Expression is 'Sad'
dfToList = df[df['Expression'] == 'Sad']['Description'].tolist()
# Count every space-separated word across the selected descriptions
words = dict(Counter(" ".join(dfToList).split(" ")))
print(words)

for key in words:
    # Apply whatever conditions you need here
    print(key, '->', words[key])

You can also use isin() to select multiple classes (Sad, Happy, etc.):

dfToList=df[df['Expression'].isin(['Sad', 'Happy'])]['Description'].tolist()

Output (for the 'Sad' selection above):

{'people': 2, 'are': 2, 'sad': 1, 'because': 2, 'they': 2, 'got': 1, 'no': 1, 'money.': 1, 'miserable': 1, 'broke': 1, 'up': 1}
people -> 2
are -> 2
sad -> 1
because -> 2
they -> 2
got -> 1
no -> 1
money. -> 1
miserable -> 1
broke -> 1
up -> 1
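
To get the nested {class: {word: count}} shape from the question in one pass, the same Counter idea can be wrapped in a function. This is a minimal sketch, not a definitive implementation: it reuses the question's `calculate_word_frequency` name and `social_df` variable, and it assumes the first return value is the total number of words counted, which the question does not spell out.

```python
from collections import Counter

import pandas as pd


def calculate_word_frequency(df, classes):
    """Return (total word count, {class: {word: count}}) for the given classes."""
    freqs = {}
    total = 0
    for cls in classes:
        counter = Counter()
        # Only the Description values of the rows belonging to this class
        for text in df.loc[df["Expression"] == cls, "Description"]:
            # split() with no argument splits on any run of whitespace
            counter.update(text.lower().split())
        freqs[cls.lower()] = dict(counter)
        total += sum(counter.values())
    return total, freqs


# Sample rows copied from the question (stand-in for the real 30,000-row frame)
social_df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
print(words)                                   # 24 on the four sample rows
print(freqs_per_expression["sad"]["people"])   # 2 on the four sample rows
```

Each description is visited once and Counter does the tallying, so there is no need to convert the whole DataFrame to a nested list or to grow a list while looping over it.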
GiovaniSalazar