Count the number of word occurrences from a Pandas Df in Python

Question

This is a sample of my pandas DataFrame, which contains 30,000 rows (excluding the column headers). The Expression column has two classes, Sad and Happy.

Expression              Description
Sad                     "people are sad because they got no money."
Happy                   "people are happy because ..."
Sad                     "people are miserable because they broke up"
Happy                   "They got good money"

Based on the example above, I would like to count word frequencies, i.e. the number of times each word occurs in the descriptions of the "Sad" and "Happy" expressions, and store the result in a dictionary, e.g. {sad:{people:2}, happy:{happy:1}}

This is my code:

def calculate_word_frequency(lst, classes):
    # variables
    wordlist = []
    dict_output = {}
    count = 0
    term = ""

    data = [lst.columns.values.tolist()] + lst.values.tolist()  # to convert into a list

    for i in range(1, len(data)):
        if data[i][0] == classes[0]:
            wordlist = data[i][1].lower().split(" ")

            for words in wordlist:
                wordlist.append(words)

                for word in wordlist:
                    if word in dict_output:
                        dict_output[wordlist] += 1
                    else:
                        dict_output[wordlist] == 1
                        print(dict_output)

Expected output would be based on the number of words appearing in each Expression respectively.

# Test case:
words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
# output: 538212

print(freqs_per_expression["sad"]["people"])  # output: 203

Because of the size of the dataset, my VS frequently hangs and lags, so I am unable to retrieve any results. Are there better techniques I can use to get the {word: count} data I want?

Thank you!

Does this answer your question? [what is the most efficient way of counting occurrences in pandas?](https://stackoverflow.com/questions/20076195/what-is-the-most-efficient-way-of-counting-occurrences-in-pandas) — PacketLoss, Dec 17 '19 at 01:04
@aws_apprentice do you mean that I should combine all my rows into one single list so that I can do a value count? — yan, Dec 17 '19 at 01:19
@yan you want an output like this ? {'people': 3, 'are': 3, 'sad': 1, 'because': 3, 'they': 2, 'got': 2, 'no': 1, 'money.': 1, 'happy': 1} — GiovaniSalazar, Dec 17 '19 at 01:34
Using `.values` is discouraged, and `.values.tolist()` is just unnecessary. — AMC, Dec 17 '19 at 03:58
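
A minimal sketch of what AMC's comment suggests: iterate over the two columns directly instead of converting the frame with `.values.tolist()`, and keep one `Counter` per class. It assumes that `words` in the test case is meant to be the total number of words counted and that the class keys are lowercased, as in `["sad"]["people"]`. The sample frame is rebuilt from the four rows shown in the question, so the numbers differ from the 30,000-row data.

```python
from collections import Counter

import pandas as pd


def calculate_word_frequency(df, classes):
    """Return (total_words, {class_lower: {word: count}}) for the given classes."""
    # One Counter per requested class, keyed by the lowercased class name.
    freqs = {cls.lower(): Counter() for cls in classes}
    # Walk the two columns in parallel; no list conversion of the whole frame.
    for expression, description in zip(df["Expression"], df["Description"]):
        key = expression.lower()
        if key in freqs:
            freqs[key].update(description.lower().split())
    total_words = sum(sum(counter.values()) for counter in freqs.values())
    return total_words, {key: dict(counter) for key, counter in freqs.items()}


# Sample rows copied from the question (not the full dataset).
social_df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
print(words)
print(freqs_per_expression["sad"]["people"])
```

Each description is split exactly once and the counters are updated in place, so the work grows linearly with the number of words.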

score 0 · Accepted Answer · answered Dec 17 '19 at 02:04

We can achieve the desired result in a few steps. If you are using pandas >= 0.25, you can use the new `explode` function; otherwise, this solution will achieve what you want.

from collections import defaultdict

import pandas as pd

exploded = df.set_index('Expression') \
             .stack() \
             .str.split(' ', expand=True) \
             .stack() \
             .reset_index() \
             .drop(['level_1', 'level_2'], axis=1) \
             .rename(columns={0: 'Word'})

print(exploded)

   Expression       Word
0         Sad     people
1         Sad        are
2         Sad        sad
3         Sad    because
4         Sad       they
...

counts = exploded.groupby('Expression')['Word'].value_counts() \
                 .rename('Count').reset_index().to_dict('records')

d = defaultdict(dict)

for rec in counts:
    key = rec.get('Expression')
    word = rec.get('Word')
    count = rec.get('Count')
    d[key].update({word: count})

print(d)

defaultdict(dict,
            {'Happy': {'...': 1,
              'They': 1,
              'are': 1,
              'because': 1,
              'good': 1,
              'got': 1,
              'happy': 1,
              'money': 1,
              'people': 1},
             'Sad': {'are': 2,
              'because': 2,
              'broke': 1,
              'got': 1,
              'miserable': 1,
              'money.': 1,
              'no': 1,
              'people': 2,
              'sad': 1,
              'they': 2,
              'up': 1}})
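
For pandas >= 0.25, the `explode` route mentioned at the top of this answer might look like the sketch below. It is not part of the original answer: the sample frame is rebuilt from the four rows in the question, the split is on any whitespace rather than a single space, and the names `exploded`, `counts`, and `result` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

# Split each description into a list of words, then give every word its own row.
exploded = (
    df.assign(Word=df["Description"].str.split())
      .explode("Word")[["Expression", "Word"]]
)

# Count words within each Expression; the result is indexed by (Expression, Word).
counts = exploded.groupby("Expression")["Word"].value_counts()

# Fold the (Expression, Word) -> count pairs into a nested dictionary.
result = {}
for (expression, word), count in counts.items():
    result.setdefault(expression, {})[word] = int(count)

print(result)
```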

Wouldn't it be easier to just put your conditions (Happy and Sad) in your groupby? — GiovaniSalazar, Dec 17 '19 at 02:17
@GiovaniSalazar `groupby` takes index values (amongst others, but they’re all related to the index). — AMC, Dec 17 '19 at 04:02

GiovaniSalazar · Answer 2 · 2019-12-17T02:11:11.543

-2

Here is an example; maybe it helps you complete your code:

from collections import Counter
from io import StringIO
import pandas as pd

data = """
Expression,Description
Sad,"people are sad because they got no money."
Happy,"people are happy because ..really."
Sad,"people are miserable because they broke up"
Happy,"They got good money"
"""
# Read the CSV data
df = pd.read_csv(StringIO(data), sep=',')
# Select only the rows where Expression == 'Sad'
dfToList = df[df['Expression'] == 'Sad']['Description'].tolist()
# Count every word across the selected descriptions
words = dict(Counter(" ".join(dfToList).split(" ")))
print(words)

for key in words:
    # Apply whatever conditions you need here
    print(key, '->', words[key])

You can also use `isin()` for multiple conditions (Happy, Bad, etc.):

dfToList=df[df['Expression'].isin(['Bad', 'Happy'])]['Description'].tolist()

Output:

{'people': 2, 'are': 2, 'sad': 1, 'because': 2, 'they': 2, 'got': 1, 'no': 1, 'money.': 1, 'miserable': 1, 'broke': 1, 'up': 1}
people -> 2
are -> 2
sad -> 1
because -> 2
they -> 2
got -> 1
no -> 1
money. -> 1
miserable -> 1
broke -> 1
up -> 1
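
To cover any number of expressions without repeating the filter for each one, one option is to group by `Expression` and run a `Counter` per group. This is a minimal sketch rather than part of the original answer: it assumes the same four sample rows as above, splits on any whitespace, and `word_counts` is an illustrative name.

```python
from collections import Counter

import pandas as pd

# Same sample rows as the CSV string in this answer.
df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ..really.",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

# Iterating a grouped column yields (group name, Series of descriptions) pairs.
# Join each group's descriptions into one string and count its words.
word_counts = {
    expression: dict(Counter(" ".join(descriptions).split()))
    for expression, descriptions in df.groupby("Expression")["Description"]
}

print(word_counts["Sad"])
print(word_counts["Happy"])
```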

edited Dec 17 '19 at 02:11

answered Dec 17 '19 at 01:47


you also have to count the other expression, `Happy` – gold_cy Dec 17 '19 at 01:59
@aws_apprentice in the line `dfToList=df[df['Expression']=='Sad']['Description'].tolist()`, just change Sad to Happy – GiovaniSalazar Dec 17 '19 at 02:00
what if you have `n` number of Expressions, not very effective – gold_cy Dec 17 '19 at 02:01
@aws_apprentice do you have any ideas? I hope so – GiovaniSalazar Dec 17 '19 at 02:03
I do, I just posted my answer – gold_cy Dec 17 '19 at 02:05
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204378/discussion-between-giovanisalazar-and-aws-apprentice). – GiovaniSalazar Dec 17 '19 at 02:18
People don’t need to have a complete solution in order to ask questions about other answers, nor do they need to know everything. The question by @aws_apprentice was perfectly fine, and is actually a crucial one. – AMC Dec 17 '19 at 04:00
