Searching word frequency in pandas from dict

Question

Here is the code which I am using:

import pandas as pd

data = [['This is a long sentence which contains a lot of words among them happy', 1],
       ['This is another sentence which contains the word happy* with special character', 1],
       ['Content and merry are another words which implies happy', 2],
       ['Sad is not happy', 2],
       ['unfortunate has negative connotations', 1]]
df = pd.DataFrame(data, columns=['string', 'id'])
words = {
    "positive" : ["happy", "content"],
    "negative" : ["sad", "unfortunate"],
    "neutral" : ["neutral", "000"]
    }

I want to search the strings in the dataframe for the words listed under each dictionary key, but each key may be counted only once per id.

Simply put:

Group by id.
For each group: see if at least one word in all sentences of a group is positive, negative and neutral.
Then sum up the counts for all groups.

For example.

   string                                             id
0  This is a long sentence which contains a lot o...   1
1  This is another sentence which contains the wo...   1
2  Content and merry are another words which impl...   2
3  Sad is not happy                                    2
4  unfortunate has negative connotations               1

Rows 0 and 1 both belong to id 1, and both contain a value listed under the key positive, so positive is counted only once for id 1. The last row also belongs to id 1 and contains the word "unfortunate", so negative is counted once as well.

For id 1

positive : 1

negative : 1

neutral : 0

After all ids are summed up, the final dataframe should look like this:

word        freq
positive     2
negative     2
neutral      0

Could you please advise how this can be accomplished in pandas?

It is not clear how you get your expected result. It may be helpful if you make your example smaller and easier to understand – Vladimir Fokow, Aug 31 '22 at 10:56
Do I understand correctly? Group by id. For each group see if at least one word in all sentences of a group is positive, negative and neutral. Then sum up the counts for all groups. – Vladimir Fokow, Aug 31 '22 at 11:01
@VladimirFokow Yes you are right, group by ID then sum the counts for all the groups. Thanks – Slartibartfast, Aug 31 '22 at 11:06
So the case doesn't matter? `sad` must match `Sad` and `sAd`, etc., correct? – Vladimir Fokow, Aug 31 '22 at 11:17

Lucas M. Uriarte · Answer 1 · 2022-08-31T11:48:28.807

The following code should do the job, although it does not rely entirely on pandas. Note that I use phrase.lower() so that the matching is case-insensitive.

from collections import Counter

# Collect all sentences of each id into one list per group
out = df.groupby("id")['string'].apply(list)

def get_count(grouped_element):
    # Each category is counted at most once per group
    counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
    words = {
        "positive" : ["happy", "content"],
        "negative" : ["sad", "unfortunate"],
        "neutral" : ["neutral", "000"]
        }
    for phrase in grouped_element:
        if counter["positive"] < 1:
            for word in words["positive"]:
                if word in phrase.lower():
                    counter.update(["positive"])
                    break
        if counter["negative"] < 1:
            for word in words["negative"]:
                if word in phrase.lower():
                    counter.update(["negative"])
                    break
        if counter["neutral"] < 1:
            for word in words["neutral"]:
                if word in phrase.lower():
                    counter.update(["neutral"])
                    break
    return counter

# Add up the per-group counts
counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
for phrases in out:
    result = get_count(phrases)
    counter.update(result)

print(counter)

Output:

Counter({'positive': 2, 'negative': 2, 'neutral': 0})

To convert to a dataframe:

out = {"word": [], "freq": []}
for key, val in counter.items():
    out["word"].append(key)
    out["freq"].append(val)
pd.DataFrame(out)

       word  freq
0  positive     2
1  negative     2
2   neutral     0

Vladimir Fokow · Accepted Answer · 2022-08-31T17:15:27.667

This is efficient because any() short circuits (stops evaluation at the first value that matches).

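# Join all sentences of each id into one string per group
texts = df.groupby('id')[['string']].agg(lambda x: ' '.join(x))

# One boolean column per key: True if any of its words occurs in the group's text
for k, v in words.items():
    texts[k] = texts['string'].transform(
        lambda text: any(word.lower() in text.lower() for word in v)
        )

# Summing the boolean columns counts the groups in which each key occurs
result = texts[words.keys()].sum(axis=0)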

result is a Series:

positive    2
negative    2
neutral     0
dtype: int64

You can convert it to a DataFrame like this:

result_df = result.to_frame().reset_index().set_axis(['word', 'freq'], axis=1)

       word  freq
0  positive     2
1  negative     2
2   neutral     0
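
The short-circuiting claim at the top of this answer can be checked in isolation. The following is only a small sketch: `contains_logged` and the `checked` list are hypothetical helpers introduced here for illustration, not part of the answer's code. The helper records every word it is asked about, so the list shows how many words `any()` actually evaluated.

```python
def contains_logged(word, text, log):
    """Case-insensitive substring check that records each word it is asked about."""
    log.append(word)
    return word.lower() in text.lower()

text = "Sad is not happy"
checked = []  # hypothetical log of the words that any() actually evaluated

# "happy" matches first, so any() never asks about "content"
found = any(contains_logged(w, text, checked) for w in ["happy", "content"])

print(found, checked)
```

Because a generator expression is passed to `any()`, the remaining words are never tested once one of them matches. Passing a list comprehension instead would evaluate every word before `any()` sees the first result.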

edited Aug 31 '22 at 17:15

answered Aug 31 '22 at 11:26

Vladimir Fokow

What if the id is not a number but a string, or a combination of string and number? How can you groupby then? – Slartibartfast Aug 31 '22 at 14:33
Why wouldn't it work? I think there would be no problem. Groupby works by grouping the same elements together. If you need to group by a custom rule (for example, you want to split or somehow transform the `id` first), this would be a candidate for another question – Vladimir Fokow Aug 31 '22 at 14:45
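
As a quick check of the point that groupby only needs equal elements, here is a minimal sketch with string ids. The frame `df_str` and its values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical data: the ids are strings, some of which mix letters and digits
df_str = pd.DataFrame({
    "string": ["I am happy", "so sad", "content today", "nothing here"],
    "id": ["user_a", "user_a", "7b", "7b"],
})

# Rows with equal ids end up in the same group, whatever the id's type
texts = df_str.groupby("id")["string"].agg(" ".join)
print(texts)
```

This covers only the case where every id is a string. A column that mixes Python ints and strs in the same frame is a different situation and is not tested here.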
I have added an edit to the question, where I have outlined why I am not able to groupby. – Slartibartfast Aug 31 '22 at 16:22
@Slartibartfast ok, I see. Since you have many columns in your dataframe (some of which are numeric), sum defaults to summing up only the numeric ones and ignores all others. I've updated my answer - added `[['string']]` to the first line. You should replace `'string'` with `'Tweet'` (in 2 places in my code) – Vladimir Fokow Aug 31 '22 at 16:36
and also replace `'id'` with `'User'` – Vladimir Fokow Aug 31 '22 at 16:44
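
A minimal sketch of the renaming described in the two comments above. The column names `User` and `Tweet` come from the comments; the `Likes` column and all values are assumptions made up for illustration. Selecting `[['Tweet']]` before aggregating means the other columns never reach the aggregation.

```python
import pandas as pd

# Hypothetical frame with an extra numeric column next to the text column
tweets = pd.DataFrame({
    "User": ["ann", "ann", "bob"],
    "Tweet": ["so happy", "a bit sad", "content"],
    "Likes": [3, 5, 1],  # assumed extra column, not from the question
})

# Only the selected text column is joined; "Likes" is left out entirely
texts = tweets.groupby("User")[["Tweet"]].agg(lambda x: " ".join(x))
print(texts)
```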
I have one more follow-up. What can I change so that it does not recognize partial matches? For example, it should not find "tree" if the word in the text is "trees", because I get a lot of false positives that way. – Slartibartfast Aug 31 '22 at 16:54
@Slartibartfast I've replaced `sum` with `lambda x: ' '.join(x)` so that the strings from different rows are separated by a space (and not just squished together with no space in between) – Vladimir Fokow Aug 31 '22 at 17:15
@Slartibartfast Now you can modify `word.lower() in text.lower()` using regular expressions, as explained in [this answer](https://stackoverflow.com/a/5320179/14627505) to check if a word is in a string (and not just any substring) – Vladimir Fokow Aug 31 '22 at 17:16
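
One possible way to apply the regular-expression suggestion is sketched below. The helper `has_whole_word` is a hypothetical name, and the small frame and `words` dict are made up for illustration. It assumes every search word starts and ends with a letter, digit or underscore, since that is what the `\b` word boundary relies on.

```python
import re
import pandas as pd

def has_whole_word(word, text):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    # re.escape guards against regex metacharacters inside the search word
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical data: only the second row contains "tree" as a whole word
df = pd.DataFrame({
    "string": ["Two trees", "One tree here", "Treetop view"],
    "id": [1, 2, 3],
})
words = {"plant": ["tree"]}

texts = df.groupby("id")[["string"]].agg(lambda x: " ".join(x))
for k, v in words.items():
    # v=v binds the current word list to the lambda
    texts[k] = texts["string"].apply(
        lambda text, v=v: any(has_whole_word(word, text) for word in v)
    )

result = texts[list(words)].sum(axis=0)
print(result)
```

With plain substring matching all three rows would count, because "tree" is contained in "trees" and "Treetop". A word followed by punctuation, such as `happy*` in the question's data, still matches, because `*` is not a word character.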
