Speeding up the Count of words in pandas datarame

Question

I have a column of a dataframe df['psych_prob'] and I have a dictionary containing different categories and their corresponding list of words. I have to count how many times the words belonging to each category in a column of Pandas dataframe appears.

The code below works for me but my real data set is more than 100k rows and my real categories are more than 40 and words in each category are more than 500. It takes me more than 1 hour to run the code. I am trying to optimize the speed of the following code.

dummy_dict={
            'psych_prob':
            ['he would be happy about it, but i am sad it does not make sense to her', 
               'how i can be happy i am anxious all the time my fathers death is here']
           }
df=pd.DataFrame(dummy_dict)

# dictionary containing categories and their list of words
category_dict={'pronouns':['he', "he'd", "he's", 'her', 'hers'],
               'neg_emo':['sad','anxious','angry'], 
               'pos_emo':['happy','excited']}

for category in category_dict:
    #join the list of words by space and pipe
    category_joined=' |'.join(e for e in category_dict[category]) 

    #count how many times the list of words belonging to the category appears in the dataframe
    category_count=df.psych_prob.str.count(category_joined).sum()

    print('Category:',category) 
    print('Words to search:',category_joined)
    print('Total words of this category in dataframe:', category_count)
    print('\n')

jezrael · Answer 1 · 2021-04-19T05:53:27.553

2

EDIT: Here is solution with word boundaries and joind values of column to one long sentence:

import re

#join outside loop for improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    
    pat = '|'.join(r"\b{}\b".format(x) for x in vals)
    category_count = len(re.findall(pat, str_vals))
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')

I think you can use Aho-Corasick frequency counter for improve performance with join by column for one long string:

#https://stackoverflow.com/a/51604049/2901002
def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

#join outside loop for improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    #join the list of words by space and pipe
    category_count = ac_frequency(vals, str_vals)
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')

edited Apr 19 '21 at 05:53

answered Apr 19 '21 at 05:04

jezrael

822,522
95
1,334
1,252

Thanks for the suggestion and the written code. However, it is giving me not the accurate answer. For eg; for pronoun it gives ['he', "he'd", "he's", 'her', 'hers'] Category: pronouns Total words of this category in dataframe: [5, 0, 0, 3, 1] it is supposed to be [1,0,0,2,1] – Noor Apr 19 '21 at 05:17
@Noor - Can you be more specific? – jezrael Apr 19 '21 at 05:18
I mean it is not doing exact matching of the words. It counts he 5 times, however, it appears 1 time only. It count her 3 times but it appears 2 times only. – Noor Apr 19 '21 at 05:21
@Noor - hmmm, one thing - in real categories are words with spaces? Like `very well` ? Or all words has no spaces like in sample data? – jezrael Apr 19 '21 at 05:23
yes, some words are and some words are not. For eg 'very' and 'very well' both code be in the real list of words of categories. – Noor Apr 19 '21 at 05:26
@Noor - Ok, so solution in question is also not accurate, because count `he` in word `here` ? – jezrael Apr 19 '21 at 05:27
I guess, the solution in question is accurate, because in that I joined the list with space and pip ' |'. So as a result it looks for 'he ' in here not 'he' in here, so it does not count 'he' in here – Noor Apr 19 '21 at 05:33
I thought about the solution in question, it is not correct. – Noor Apr 19 '21 at 05:52
@Noor - yes, need word boundaries – jezrael Apr 19 '21 at 05:52
@Noor - I add new soluton with word boundaries and joined values in column, I hope more accurate and faster. – jezrael Apr 19 '21 at 05:54
1

Sorry, you are right, my solution in question is not correct, becasue For eg: it will find 'hers ' in ' ...my fathers death...'. Even though it should not. Thanks for helping me understand my mistake – Noor Apr 19 '21 at 06:00
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231331/discussion-between-noor-and-jezrael). – Noor Apr 19 '21 at 16:53

Speeding up the Count of words in pandas datarame

1 Answers1