I have a dataframe column df['psych_prob'] and a dictionary that maps each category to a list of words. I need to count how many times the words belonging to each category appear in that column.
The code below works, but my real dataset has more than 100k rows, more than 40 categories, and more than 500 words per category, and it takes over an hour to run. I am trying to optimize the speed of the following code.
import pandas as pd

# dummy data: one text column to search
dummy_dict = {
    'psych_prob':
        ['he would be happy about it, but i am sad it does not make sense to her',
         'how i can be happy i am anxious all the time my fathers death is here']
}
df = pd.DataFrame(dummy_dict)

# dictionary containing categories and their list of words
category_dict = {'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
                 'neg_emo': ['sad', 'anxious', 'angry'],
                 'pos_emo': ['happy', 'excited']}
for category in category_dict:
    # join the list of words by space and pipe into one regex pattern
    category_joined = ' |'.join(e for e in category_dict[category])
    # count how many times the words of this category appear in the dataframe
    category_count = df.psych_prob.str.count(category_joined).sum()
    print('Category:', category)
    print('Words to search:', category_joined)
    print('Total words of this category in dataframe:', category_count)
    print('\n')
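One direction I have been considering (rough sketch, the variable names are my own) is to tokenize the column once, count every token with collections.Counter, and then sum the counts of each category's words, instead of running one str.count regex per category. I am not sure it is equivalent to the code above, since it matches whole tokens rather than substrings followed by a space, so the numbers may differ slightly:

import re
from collections import Counter

# concatenate the column and count every token once
all_text = ' '.join(df['psych_prob'])
token_counts = Counter(re.findall(r"[\w']+", all_text.lower()))

for category, words in category_dict.items():
    # sum the precomputed counts of this category's words
    category_count = sum(token_counts[w] for w in words)
    print('Category:', category)
    print('Total words of this category in dataframe:', category_count)

Is this a reasonable way to speed it up, or is there a faster, more idiomatic pandas approach?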