GroupBy and Apply (I think) is making Python Pandas code run very slowly

Question

My code is running really slowly. I have a dataset of about one hundred thousand rows with three columns: the name of the person who wrote the post (a name can occur many times, since one person may have written multiple posts), the post itself, and the "feeling" words extracted from that post. Feeling words are things like: ['happy', 'sad', 'delighted', 'excited', 'angry', 'disappointed', 'annoyed', 'disheartened', 'frightened', 'content', 'peaceful']. The list keeps going, but you get the point. Here's an example I made up:

   Message                                      Name   Feeling_words
0  I am really happy with my progress.          Alice  [happy]
1  I am really happy with John's progress.      Alice  [happy]
2  I was annoyed by his inconsideration.        John   [annoyed]
3  I felt proud after seeing her performance.   Lisa   [proud]
4  I am ecstatic after hearing the good news.   Alice  [ecstatic]
5  I felt disappointed by her dishonesty.       Lisa   [disappointed]
6  I was disheartened by their actions.         John   [disheartened]
7  I am delighted about the good news. I am     Lisa   [delighted, proud]
   proud to represent our entire community
   for this occasion.
.........
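
For anyone who wants to reproduce this, here is a minimal sketch that builds the made-up sample above as a DataFrame. The rows are copied from the table; the column name `Feeling_words` follows the code in the question, and the variable name `df` is the one that code uses.

```python
import pandas as pd

# Sample rows copied from the made-up table in the question.
# The real dataset has about 100,000 rows.
df = pd.DataFrame({
    "Message": [
        "I am really happy with my progress.",
        "I am really happy with John's progress.",
        "I was annoyed by his inconsideration.",
        "I felt proud after seeing her performance.",
        "I am ecstatic after hearing the good news.",
        "I felt disappointed by her dishonesty.",
        "I was disheartened by their actions.",
        "I am delighted about the good news. I am proud to represent "
        "our entire community for this occasion.",
    ],
    "Name": ["Alice", "Alice", "John", "Lisa", "Alice", "Lisa", "John", "Lisa"],
    # Each cell holds a Python list of the feeling words found in the post.
    "Feeling_words": [
        ["happy"], ["happy"], ["annoyed"], ["proud"],
        ["ecstatic"], ["disappointed"], ["disheartened"],
        ["delighted", "proud"],
    ],
})
```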

I am using the following code to find the most common feeling words for each name. However, it is really slow: I am running it in Jupyter, and after about 30 minutes it still has not finished:

from collections import Counter

# group all feeling words said by each name (using Counter)
df.groupby('Name')['Feeling_words'].sum().apply(Counter)

# find the most common feeling word per name
df.groupby('Name')['Feeling_words'].sum().apply(
    lambda feel: Counter(feel).most_common(1))

# find the total number of feeling words per name
df.groupby('Name')['Feeling_words'].sum().apply(len)

What specifically makes this so slow: is it the apply(), the groupby(), or something else? Any suggestions for improving the run time while keeping the same functionality would be greatly appreciated. Again, I want to a) group all the feeling words said by Alice, John, and so on, b) find the most frequently occurring feeling word for each name, and c) count the total number of feeling words for each name. I am fairly new to this, so I am unsure of other approaches. Thanks in advance!
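
One likely culprit is the `.sum()` over a column of lists rather than `apply()` itself: summing lists concatenates them pairwise, so the growing list is copied again for every row in a group, and the same expensive sum is repeated in all three statements. The sketch below is one way to get a), b) and c) in a single pass with `Counter.update`. It is an illustration under assumptions, not a measured fix: the small `df` is a stand-in for the real data, and the names `counters`, `most_common` and `totals` are hypothetical.

```python
from collections import Counter, defaultdict

import pandas as pd

# Stand-in for the real data; only the two columns used here.
df = pd.DataFrame({
    "Name": ["Alice", "Alice", "John", "Lisa", "Alice", "Lisa", "John", "Lisa"],
    "Feeling_words": [
        ["happy"], ["happy"], ["annoyed"], ["proud"],
        ["ecstatic"], ["disappointed"], ["disheartened"],
        ["delighted", "proud"],
    ],
})

# a) one Counter per name, filled in a single pass over the rows.
#    Counter.update adds the words in place, so no list is ever copied.
counters = defaultdict(Counter)
for name, words in zip(df["Name"], df["Feeling_words"]):
    counters[name].update(words)

# b) most common feeling word per name, as a list with one (word, count) pair
most_common = {name: c.most_common(1) for name, c in counters.items()}

# c) total number of feeling words per name
totals = {name: sum(c.values()) for name, c in counters.items()}
```

The three results come back as plain dictionaries keyed by name; `pd.Series(totals)` would turn one back into a pandas object if that is needed downstream.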

It's slow b/c you are using `apply/lambda` instead of a pandas method. Probably just replace with `value_counts().head(1)`? Btw, it's much better if you can provide a small sample dataframe rather than describing it. — JohnE, Mar 05 '17 at 14:37
@JohnE I added a small sample dataframe to give a better idea. Hopefully that helps! Would you mind also providing sample code so I can better understand what you are saying? As I mentioned I am sort of unfamiliar with Pandas. I would like to a) group all the feeling words said by Alice, John, and so on b) find the maximum occurring feeling word for each name and c) count the total number of feeling words for each name. Thank you for your help :) — Jane Sully, Mar 05 '17 at 18:21
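
A minimal sketch of what the `value_counts().head(1)` suggestion could look like in practice. It assumes pandas 0.25 or later, since it relies on `DataFrame.explode` to turn each list of words into one row per word; the small `df` is a stand-in for the real data, and the names `words`, `counts`, `most_common` and `totals` are hypothetical.

```python
import pandas as pd

# Stand-in for the real data; only the two columns used here.
df = pd.DataFrame({
    "Name": ["Alice", "Alice", "John", "Lisa", "Alice", "Lisa", "John", "Lisa"],
    "Feeling_words": [
        ["happy"], ["happy"], ["annoyed"], ["proud"],
        ["ecstatic"], ["disappointed"], ["disheartened"],
        ["delighted", "proud"],
    ],
})

# One row per (name, word) pair instead of one list per post.
# Posts with an empty list become NaN, which the steps below ignore.
words = df.explode("Feeling_words")

# a) count of each feeling word per name, sorted descending within each name
counts = words.groupby("Name")["Feeling_words"].value_counts()

# b) most common feeling word per name: the first row of each name's counts
most_common = counts.groupby(level="Name").head(1)

# c) total number of feeling words per name
totals = words.groupby("Name")["Feeling_words"].count()
```

When two words tie for first place (John's two words each occur once in the sample), `head(1)` simply keeps whichever row comes first, so ties would need separate handling if they matter.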
