
I have a dataframe, something like:

index     name     message_counter
1         AA       Counter({'hello':1})
2         BB       Counter({'how':1, 'are':1, 'you':1})
3         BB       Counter({'how':1})
4         AA       Counter({'hello':1})
5         CC       Counter({'hello':1})

I want a sum of all the counters from each unique name. So I did:

df.groupby('name')['message_counter'].sum()

and got the right answer, something like:

name
AA            {'hello':2}
BB            {'how':2, 'are':1, 'you':1}
CC            {'hello':1}
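
For reference, here is a minimal self-contained snippet that reproduces this; the toy data just mirrors the table above:

import pandas as pd
from collections import Counter

# toy data mirroring the table above
df = pd.DataFrame({
    'name': ['AA', 'BB', 'BB', 'AA', 'CC'],
    'message_counter': [Counter({'hello': 1}),
                        Counter({'how': 1, 'are': 1, 'you': 1}),
                        Counter({'how': 1}),
                        Counter({'hello': 1}),
                        Counter({'hello': 1})],
})

# Counter supports +, so summing the column per group gives merged counts
print(df.groupby('name')['message_counter'].sum())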

But it was surprisingly slow on my data set. It's grouping by 6 unique names and summing 33,000 Counters (the number of rows in my data frame), which is not that much, but it took far longer than I expected: 50+ seconds, while the rest of my ~180-line script doesn't take nearly that long.

What am I doing wrong? How can I improve this?

sheldonzy

1 Answer


Try a slightly improved version of this solution:

from collections import defaultdict

def dsum(*dicts):
    ret = defaultdict(int)
    # each positional argument is an iterable of dicts/Counters;
    # .agg passes the whole group as one Series, hence the double loop
    for x in dicts:
        for d in x:
            for k, v in d.items():
                ret[k] += v
    return dict(ret)

df1 = df.groupby('name')['message_counter'].agg(dsum)
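
If you want to stay with Counter, a roughly equivalent sketch (csum is just an illustrative name, not part of the original code) merges each group into a single Counter with update, which avoids building a new Counter object for every addition the way sum() does:

from collections import Counter

def csum(counters):
    # merge every Counter in the group into one, in place
    total = Counter()
    for c in counters:
        total.update(c)   # update() adds counts instead of replacing them
    return dict(total)

df1 = df.groupby('name')['message_counter'].agg(csum)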
jezrael
  • Wow, yeah, it performs a lot better. Any idea why? – sheldonzy Mar 05 '18 at 07:51
  • @sheldonzy - In my opinion, `pandas` is not very fast with non-scalar values, because it doesn't support them natively (some functions even fail with them), so it's better to use pure Python. – jezrael Mar 05 '18 at 07:53