I have the following dataframe:

import pandas as pd
array = {'id': [1, 1, 1, 2, 2, 2, 3, 3], 'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)
df

I want to represent each id in one row: if all the values for that id in a given column are False, the result should be False; if at least one is True, it should be True. I have started with:

df.groupby(['id']).sum()

After which I will convert every value above 0 to 1. This works fine, but my original dataframe has 2,000,000 rows and 14,000 columns, so it takes days.
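In full, the two-step version looks like this (just a sketch of the approach on the sample data):

```python
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

# step 1: sum the booleans per id (each True counts as 1)
summed = df.groupby(['id']).sum()

# step 2: convert every value above 0 to 1
result = summed.gt(0).astype(int)
```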

Any other quicker idea to do the task?

gtomer
    I would expect `df.groupby(['id']).any()` to work _much_ faster because `any()` is supposed to be short-circuited. Surprisingly, it is actually slower. – DYZ Jul 17 '21 at 20:40
  • @jezrael - Any ideas? Would value your help here! – gtomer Jul 18 '21 at 07:56

1 Answer


You can save the second step by taking the maximum instead of the sum over each group:

df.groupby(['id']).max()
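On the sample data from the question, this gives the desired one-row-per-id result in a single step (a quick sanity check, reusing the df defined there):

```python
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

result = df.groupby(['id']).max()
# id 1 -> A True, B True; id 2 -> A False, B True; id 3 -> A True, B False
```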

You might expect aggregating with any to be faster, since each sub-series would only need to be evaluated until the first True is reached, but as per DYZ's comment, the following is actually much slower:

df.groupby(['id']).agg(any)

So I would recommend also testing NumPy's any function, which might handle this better:

import numpy as np
df.groupby(['id']).agg(np.any)

Or you could do without agg, as suggested by Henry Ecker:

df.groupby(['id']).any()

However, none of these methods is as fast as one might expect. So maybe your best bet is to convert the dataframe to a NumPy integer array and do the grouping in pure NumPy. See this question.
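For example, here is a sketch of that pure-NumPy route using np.logical_or.reduceat; the grouping logic is my own illustration, not taken from the linked question:

```python
import numpy as np
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

ids = df['id'].to_numpy()
values = df.drop(columns='id').to_numpy()  # 2-D boolean array

# sort rows by id so each group forms a contiguous block
order = np.argsort(ids, kind='stable')
ids_sorted = ids[order]
values_sorted = values[order]

# positions where a new id block begins
starts = np.flatnonzero(np.r_[True, ids_sorted[1:] != ids_sorted[:-1]])

# logical OR over each block in one vectorized call
grouped = np.logical_or.reduceat(values_sorted, starts, axis=0)

result = pd.DataFrame(grouped,
                      index=pd.Index(ids_sorted[starts], name='id'),
                      columns=df.columns.drop('id'))
```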

Arne