I have the following dataframe:

import pandas as pd
array = {'id': [1, 1, 1, 2, 2, 2, 3, 3], 'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)
df

I want to represent each id in one row: if all the values for that id in a given column are False, the result should be False; if at least one is True, it should be True. I have started with:

df.groupby(['id']).sum()

After which I will convert every value above 0 to 1. This works fine, but my original dataframe has 2,000,000 rows and 14,000 columns, so it takes days.
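In full, the two-step version looks like this (just a sketch of the approach on the sample data):

```python
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

# step 1: sum the booleans per id (each True counts as 1)
summed = df.groupby(['id']).sum()

# step 2: convert every value above 0 to 1
result = summed.gt(0).astype(int)
```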

Any other quicker idea to do the task?

gtomer
    I would expect `df.groupby(['id']).any()` to work _much_ faster because `any()` is supposed to be short-circuited. Surprisingly, it is actually slower. – DYZ Jul 17 '21 at 20:40
  • @jezrael - Any ideas? Would value your help here! – gtomer Jul 18 '21 at 07:56

1 Answer


You can save the second step by taking the maximum instead of the sum over each group:

df.groupby(['id']).max()
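On the sample data from the question, this gives the desired one-row-per-id result in a single step (a quick sanity check, reusing the df defined there):

```python
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

result = df.groupby(['id']).max()
# id 1 -> A True, B True; id 2 -> A False, B True; id 3 -> A True, B False
```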

You might expect aggregating with any to be faster, since each sub-series would only need to be evaluated until the first True is reached, but as per DYZ's comment, the following is actually much slower:

df.groupby(['id']).agg(any)

So I would recommend also testing NumPy's any function, which might handle this better:

import numpy as np
df.groupby(['id']).agg(np.any)

Or you could do without agg, as suggested by Henry Ecker:

df.groupby(['id']).any()

However, none of these methods is as fast as one might expect. So maybe your best bet is to convert the dataframe to a NumPy integer array and do the grouping in pure NumPy. See this question.
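For example, here is a sketch of that pure-NumPy route using np.logical_or.reduceat; the grouping logic is my own illustration, not taken from the linked question:

```python
import numpy as np
import pandas as pd

array = {'id': [1, 1, 1, 2, 2, 2, 3, 3],
         'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)

ids = df['id'].to_numpy()
values = df.drop(columns='id').to_numpy()  # 2-D boolean array

# sort rows by id so each group forms a contiguous block
order = np.argsort(ids, kind='stable')
ids_sorted = ids[order]
values_sorted = values[order]

# positions where a new id block begins
starts = np.flatnonzero(np.r_[True, ids_sorted[1:] != ids_sorted[:-1]])

# logical OR over each block in one vectorized call
grouped = np.logical_or.reduceat(values_sorted, starts, axis=0)

result = pd.DataFrame(grouped,
                      index=pd.Index(ids_sorted[starts], name='id'),
                      columns=df.columns.drop('id'))
```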

Arne