
Ultimate Question

Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?

Input

pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])
   id  begin_date status
0   1  2020-02-01      a
1   1  2020-02-10      b
2   1  2020-02-17      c
3   2  2020-02-02      d
4   2  2020-03-06      b
5   2  2020-04-17      c

Desired Output

   id status  count  uniquecount
0   1      a      1            1
1   1      b      1            1
2   1      c      1            1
3   2      b      1            1
4   2      c      1            1
5   2      d      1            1

Problem

Now, there is an easy way to do that in Python, using Pandas.

df = df.groupby(["id", "status"]).agg(
    count=("begin_date", "count"),
    uniquecount=("begin_date", lambda x: x.nunique()),
).reset_index()
# As commented, replacing the lambda with ("begin_date", "nunique") will be faster. Thanks!

This operation is slow for larger datasets; I'd guess it's roughly O(n²).
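
As the comments below point out, the lambda is the main culprit, since it forces a Python-level loop over the groups. A minimal sketch of the same aggregation using only the built-in string aggregations (still pd.groupby, just without the lambda):

df = df.groupby(["id", "status"]).agg(
    count=("begin_date", "count"),
    uniquecount=("begin_date", "nunique"),  # built-in aggregation, no Python-level lambda
).reset_index()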

Existing solutions that lack the desired general applicability

After some googling, there are alternative solutions on StackOverflow, using numpy, iterrows, or other approaches.

Faster alternative to perform pandas groupby operation

Pandas fast weighted random choice from groupby

And an excellent one:

Groupby in python pandas: Fast Way

These solutions generally aim to create the "count" or "uniquecount" from my example, i.e. the aggregated value. But unfortunately they always perform only a single aggregation, and not with multiple groupby columns. They also never explain how to merge the result back into the grouped dataframe.

Is there a way to use itertools (like this answer: Faster alternative to perform pandas groupby operation, or even better this answer: Groupby in python pandas: Fast Way) that does not only return the series "count", but the whole dataframe in grouped form?

Ultimate Question

Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?

It would look something like this:

from typing import Dict, List, Tuple

def fastGroupby(df, groupbyColumns: List[str], aggregateColumns: Dict[str, Tuple[str, str]]):
    # numpy / iterrows magic
    return df_grouped

df = fastGroupby(df, ["id", "status"],
                 {'count': ('begin_date', 'count'),
                  'uniquecount': ('begin_date', 'nunique')})

And return the desired output.
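
For illustration, here is a rough numpy/factorize-based sketch of such a function, restricted to the two aggregations from the example (row count and distinct count of a single column), so the aggregateColumns dict above is simplified to one column name. The fastGroupby name, the aggregateColumn parameter, and the no-NaN / no-overflow assumptions are mine; this is a sketch, not a vetted general replacement for agg:

import numpy as np
import pandas as pd

def fastGroupby(df, groupbyColumns, aggregateColumn):
    """Count rows and distinct values of aggregateColumn per group,
    using factorize/np.unique instead of DataFrame.groupby.
    Assumes no missing values in the key columns and that the product
    of key cardinalities fits in int64."""
    n = len(df)
    # Combine the key columns into a single integer key per row (mixed-base encoding).
    combined = np.zeros(n, dtype=np.int64)
    for col in groupbyColumns:
        codes, uniques = pd.factorize(df[col].to_numpy())
        combined = combined * len(uniques) + codes
    # Per group: index of first occurrence, per-row group id, and row count.
    _, first_idx, group_of_row, counts = np.unique(
        combined, return_index=True, return_inverse=True, return_counts=True)
    # Distinct values per group = number of unique (group, value) pairs.
    val_codes, val_uniques = pd.factorize(df[aggregateColumn].to_numpy())
    pair = group_of_row.astype(np.int64) * len(val_uniques) + val_codes
    group_of_pair = np.unique(pair) // len(val_uniques)
    uniquecount = np.bincount(group_of_pair, minlength=len(counts))
    # Rebuild the key columns from the first row of each group.
    # Note: the output order may differ from a sorted pd.groupby result.
    out = {col: df[col].to_numpy()[first_idx] for col in groupbyColumns}
    out['count'] = counts
    out['uniquecount'] = uniquecount
    return pd.DataFrame(out)

df_grouped = fastGroupby(df, ["id", "status"], "begin_date")

Whether this actually beats a lambda-free pd.groupby is not a given; the answer below suggests measuring that first.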

  • The `lambda` is killing you. That forces it to turn into a slow loop over the groups. `uniquecount=('status', 'nunique')` will probably improve the speed by orders of magnitude. Then you can further add `sort=False` to the groupby call, which will cause the output to be unsorted, but will greatly improve speed for many groups – ALollz Aug 07 '20 at 17:55
  • Thanks for the comments, you are absolutely correct about the lambda @ALollz, and I indeed made a mistake in the example; the specific column I am aggregating on doesn't really matter, but I will fix it. Thank you! – Dustin Aug 07 '20 at 18:04

1 Answer


Before ditching groupby I'd suggest first evaluating whether you are truly taking advantage of what groupby has to offer.

Do away with lambda in favor of built-in pd.DataFrameGroupBy methods.

Many of the Series and DataFrame methods are implemented as pd.DataFrameGroupBy methods. You should use those directly, as opposed to calling them via groupby + apply(lambda x: ...).
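
For instance, assuming the question's df and columns, the difference looks like this:

# Direct GroupBy method: uses the optimized built-in implementation
df.groupby(["id", "status"])["begin_date"].nunique()

# Same result via apply + lambda: a slow Python-level loop over the groups
df.groupby(["id", "status"])["begin_date"].apply(lambda x: x.nunique())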

Further, for many calculations you can re-frame the problem as some vectorized operation on an entire DataFrame that then uses a groupby method implemented in cython. This will be fast.

A common example of this would be finding the proportion of 'Y' answers within a group. A straight-forward approach would be to check the condition within each group then get the proportion:

import numpy as np
import pandas as pd

N = 10**6
df = pd.DataFrame({'grp': np.random.choice(range(10000), N),
                   'answer': np.random.choice(['Y', 'N'], N)})

df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())

Thinking about the problem this way requires the lambda, because we do two operations within the groupby: check the condition, then average. The exact same calculation can instead be thought of as first checking the condition on the entire DataFrame, then calculating the average within each group:

df['answer'].eq('Y').groupby(df['grp']).mean()

This is a very minor change yet the consequences are huge, and the gains will become greater as the number of groups increases.

%timeit df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
#2.32 s ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['answer'].eq('Y').groupby(df['grp']).mean()
#82.8 ms ± 995 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Add sort=False as an argument

By default groupby sorts the output on the keys. If there is no reason to have a sorted output, you can get a slight gain by specifying sort=False.
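
For the question's aggregation that would look like this (assuming the question's df; not benchmarked here):

df.groupby(["id", "status"], sort=False).agg(
    count=("begin_date", "count"),
    uniquecount=("begin_date", "nunique"),
).reset_index()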


Add observed=True as an argument

If the grouping keys are categorical, groupby will reindex the output to all possible key combinations, even for combinations that never appear in your DataFrame. If those are not important, removing them from the output with observed=True will greatly improve speed.
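
A small sketch, assuming for illustration that the question's key columns are converted to categorical dtype (they are plain int/object in the original):

df["id"] = df["id"].astype("category")
df["status"] = df["status"].astype("category")

# Only observed (id, status) combinations appear in the result
df.groupby(["id", "status"], observed=True).agg(
    count=("begin_date", "count"),
    uniquecount=("begin_date", "nunique"),
).reset_index()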


For your example we can examine the difference. There's an enormous gain from switching to pd.DataFrameGroupBy.nunique, and removing the sorting adds a little extra speed. The combination of both gives an "identical" solution (up to sorting) and is nearly 100x faster when there are many groups.

import perfplot
import pandas as pd
import numpy as np

def agg_lambda(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', lambda x: x.nunique()))
    
def agg_nunique(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', 'nunique'))

def agg_nunique_nosort(df):
    return df.groupby(['id', 'status'], sort=False).agg(uniquecount=('Col4', 'nunique'))

perfplot.show(
    setup=lambda N: pd.DataFrame({'Col1': range(N),
                       'status': np.random.choice(np.arange(N), N),
                       'id': np.random.choice(np.arange(N), N),
                       'Col4': np.random.choice(np.arange(N), N)}),
    kernels=[
        lambda df: agg_lambda(df),
        lambda df: agg_nunique(df),
        lambda df: agg_nunique_nosort(df),
    ],
    labels=['Agg Lambda', 'Agg Nunique', 'Agg Nunique, No sort'],
    n_range=[2 ** k for k in range(20)],
    # Equality check same data, just allow for different sorting
    equality_check=lambda x,y: x.sort_index().compare(y.sort_index()).empty,
    xlabel="~ Number of Groups"
)

[perfplot benchmark: "Agg Lambda" vs "Agg Nunique" vs "Agg Nunique, No sort" runtime against the number of groups]

  • An excellent answer, thank you very much. I incorporated your feedback and indeed it is running faster now. Because it is still taking quite a while, I will keep the question open for now to allow for discussion of other solutions, such as in numpy. – Dustin Aug 08 '20 at 14:05
  • @Dustin agreed, there may be a faster way. – ALollz Aug 08 '20 at 14:32