Ultimate Question
Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?
Input
df = pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])
   id  begin_date status
0   1  2020-02-01      a
1   1  2020-02-10      b
2   1  2020-02-17      c
3   2  2020-02-02      d
4   2  2020-03-06      b
5   2  2020-04-17      c
Desired Output
   id status  count  uniquecount
0   1      a      1            1
1   1      b      1            1
2   1      c      1            1
3   2      b      1            1
4   2      c      1            1
5   2      d      1            1
Problem
Now, there is an easy way to do that in Python, using pandas:
df = df.groupby(["id", "status"]).agg(count=("begin_date", "count"), uniquecount=("begin_date", "nunique")).reset_index()
# Passing the string "nunique" instead of a lambda lets pandas use its optimized implementation and is noticeably faster. Thanks!
This operation is slow for larger datasets; I'd guess it scales closer to O(n²) than to O(n), especially when a Python-level lambda is called once per group.
Existent solutions that lack the desired general applicability
Now, after some googling, there are some alternative solutions on StackOverflow, using numpy, iterrows, or other approaches:
Faster alternative to perform pandas groupby operation
Pandas fast weighted random choice from groupby
And an excellent one:
Groupby in python pandas: Fast Way
These solutions generally aim to produce the "count" or "uniquecount" from my example, i.e. a single aggregated value. Unfortunately, they always handle only one aggregation at a time, never multiple groupby columns, and they never explain how to merge the result back into the grouped dataframe.
Is there a way to use itertools (like this answer: Faster alternative to perform pandas groupby operation, or even better this answer: Groupby in python pandas: Fast Way) that returns not only the series "count", but the whole dataframe in grouped form?
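For reference, here is a rough sketch of how the itertools.groupby approach from those answers could be extended to return the whole grouped dataframe with both aggregations, rather than a single series (the sorting step and output layout are my assumptions, not part of the linked answers):

```python
import itertools
import pandas as pd

df = pd.DataFrame(
    [[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'],
     [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']],
    columns=['id', 'begin_date', 'status'])

# itertools.groupby only groups *consecutive* equal keys, so sort first.
rows = sorted(df[['id', 'status', 'begin_date']].itertuples(index=False))

out = []
for key, group in itertools.groupby(rows, key=lambda r: (r.id, r.status)):
    dates = [r.begin_date for r in group]
    # count = number of rows in the group, uniquecount = distinct dates
    out.append((*key, len(dates), len(set(dates))))

grouped = pd.DataFrame(out, columns=['id', 'status', 'count', 'uniquecount'])
```

This avoids pd.groupby entirely, but note that the Python-level loop still runs once per group, so it mainly wins when there are few groups relative to rows.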
Ultimate Question
Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?
This would look something like this:
from typing import Dict, List, Tuple

def fastGroupby(df, groupbyColumns: List[str], aggregateColumns: Dict[str, Tuple[str, str]]):
    # numpy / iterrow magic
    return df_grouped

df = fastGroupby(df, ["id", "status"], {"count": ("begin_date", "count"),
                                        "uniquecount": ("begin_date", "nunique")})
And return the desired output.
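For what it's worth, here is one possible sketch of such a function built on np.unique instead of pd.groupby, hard-coded to the two aggregations above. The mixed-radix encoding of the key columns and the name fast_groupby are my own choices, not an established recipe:

```python
import numpy as np
import pandas as pd

def fast_groupby(df, groupby_columns, agg_column):
    # Encode each key column as integer codes via np.unique.
    codes = [np.unique(df[c].to_numpy(), return_inverse=True)[1]
             for c in groupby_columns]
    # Combine the per-column codes into one group id (mixed-radix encoding).
    group_id = codes[0]
    for c in codes[1:]:
        group_id = group_id * (c.max() + 1) + c
    uniq, inv, counts = np.unique(group_id, return_inverse=True,
                                  return_counts=True)
    # uniquecount: number of distinct (group, value) pairs per group.
    vals = np.unique(df[agg_column].to_numpy(), return_inverse=True)[1]
    radix = vals.max() + 1
    pair_groups = np.unique(group_id * radix + vals) // radix
    uniquecount = np.bincount(np.searchsorted(uniq, pair_groups),
                              minlength=len(uniq))
    # Recover the key columns from the first row of each group.
    order = np.argsort(inv, kind='stable')
    first = order[np.searchsorted(inv[order], np.arange(len(uniq)))]
    out = df.iloc[first][groupby_columns].reset_index(drop=True)
    out['count'] = counts
    out['uniquecount'] = uniquecount
    return out

df = pd.DataFrame(
    [[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'],
     [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']],
    columns=['id', 'begin_date', 'status'])

grouped = fast_groupby(df, ['id', 'status'], 'begin_date')
```

Everything here is vectorized except the per-column encoding loop, so it should scale roughly with n log n (the sorts inside np.unique); whether it actually beats pd.groupby would need benchmarking on realistic data.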