Avoiding apply to a pandas groupby to do multifunction operation

Question

I'm trying to apply multiple functions to different columns in pandas. My dataframe consists of over 10M rows and over 100K groups. I'm trying to similar operations as in this (Example below), but it takes a long time. I tried using dask, but that didn't help either.

Any suggestions on how to improve this example below?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

Grayrigel · Answer 1 · 2020-10-07T10:29:38.613

Does it help to aggregate them with .agg?

import pandas as pd

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df['c_d_prod'] = df['c'] * df['d']


df = df.groupby('group').agg({'a' : ['sum', 'max'], 'b' : ['mean'], 'c_d_prod': ['sum'] })
print(df)

Output:

              a                   b  c_d_prod
            sum       max      mean       sum
group                                        
0      1.693675  0.966228  0.500866  0.155463
1      0.950398  0.786002  0.355562  0.557794

You can rename the columns if you like:

df.columns = ['a_sum', 'a_max', 'b_mean', 'c_d_prodsum']
print(df)

Output:

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.899459  0.736511  0.233027     1.287123
1      0.913862  0.654808  0.730330     0.177089

Not from me :-) I like your answer. it is simple, but jezrael answered early, so ticking his — tandem, Oct 07 '20 at 09:35

score 1 · Accepted Answer · answered Oct 07 '20 at 09:08

Use named aggregation for avoid MultiIndex in columns and for c_d_prodsum are multiple columns to helper column:

np.random.seed(2020)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df1 = (df.assign(tmp=df['c'] * df['d'])
        .groupby('group').agg(a_sum =('a', 'sum'),
                              a_max =('a', 'max'),
                              b_mean = ('b', 'mean'),
                              c_d_prodsum = ('tmp', 'sum')))

print (df1)
          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      1.323196  0.986277  0.545173     0.233486
1      1.598484  0.862159  0.256181     0.334105

Avoiding apply to a pandas groupby to do multifunction operation

2 Answers2