1

I'm trying to apply multiple functions to different columns in pandas. My dataframe consists of over 10M rows and over 100K groups. I'm trying to similar operations as in this (Example below), but it takes a long time. I tried using dask, but that didn't help either.

Any suggestions on how to improve this example below?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)
tandem
  • 2,040
  • 4
  • 25
  • 52

2 Answers2

1

Does it help to aggregate them with .agg?

import pandas as pd

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df['c_d_prod'] = df['c'] * df['d']


df = df.groupby('group').agg({'a' : ['sum', 'max'], 'b' : ['mean'], 'c_d_prod': ['sum'] })
print(df)

Output:

              a                   b  c_d_prod
            sum       max      mean       sum
group                                        
0      1.693675  0.966228  0.500866  0.155463
1      0.950398  0.786002  0.355562  0.557794

You can rename the columns if you like:

df.columns = ['a_sum', 'a_max', 'b_mean', 'c_d_prodsum']
print(df)

Output:

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.899459  0.736511  0.233027     1.287123
1      0.913862  0.654808  0.730330     0.177089
Grayrigel
  • 3,474
  • 5
  • 14
  • 32
1

Use named aggregation for avoid MultiIndex in columns and for c_d_prodsum are multiple columns to helper column:

np.random.seed(2020)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df1 = (df.assign(tmp=df['c'] * df['d'])
        .groupby('group').agg(a_sum =('a', 'sum'),
                              a_max =('a', 'max'),
                              b_mean = ('b', 'mean'),
                              c_d_prodsum = ('tmp', 'sum')))

print (df1)
          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      1.323196  0.986277  0.545173     0.233486
1      1.598484  0.862159  0.256181     0.334105
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252