
Consider the following DataFrame:

import numpy as np
import pandas as pd

np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'active': np.random.choice([0,1], size=ix.shape[0])
    },
    index=ix
)
df.sample(10)

yielding:

                     active       val
2017-01-02 06:05:00       1  0.774654
2017-01-04 08:15:00       1  0.934796
2017-01-13 01:02:00       0  0.792351
...

My objective is to compute:

  • the sum of val per day
  • the sum of val over rows where active == 1, per day

Sum per day. This one is straightforward:

gb = df.groupby(pd.to_datetime(df.index.date))
overall_sum_per_day = gb['val'].sum().rename('overall')

Sum of actives per day. This is a little trickier (see this).

# filter each day's rows to active == 1, then sum 'val'
active_sum_per_day = gb.apply(lambda x: x.loc[x.active == 1, 'val'].sum()).rename('active')
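
Since active is a 0/1 flag, one equivalent sketch (my own variant, not from the linked answer) is to weight the values by the flag first and then sum:

# multiply each value by the 0/1 active flag, so inactive rows contribute 0
active_sum_per_day = (df['val'] * df['active']).groupby(pd.to_datetime(df.index.date)).sum().rename('active')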

My question. How can I combine the two? Using concat:

pd.concat([overall_sum_per_day, active_sum_per_day], axis=1)

I can achieve my objective. But I fail to achieve it in one go, applying the two aggregations at once. Is that possible? See this comment.

Dror

3 Answers


You can use GroupBy.apply:

b = gb.apply(lambda x: pd.Series([x['val'].sum(), x.loc[x.active==1, 'val'].sum()], 
                                  index=['overall', 'active']))
print(b)
               overall      active
2017-01-01  715.997165  366.856234
2017-01-02  720.101832  355.100828
2017-01-03  711.247370  335.231948
2017-01-04  713.688122  338.088299
2017-01-05  716.127970  342.889442
2017-01-06  697.319129  338.741027
2017-01-07  708.121948  361.086977
2017-01-08  731.032093  370.697884
2017-01-09  718.386679  342.162494
2017-01-10  709.706473  349.657514
2017-01-11  720.477342  368.407343
2017-01-12  738.286682  378.618305
2017-01-13  735.805583  372.039108
2017-01-14  727.502271  345.612816
2017-01-15    0.613559    0.613559

Another solution:

b = (gb.agg(lambda x: [x['val'].sum(), x.loc[x.active==1, 'val'].sum()])
       .rename(columns={'val':'overall'}))
print(b)
                active     overall
2017-01-01  715.997165  366.856234
2017-01-02  720.101832  355.100828
2017-01-03  711.247370  335.231948
2017-01-04  713.688122  338.088299
2017-01-05  716.127970  342.889442
2017-01-06  697.319129  338.741027
2017-01-07  708.121948  361.086977
2017-01-08  731.032093  370.697884
2017-01-09  718.386679  342.162494
2017-01-10  709.706473  349.657514
2017-01-11  720.477342  368.407343
2017-01-12  738.286682  378.618305
2017-01-13  735.805583  372.039108
2017-01-14  727.502271  345.612816
2017-01-15    0.613559    0.613559
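
On pandas 0.25 or newer, a similar result can be sketched with named aggregation after pre-multiplying val by the 0/1 flag (the helper column name active_val is my own, not from the question):

# active_val is 'val' zeroed out wherever active == 0
b = (df.assign(active_val=df['val'] * df['active'])
       .groupby(pd.to_datetime(df.index.date))
       .agg(overall=('val', 'sum'), active=('active_val', 'sum')))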
jezrael

IIUC we can do it in one step, working with your original DF:

In [105]: df.groupby([df.index.normalize(), 'active'])['val'] \
     ...:   .sum() \
     ...:   .unstack(fill_value=0) \
     ...:   .rename(columns={0:'overall', 1:'active'}) \
     ...:   .assign(overall=lambda x: x['overall'] + x['active'])
Out[105]:
active         overall      active
2017-01-01  715.997165  366.856234
2017-01-02  720.101832  355.100828
2017-01-03  711.247370  335.231948
2017-01-04  713.688122  338.088299
2017-01-05  716.127970  342.889442
...                ...         ...
2017-01-11  720.477342  368.407343
2017-01-12  738.286682  378.618305
2017-01-13  735.805583  372.039108
2017-01-14  727.502271  345.612816
2017-01-15    0.613559    0.613559

[15 rows x 2 columns]

Explanation:

In [64]: df.groupby([df.index.normalize(), 'active'])['val'].sum()
Out[64]:
            active
2017-01-01  0         349.140931
            1         366.856234
2017-01-02  0         365.001004
            1         355.100828
2017-01-03  0         376.015422
                         ...
2017-01-13  0         363.766475
            1         372.039108
2017-01-14  0         381.889455
            1         345.612816
2017-01-15  1           0.613559
Name: val, Length: 29, dtype: float64

In [65]: df.groupby([df.index.normalize(), 'active'])['val'].sum().unstack(fill_value=0)
Out[65]:
active               0           1
2017-01-01  349.140931  366.856234
2017-01-02  365.001004  355.100828
2017-01-03  376.015422  335.231948
2017-01-04  375.599823  338.088299
2017-01-05  373.238528  342.889442
...                ...         ...
2017-01-11  352.069999  368.407343
2017-01-12  359.668377  378.618305
2017-01-13  363.766475  372.039108
2017-01-14  381.889455  345.612816
2017-01-15    0.000000    0.613559

[15 rows x 2 columns]
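
The remaining links in the chain just relabel the 0/1 indicator columns and fold the inactive sum into the total; spelled out step by step (same logic as the one-liner above):

per_day = df.groupby([df.index.normalize(), 'active'])['val'].sum().unstack(fill_value=0)
# relabel the indicator columns, then add the inactive sums into 'overall'
result = (per_day.rename(columns={0: 'overall', 1: 'active'})
                 .assign(overall=lambda x: x['overall'] + x['active']))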
MaxU - stand with Ukraine

I think it's cleaner to do the grouping with pd.Grouper, which is built for datetime grouping. You can also define a function for clarity.

def func(df):
    active = (df['active'] * df['val']).sum()
    overall = df['val'].sum()
    return pd.Series(data=[active, overall], index=['active','overall'])

df.groupby(pd.Grouper(freq='d')).apply(func)

                active     overall
2017-01-01  366.856234  715.997165
2017-01-02  355.100828  720.101832
2017-01-03  335.231948  711.247370
2017-01-04  338.088299  713.688122
2017-01-05  342.889442  716.127970
2017-01-06  338.741027  697.319129
2017-01-07  361.086977  708.121948
2017-01-08  370.697884  731.032093
2017-01-09  342.162494  718.386679
2017-01-10  349.657514  709.706473
2017-01-11  368.407343  720.477342
2017-01-12  378.618305  738.286682
2017-01-13  372.039108  735.805583
2017-01-14  345.612816  727.502271
2017-01-15    0.613559    0.613559

You should be able to perform this operation with resample and apply, but there is a bug.

df.resample('d').apply(func) # should work but doesn't produce correct output

                active  val
2017-01-01  366.856234  NaN
2017-01-02  355.100828  NaN
2017-01-03  335.231948  NaN
2017-01-04  338.088299  NaN
2017-01-05  342.889442  NaN
2017-01-06  338.741027  NaN
2017-01-07  361.086977  NaN
2017-01-08  370.697884  NaN
2017-01-09  342.162494  NaN
2017-01-10  349.657514  NaN
2017-01-11  368.407343  NaN
2017-01-12  378.618305  NaN
2017-01-13  372.039108  NaN
2017-01-14  345.612816  NaN
2017-01-15    0.613559  NaN
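
Until that is fixed, one hedged workaround is to resample the two needed series separately and concatenate, which sidesteps the DataFrame-level apply entirely:

# per-series resample avoids the buggy frame-wise apply
active = (df['val'] * df['active']).resample('d').sum().rename('active')
overall = df['val'].resample('d').sum().rename('overall')
res = pd.concat([active, overall], axis=1)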
Ted Petrou