Create entirely new dataframe efficiently from groupby .agg() or .apply() in Pandas?

Question

I'd like to create a new dataframe from the results of groupby on another. The result should have one row per group (basically a vectorized map-reduce), and the new column names bear no relation to the existing names. This seems like a natural use for agg, but it only seems to produce existing columns.

d = pd.DataFrame({'a': [0,0,1,1], 'b': [3,4,5,6], 'c': [7,8,9,0]})

   a  b  c
0  0  3  7
1  0  4  8
2  1  5  9
3  1  6  0

agg() will create new columns with a Series:

d.groupby('a')['b'].agg({'x': lambda g: g.sum()})

    x
a    
0   7
1  11

But frustratingly not with a DataFrame:

d.groupby('a').agg({'x': lambda g: g.b.sum()})
KeyError: 'x'

I can do it by returning a one-row DataFrame from apply():

d.groupby('a').apply(lambda g: pd.DataFrame([{'x': g.b.mean(), 'y': (g.b * g.c).sum()}])).reset_index(level=1, drop=True)

     x   y
a         
0  3.5  53
1  5.5  45

but this is ugly and, as you can imagine, creating a new dict, list, and DataFrame for every row is slow for even modestly-sized inputs.

Maybe parfait's answer here would also work for you: http://stackoverflow.com/questions/35938393/pandas-aggregating-multiple-columns-with-multiple-functions/35944697#35944697 — JohnE, Mar 18 '16 at 00:08

Siraj S. · Answer 1 · 2016-03-30T05:44:37.583

1

maybe this could help

df = d.groupby('a')[['b','c']].sum()
df.index.name = None
df.columns=['b_sum','c_sum']

or if you want to compute separate statistics on individual columns, you can do something like this

df = d.groupby('a')[['b','c']].apply(lambda x: (x.b.mean(),x.c.sum())).apply(pd.Series)

edited Mar 30 '16 at 05:44

answered Mar 29 '16 at 20:15

Siraj S.

3,481
3
34
48

even this seems to work if you want to perform different function on different columns (eg b.mean() and c.sum()) df = d.groupby('a')[['b','c']].apply(lambda x: (x.b.mean(), x.c.sum())] df = df.apply(pd.Series) – Siraj S. Mar 29 '16 at 21:41
or more directly df = d.groupby('a')[['b','c']].apply(lambda x: (x.b.mean(), x.c.sum())).apply(pd.Series) – Siraj S. Mar 29 '16 at 21:53
You really should [edit] extra information into your answer, rather than tacking on comments that are impossible to read. – Ajean Mar 30 '16 at 00:12
`apply()`-ing `Series` to a column of dicts is a good trick. Thanks for showing me that. Pandas shouldn't make us do such hacks, but it's workable. – Doctor J Mar 30 '16 at 23:00

score 0 · Accepted Answer · answered Mar 30 '16 at 23:54

Here is a comparison of a few different ways to do it. I prefer returning a Series; reasonably succinct, clear, and efficient. Thanks to @Siraj S for the inspiration.

df = pd.DataFrame(np.random.rand(1000000, 5), columns=list('abcde'))
grp = df.groupby((df.a * 100).astype(int))


%timeit grp.apply(lambda g: pd.DataFrame([{'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}])).reset_index(level=1, drop=True)
1 loop, best of 3: 328 ms per loop

%timeit grp.apply(lambda g: (g.e.count(), (g.b * g.c).sum() / g.c.sum(), g.d.mean(), g.e.std())).apply(pd.Series)
1 loop, best of 3: 266 ms per loop

%timeit grp.apply(lambda g: pd.Series({'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}))
1 loop, best of 3: 265 ms per loop

%timeit grp.apply(lambda g: {'n': g.e.count(), 'x': (g.b * g.c).sum() / g.c.sum(), 'y': g.d.mean(), 'z': g.e.std()}).apply(pd.Series)
1 loop, best of 3: 273 ms per loop

%timeit pd.concat([grp.apply(lambda g: g.e.count()), grp.apply(lambda g: (g.b * g.c).sum() / g.c.sum()), grp.apply(lambda g: g.d.mean()), grp.apply(lambda g: g.e.std())], axis=1)
1 loop, best of 3: 708 ms per loop

Create entirely new dataframe efficiently from groupby .agg() or .apply() in Pandas?

2 Answers2