Pandas groupby + transform and multiple columns

Question

To obtain results executed on groupby-data with the same level of detail as the original DataFrame (same observation count) I have used the transform function.

Example: Original dataframe

name, year, grade
Jack, 2010, 6
Jack, 2011, 7
Rosie, 2010, 7
Rosie, 2011, 8

After groupby transform

name, year, grade, average grade
Jack, 2010, 6, 6.5
Jack, 2011, 7, 6.5
Rosie, 2010, 7, 7.5
Rosie, 2011, 8, 7.5

However, with more advanced functions based on multiple columns things get more complicated. What puzzles me is that I seem to be unable to access multiple columns in a groupby-transform combination.

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[1,2,3,4,5,6],
               'c':['q', 'q', 'q', 'q', 'w', 'w'],  
               'd':['z','z','z','o','o','o']})

def f(x):
 y=sum(x['a'])+sum(x['b'])
 return(y)

df['e'] = df.groupby(['c','d']).transform(f)

Gives me:

KeyError: ('a', 'occurred at index a')

Though I know that following does work:

df.groupby(['c','d']).apply(f)

What causes this behavior and how can I obtain something like this:

a   b   c   d   e
1   1   q   z   12
2   2   q   z   12
3   3   q   z   12
4   4   q   o   8
5   5   w   o   22
6   6   w   o   22

Have a look at this [answer](https://stackoverflow.com/a/27951930/9274732) it may help you to understand why — Ben.T, Nov 08 '18 at 17:01
I observed the fact that columns become scalars by using print(x) within function f. But how do I overcome this? — Willem, Nov 08 '18 at 17:07

Haleemur Ali · Accepted Answer · 2019-01-07T18:28:12.567

13

for this particular case you could do:

g = df.groupby(['c', 'd'])

df['e'] = g.a.transform('sum') + g.b.transform('sum')

df
# outputs

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

if you can construct the final result by a linear combination of the independent transforms on the same groupby, this method would work.

otherwise, you'd use a groupby-apply and then merge back to the original df.

example:

_ = df.groupby(['c','d']).apply(lambda x: sum(x.a+x.b)).rename('e').reset_index()
df.merge(_, on=['c','d'])
# same output as above.

edited Jan 07 '19 at 18:28

answered Nov 08 '18 at 17:10

Haleemur Ali

26,718
5
61
85

Just realized I've posted the exact same alternative solution using a merge :) – yatu Nov 08 '18 at 17:15
For some reason rename causes items = [func(x) for x in index] TypeError: 'str' object is not callable. – Willem Nov 08 '18 at 17:19
In general you should avoid using built-ins with Pandas series / NumPy arrays. In fact, using the string `'sum'` instead of `sum` gets routed by Pandas to a tried and tested method which you should be able to rely on. – jpp Nov 08 '18 at 18:12
why do you `.reset_index()`? isn't `merge` faster with the index? and the index will guarantee the correct row ordering (some apply operations change ordering within groups) – william_grisaitis Mar 28 '23 at 18:46

score 2 · Answer 2 · answered Nov 08 '18 at 18:11

You can use GroupBy + transform with sum twice:

df['e'] = df.groupby(['c', 'd'])[['a', 'b']].transform('sum').sum(1)

print(df)

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

Pandas groupby + transform and multiple columns

2 Answers2

Linked

Related