10

To obtain results executed on groupby-data with the same level of detail as the original DataFrame (same observation count) I have used the transform function.

Example: Original dataframe

name, year, grade
Jack, 2010, 6
Jack, 2011, 7
Rosie, 2010, 7
Rosie, 2011, 8

After groupby transform

name, year, grade, average grade
Jack, 2010, 6, 6.5
Jack, 2011, 7, 6.5
Rosie, 2010, 7, 7.5
Rosie, 2011, 8, 7.5

However, with more advanced functions based on multiple columns things get more complicated. What puzzles me is that I seem to be unable to access multiple columns in a groupby-transform combination.

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[1,2,3,4,5,6],
               'c':['q', 'q', 'q', 'q', 'w', 'w'],  
               'd':['z','z','z','o','o','o']})

def f(x):
 y=sum(x['a'])+sum(x['b'])
 return(y)

df['e'] = df.groupby(['c','d']).transform(f)

Gives me:

KeyError: ('a', 'occurred at index a')

Though I know that following does work:

df.groupby(['c','d']).apply(f)

What causes this behavior and how can I obtain something like this:

a   b   c   d   e
1   1   q   z   12
2   2   q   z   12
3   3   q   z   12
4   4   q   o   8
5   5   w   o   22
6   6   w   o   22
jpp
  • 159,742
  • 34
  • 281
  • 339
Willem
  • 593
  • 1
  • 8
  • 25
  • Have a look at this [answer](https://stackoverflow.com/a/27951930/9274732) it may help you to understand why – Ben.T Nov 08 '18 at 17:01
  • I observed the fact that columns become scalars by using print(x) within function f. But how do I overcome this? – Willem Nov 08 '18 at 17:07

2 Answers2

13

for this particular case you could do:

g = df.groupby(['c', 'd'])

df['e'] = g.a.transform('sum') + g.b.transform('sum')

df
# outputs

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

if you can construct the final result by a linear combination of the independent transforms on the same groupby, this method would work.

otherwise, you'd use a groupby-apply and then merge back to the original df.

example:

_ = df.groupby(['c','d']).apply(lambda x: sum(x.a+x.b)).rename('e').reset_index()
df.merge(_, on=['c','d'])
# same output as above.
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
  • Just realized I've posted the exact same alternative solution using a merge :) – yatu Nov 08 '18 at 17:15
  • For some reason rename causes items = [func(x) for x in index] TypeError: 'str' object is not callable. – Willem Nov 08 '18 at 17:19
  • In general you should avoid using built-ins with Pandas series / NumPy arrays. In fact, using the string `'sum'` instead of `sum` gets routed by Pandas to a tried and tested method which you should be able to rely on. – jpp Nov 08 '18 at 18:12
  • why do you `.reset_index()`? isn't `merge` faster with the index? and the index will guarantee the correct row ordering (some apply operations change ordering within groups) – william_grisaitis Mar 28 '23 at 18:46
2

You can use GroupBy + transform with sum twice:

df['e'] = df.groupby(['c', 'd'])[['a', 'b']].transform('sum').sum(1)

print(df)

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
jpp
  • 159,742
  • 34
  • 281
  • 339