2

How can I calculate a column showing the % of total in a groupby?

One way to do it is to calculate it manually after the groupby, as in the last line of this example:

import numpy as np
import pandas as pd
df= pd.DataFrame(np.random.randint(5,8,(10,4)), columns=['a','b','c','d'])
g = df.groupby('a').agg({'b':['sum','mean'], 'c':['sum'], 'd':['sum']})
g.columns = g.columns.map('_'.join)
g['b %']=g['b_sum']/g['b_sum'].sum()

However, in my real data I have many more columns, and I'd need the % right after the sum, so with this approach I'd have to manually change the order of the columns.

Is there a more direct way of doing it so that the % is the column right after the sum? Note that I need the agg(), or something equivalent, because in all my groupbys I apply different aggregate functions to different columns (e.g. sum and avg of x, but only the min of y, etc.).

halfer
  • 19,824
  • 17
  • 99
  • 186
Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112

1 Answers1

2

I think you need lambda function in agg and then replace column names to %:

np.random.seed(78)
df= pd.DataFrame(np.random.randint(5,8,(10,4)), columns=['a','b','c','d'])

g =(df.groupby('a')
      .agg({'b':['sum',lambda x: x.sum()/ df['b'].sum(),'mean'],
            'c':['sum'], 
            'd':['sum']}))
g.columns = g.columns.map('_'.join).str.replace('<lambda>','%')

print (g)

   d_sum  c_sum  b_sum       b_%  b_mean
a                                       
5     25     24     24  0.387097       6
6     11     11     14  0.225806       7
7     22     23     24  0.387097       6
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks! Can I use a lambda function to calculate a weighted average in the groupby, too? – Pythonista anonymous Oct 12 '17 at 11:46
  • 1
    Hmm, do you need implement [this](https://stackoverflow.com/q/26205922/2901002) solution to `agg` ? – jezrael Oct 12 '17 at 11:50
  • I see you replace lambda with % in the column names. But what if I have multiple lambda functions for the same column? E.g. for column b I may do sum(b) and %, then count(b) and related %. How could I rename the columns now that, for reasons honestly beyond me, using a nested dictionary for renaming is deprecated? https://stackoverflow.com/questions/46694207/applying-different-aggregate-functions-to-different-columns-now-that-dict-with – Pythonista anonymous Oct 12 '17 at 12:14
  • Yes, it is not so easy. need replace not substring, but all columns like `g.columns = g.columns.map('_'.join)` and then `g = g.rename(columns={'b_':'b _%'})` – jezrael Oct 12 '17 at 12:18
  • yes, implementing the solution you linked to agg would be great! Do you know a way? – Pythonista anonymous Oct 12 '17 at 12:36
  • 1
    I think it is not possible. because need transform :( – jezrael Oct 12 '17 at 12:37
  • Thank you for accepting, can you accept this solution too? – jezrael Oct 13 '17 at 07:51