1

I calculate a number of aggregate functions using groupby and agg , because I need different aggregate functions for different variables, e.g. not the sum of all, but sum and mean of x, mean of y, etc.

Is there a way to calculate a weighted average using agg? I have found lots of examples, but none with agg.

I can calculate the weighted average manually, as in the code below (note the lines with **), but I was wondering if there is a more elegant and direct way?

Can I create my own function and use that with agg?

For the sake of clarity, I fully understand there are other solutions, e.g.

import numpy as np
import pandas as pd
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
**df['c * b']= df['c']* df['b']**
g = df.groupby('a').agg(
        {'b':['sum', lambda x: x.sum() / df['b'] .sum(), 'mean'],
              'c':['sum','mean'], 'd':['sum'],
              'c * b':['sum']})
g.columns = g.columns.map('_'.join)
**g['weighted average of c'] = g['c * b_sum'] / g['b_sum']**
halfer
  • 19,824
  • 17
  • 99
  • 186
Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112
  • Running 'c * b':[lambda x: x.sum() / df['b'].sum() ] doesn't work, because this divides each row of [c * b] by the sum of b, i.e. it divides each row by the same number, which is not what I need here. – Pythonista anonymous Oct 12 '17 at 16:30
  • 1
    Can you give sample input with maybe a fixed random seed then your expected output? `np.random.seed(123)` – Scott Boston Oct 12 '17 at 16:31
  • Sure. Setting np.random.seed(123) the groupby returns 3 rows, and the weighted averages are: [6, 6.06 , 6.05]. This can be seen in the column where I calculate it manually (the line of code with ** at the bottom). – Pythonista anonymous Oct 12 '17 at 16:39
  • Why not just do means for the selected variables and then std's for the other selected variables. Sometimes a multi-line approach can actually be much (syntax-wise) than a single line approach, and is no less efficient – JohnE Oct 12 '17 at 16:47

1 Answers1

4

Is it possible, but really complicated:

np.random.seed(234)
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])

wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'

f = lambda x: x.sum() / df['b'] .sum()
f.__name__ = '%'

g = df.groupby('a').agg(
        {'b':['sum', f, 'mean', wm],
         'c':['sum','mean'], 
         'd':['sum']})
g.columns = g.columns.map('_'.join)
print (g)

   d_sum  c_sum    c_mean  b_sum       b_%    b_mean      b_wa
a                                                             
5   2104   2062  5.976812   2067  0.344672  5.991304  5.969521
6   1859   1857  5.951923   1875  0.312656  6.009615  5.954667
7   2058   2084  6.075802   2055  0.342671  5.991254  6.085645

Solution with apply:

def func(x):
#    print (x)
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()

    c1 = x['c'].sum()
    c2 = x['c'].mean()

    d1 = x['d'].sum()
    cols = ['b sum','b %','wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1,b2,b3,b4,c1,c2,d1], index=cols)


g = df.groupby('a').apply(func)
print (g)
    b sum       b %        wa    b mean   c sum    c mean   d sum
a                                                                
5  2067.0  0.344672  5.969521  5.991304  2062.0  5.976812  2104.0
6  1875.0  0.312656  5.954667  6.009615  1857.0  5.951923  1859.0
7  2055.0  0.342671  6.085645  5.991254  2084.0  6.075802  2058.0

g.loc['total']=g.sum()
print (g)
        b sum       b %         wa     b mean   c sum     c mean   d sum
a                                                                       
5      2067.0  0.344672   5.969521   5.991304  2062.0   5.976812  2104.0
6      1875.0  0.312656   5.954667   6.009615  1857.0   5.951923  1859.0
7      2055.0  0.342671   6.085645   5.991254  2084.0   6.075802  2058.0
total  5997.0  1.000000  18.009832  17.992173  6003.0  18.004536  6021.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • So I need to create one function per each column? I cannot just create a generic function which takes two inputs (values and weights) and use that within agg() ? – Pythonista anonymous Oct 12 '17 at 16:42
  • 1
    No,unfortunately it is not possible because `agg` always working only with one column. Function `apply` can working with multiple columns nice. – jezrael Oct 12 '17 at 16:43
  • Forgive me for the banal question, but, as you will have guessed, I am very much a novice! Could I use apply in my example, ie when I need to use different aggregate functions for different columns (sum and min of x, mean of y, max of z, etc).? – Pythonista anonymous Oct 12 '17 at 16:45
  • @Pythonistaanonymous - It is not shame you are novice, each coder starts some way. I add solution with apply, it is used if need some complicated aggregate functions. Please check edited answer. – jezrael Oct 12 '17 at 16:56
  • Super! Maybe it should be another question, but how would you add a row at the bottom with the totals, in your solution? – Pythonista anonymous Oct 12 '17 at 21:47
  • Use `g.loc['total']=g.sum()` – jezrael Oct 13 '17 at 04:12
  • You need call it outside apply or outside. – jezrael Oct 13 '17 at 04:18
  • One thing - if my answers was helpful, please dont forget accept it. Thanks. – jezrael Oct 13 '17 at 05:29
  • g.sum sums everything, whereas not all columns are sums: some are mean, max, weighted avg – Pythonista anonymous Oct 13 '17 at 06:00
  • OK, then use `g.loc['total']=g.filter(like='sum').sum()` - but is necessary all sum column have `sum` in column names. – jezrael Oct 13 '17 at 06:02
  • Another solution is seelct columns by names - `g.loc['total']=g[['b sum','c sum','d sum']].sum()` – jezrael Oct 13 '17 at 06:05
  • Or I could create a column of all 1, group by that and apply the same func, so that I get the sun of all the avg of all, etc where needed. I'd then need to do a union (in SQL speak, concat in pandas) to append the total row to the original dataframe. But I wonder if there's a more elegant way. – Pythonista anonymous Oct 13 '17 at 07:01
  • 1
    Yes, `loc` or `concat` or `join` - all solutions working perfectly. – jezrael Oct 13 '17 at 07:02