pandas and groupby: how to calculate weighted averages within an agg

Question

I calculate a number of aggregate functions using groupby and agg , because I need different aggregate functions for different variables, e.g. not the sum of all, but sum and mean of x, mean of y, etc.

Is there a way to calculate a weighted average using agg? I have found lots of examples, but none with agg.

I can calculate the weighted average manually, as in the code below (note the lines with **), but I was wondering if there is a more elegant and direct way?

Can I create my own function and use that with agg?

For the sake of clarity, I fully understand there are other solutions, e.g.

Pandas DataFrame aggregate function using multiple columns
groupby weighted average and sum in pandas dataframe
Calculate weighted average with pandas dataframe

and lots, lots more. However, as I said, I am not sure how to implement these solutions with an agg, and I need agg because I need to apply different aggregate functions to different columns (again, not the sum of all, but sum and mean of x, mean of y, etc.).

import numpy as np
import pandas as pd
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
**df['c * b']= df['c']* df['b']**
g = df.groupby('a').agg(
        {'b':['sum', lambda x: x.sum() / df['b'] .sum(), 'mean'],
              'c':['sum','mean'], 'd':['sum'],
              'c * b':['sum']})
g.columns = g.columns.map('_'.join)
**g['weighted average of c'] = g['c * b_sum'] / g['b_sum']**

Running 'c * b':[lambda x: x.sum() / df['b'].sum() ] doesn't work, because this divides each row of [c * b] by the sum of b, i.e. it divides each row by the same number, which is not what I need here. — Pythonista anonymous, Oct 12 '17 at 16:30
Can you give sample input with maybe a fixed random seed then your expected output? `np.random.seed(123)` — Scott Boston, Oct 12 '17 at 16:31
Sure. Setting np.random.seed(123) the groupby returns 3 rows, and the weighted averages are: [6, 6.06 , 6.05]. This can be seen in the column where I calculate it manually (the line of code with ** at the bottom). — Pythonista anonymous, Oct 12 '17 at 16:39
Why not just do means for the selected variables and then std's for the other selected variables. Sometimes a multi-line approach can actually be much (syntax-wise) than a single line approach, and is no less efficient — JohnE, Oct 12 '17 at 16:47

jezrael · Accepted Answer · 2017-10-13T05:28:59.977

4

Is it possible, but really complicated:

np.random.seed(234)
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])

wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'

f = lambda x: x.sum() / df['b'] .sum()
f.__name__ = '%'

g = df.groupby('a').agg(
        {'b':['sum', f, 'mean', wm],
         'c':['sum','mean'], 
         'd':['sum']})
g.columns = g.columns.map('_'.join)
print (g)

   d_sum  c_sum    c_mean  b_sum       b_%    b_mean      b_wa
a                                                             
5   2104   2062  5.976812   2067  0.344672  5.991304  5.969521
6   1859   1857  5.951923   1875  0.312656  6.009615  5.954667
7   2058   2084  6.075802   2055  0.342671  5.991254  6.085645

Solution with apply:

def func(x):
#    print (x)
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()

    c1 = x['c'].sum()
    c2 = x['c'].mean()

    d1 = x['d'].sum()
    cols = ['b sum','b %','wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1,b2,b3,b4,c1,c2,d1], index=cols)


g = df.groupby('a').apply(func)
print (g)
    b sum       b %        wa    b mean   c sum    c mean   d sum
a                                                                
5  2067.0  0.344672  5.969521  5.991304  2062.0  5.976812  2104.0
6  1875.0  0.312656  5.954667  6.009615  1857.0  5.951923  1859.0
7  2055.0  0.342671  6.085645  5.991254  2084.0  6.075802  2058.0

g.loc['total']=g.sum()
print (g)
        b sum       b %         wa     b mean   c sum     c mean   d sum
a                                                                       
5      2067.0  0.344672   5.969521   5.991304  2062.0   5.976812  2104.0
6      1875.0  0.312656   5.954667   6.009615  1857.0   5.951923  1859.0
7      2055.0  0.342671   6.085645   5.991254  2084.0   6.075802  2058.0
total  5997.0  1.000000  18.009832  17.992173  6003.0  18.004536  6021.0

edited Oct 13 '17 at 05:28

answered Oct 12 '17 at 16:37

jezrael

822,522
95
1,334
1,252

So I need to create one function per each column? I cannot just create a generic function which takes two inputs (values and weights) and use that within agg() ? – Pythonista anonymous Oct 12 '17 at 16:42
1

No,unfortunately it is not possible because `agg` always working only with one column. Function `apply` can working with multiple columns nice. – jezrael Oct 12 '17 at 16:43
Forgive me for the banal question, but, as you will have guessed, I am very much a novice! Could I use apply in my example, ie when I need to use different aggregate functions for different columns (sum and min of x, mean of y, max of z, etc).? – Pythonista anonymous Oct 12 '17 at 16:45
@Pythonistaanonymous - It is not shame you are novice, each coder starts some way. I add solution with apply, it is used if need some complicated aggregate functions. Please check edited answer. – jezrael Oct 12 '17 at 16:56
Super! Maybe it should be another question, but how would you add a row at the bottom with the totals, in your solution? – Pythonista anonymous Oct 12 '17 at 21:47
Use `g.loc['total']=g.sum()` – jezrael Oct 13 '17 at 04:12
You need call it outside apply or outside. – jezrael Oct 13 '17 at 04:18
One thing - if my answers was helpful, please dont forget accept it. Thanks. – jezrael Oct 13 '17 at 05:29
g.sum sums everything, whereas not all columns are sums: some are mean, max, weighted avg – Pythonista anonymous Oct 13 '17 at 06:00
OK, then use `g.loc['total']=g.filter(like='sum').sum()` - but is necessary all sum column have `sum` in column names. – jezrael Oct 13 '17 at 06:02
Another solution is seelct columns by names - `g.loc['total']=g[['b sum','c sum','d sum']].sum()` – jezrael Oct 13 '17 at 06:05
Or I could create a column of all 1, group by that and apply the same func, so that I get the sun of all the avg of all, etc where needed. I'd then need to do a union (in SQL speak, concat in pandas) to append the total row to the original dataframe. But I wonder if there's a more elegant way. – Pythonista anonymous Oct 13 '17 at 07:01
1

Yes, `loc` or `concat` or `join` - all solutions working perfectly. – jezrael Oct 13 '17 at 07:02

pandas and groupby: how to calculate weighted averages within an agg

1 Answers1

Linked