Calculating weighted average using grouped .agg in pandas

Question

I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.

One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg, but it still doesn't seem very flexible because the weights column is hard coded in the lambda function definition. I'm looking to create a syntax closer to this:

(
df
.groupby(['group'])
.agg(avg_x=('x', 'mean'),
     wt_avg_y=('y', 'weighted_mean', weights='weight')
)

Here is a fully worked example with code that seems needlessly complicated:

import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>    group   x   y   weights
#>>> 0      a   1   5   0.75
#>>> 1      a   2   6   0.25
#>>> 2      b   3   7   0.75
#>>> 3      b   4   8   0.25

# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})

# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25

Does this answer your question? [Calculate weighted average using a pandas/dataframe](https://stackoverflow.com/questions/26205922/calculate-weighted-average-using-a-pandas-dataframe) — Grzegorz Skibinski, May 15 '20 at 20:36
No because it requires two separate calculations one with an .agg and another with .apply for the weighted average. — Steven M. Mortimer, May 15 '20 at 20:38

Steven M. Mortimer · Answer 1 · 2020-05-15T21:22:11.650

Using the .apply() method on the entire DataFrame was the simplest solution I could arrive to that does not hardcode the column name inside the function definition.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)
# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25

Rostyslav Shevchenko · Answer 2 · 2020-05-15T20:56:14.017

0

How about this:

grouped = df.groupby('group')

def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group

grouped.apply(wavg)

edited May 15 '20 at 20:56

answered May 15 '20 at 20:41

Rostyslav Shevchenko

201
2
6

This is the solution that I linked to. It hard codes the weights column in the function definition. I would prefer to pass them in as an argument. I think the problem is hard to solve because agg only operations on a single column at a time. – Steven M. Mortimer May 15 '20 at 20:47
Ah, sorry I didn't see that you linked it. Indeed, egg is acting on a single column. One can play around with apply tho. Something like this: ```grouped = df.groupby('group') def wa(group): return group['x'].mean(),np.average(group['y'], weights=group.loc[:, "weights"]) grouped.apply(wa)``` It returns a tuple tho and not two columns. But maybe there is a workaround – Rostyslav Shevchenko May 15 '20 at 20:53

Quang Hoang · Answer 3 · 2020-05-15T21:15:27.530

0

Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:

(df.assign(wt_avg_y=df['y']*df['weights'])
  .groupby('group')
  .agg({'x': 'mean', 'wt_avg_y':'sum', 'weights':'sum'})
  .assign(wt_avg_y=lambda x: x['wt_avg_y']/ x['weights'])
)

Output:

         x  wt_avg_y  weights
group                        
a      1.5      5.25      1.0
b      3.5      7.25      1.0

edited May 15 '20 at 21:15

answered May 15 '20 at 21:03

Quang Hoang

146,074
10
56
74

@StevenM.Mortimer see update for a solution that does not requires `weights` adding up to `1`. – Quang Hoang May 15 '20 at 21:15

score 0 · Answer 4 · answered May 15 '20 at 21:08

Try:

df["weights"]=df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"]=df["y"].mul(df["weights"])

res=df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})

Outputs:

  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25

score 0 · Answer 5 · answered Feb 20 '23 at 04:51

Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x' : np.mean(x['x']),
            'wt_avg_y':  np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)

Calculating weighted average using grouped .agg in pandas

5 Answers5