4

I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.

One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg, but it still doesn't seem very flexible because the weights column is hard coded in the lambda function definition. I'm looking to create a syntax closer to this:

(
df
.groupby(['group'])
.agg(avg_x=('x', 'mean'),
     wt_avg_y=('y', 'weighted_mean', weights='weight')
)

Here is a fully worked example with code that seems needlessly complicated:

import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>    group   x   y   weights
#>>> 0      a   1   5   0.75
#>>> 1      a   2   6   0.25
#>>> 2      b   3   7   0.75
#>>> 3      b   4   8   0.25

# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})

# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25
Steven M. Mortimer
  • 1,618
  • 14
  • 36
  • Does this answer your question? [Calculate weighted average using a pandas/dataframe](https://stackoverflow.com/questions/26205922/calculate-weighted-average-using-a-pandas-dataframe) – Grzegorz Skibinski May 15 '20 at 20:36
  • No because it requires two separate calculations one with an .agg and another with .apply for the weighted average. – Steven M. Mortimer May 15 '20 at 20:38

5 Answers5

8

Using the .apply() method on the entire DataFrame was the simplest solution I could arrive to that does not hardcode the column name inside the function definition.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)
# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25
Steven M. Mortimer
  • 1,618
  • 14
  • 36
0

How about this:

grouped = df.groupby('group')

def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group

grouped.apply(wavg)
  • This is the solution that I linked to. It hard codes the weights column in the function definition. I would prefer to pass them in as an argument. I think the problem is hard to solve because agg only operations on a single column at a time. – Steven M. Mortimer May 15 '20 at 20:47
  • Ah, sorry I didn't see that you linked it. Indeed, egg is acting on a single column. One can play around with apply tho. Something like this: ```grouped = df.groupby('group') def wa(group): return group['x'].mean(),np.average(group['y'], weights=group.loc[:, "weights"]) grouped.apply(wa)``` It returns a tuple tho and not two columns. But maybe there is a workaround – Rostyslav Shevchenko May 15 '20 at 20:53
0

Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:

(df.assign(wt_avg_y=df['y']*df['weights'])
  .groupby('group')
  .agg({'x': 'mean', 'wt_avg_y':'sum', 'weights':'sum'})
  .assign(wt_avg_y=lambda x: x['wt_avg_y']/ x['weights'])
) 

Output:

         x  wt_avg_y  weights
group                        
a      1.5      5.25      1.0
b      3.5      7.25      1.0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
0

Try:

df["weights"]=df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"]=df["y"].mul(df["weights"])

res=df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})

Outputs:

  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
0

Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x' : np.mean(x['x']),
            'wt_avg_y':  np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)
PingPong
  • 355
  • 2
  • 11