Weighted average on a GroupBy DataFrame with Multiple Columns and a Fractional Weight Column

Question

My question is similar to this and that but neither answer works for me.

I have a dataframe of users and user survey responses. Each survey response is assigned a weight, which is a fractional number (like 1.532342). Each user responds with ~20 scores, shown in this example as scoreA and scoreB.

user	weight	scoreA	scoreB
1	2	3	1
1	1	5	3
1	0.5	7	5
2	0.5	8	6
2	1	9	7
2	0.5	8	6

It's trivial to compute the average unweighted score for each column by way of scores.groupby('user').mean() but I'm struggling to compute the weighted score.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))
scores = df[['scoreA', 'scoreB']]
weights = df.weight

scores.groupby('user').mean()
>>> scoreA  scoreB
user        
1   5.000000    3.000000
2   8.333333    6.333333

scores.groupby('user').agg(lambda x: np.average(x, weights=weights))
>>> TypeError: Axis must be specified when shapes of a and weights differ.

What I want to output is:

df.drop(columns='weight').mul(df.weight,axis=0).groupby('user').sum().div(df.weight.groupby('user').sum(),axis=0)
scoreA  scoreB
user        
1   4.142857    2.142857
2   8.500000    6.500000
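
As a sanity check on the expected numbers, the weighted mean of scoreA for user 1 can be worked out by hand from the table above. This is only a sketch; the array names are illustrative and not part of the original code.

```python
import numpy as np

# Rows for user 1, copied from the example table in the question.
weights_user1 = np.array([2, 1, 0.5])
score_a_user1 = np.array([3, 5, 7])

# Weighted mean = sum(weight * score) / sum(weight)
#               = (2*3 + 1*5 + 0.5*7) / (2 + 1 + 0.5) = 14.5 / 3.5
by_hand = (weights_user1 * score_a_user1).sum() / weights_user1.sum()

# np.average gives the same value when the weights have the same length
# as the scores.
with_numpy = np.average(score_a_user1, weights=weights_user1)

print(by_hand, with_numpy)
```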

score 1 · Answer 1 · answered Feb 22 '23 at 14:00

For me it works to reset the index to the default one first, so that the matching weight values can be extracted correctly with DataFrame.loc:

df = df.reset_index()

df = (df.groupby('user')[['scoreA', 'scoreB']]
        .agg(lambda x: np.average(x, weights=df.loc[x.index, "weight"])))
print (df)
        scoreA    scoreB
user                    
1     4.142857  2.142857
2     8.500000  6.500000
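
The reason the default index seems to matter here can be sketched as follows: while 'user' is the index, the labels inside one group are all identical, so a label lookup with DataFrame.loc matches every row of that user once per repeated label. The variable names below are illustrative, and only scoreA is kept for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))

# With 'user' as the index, the group for user 1 carries the label 1
# three times, so looking those labels up returns more than three weights.
dup_labels = df.loc[[1]].index
n_dup = len(df.loc[dup_labels, 'weight'])

# After reset_index every row has a unique integer label, so the lookup
# returns exactly one weight per row of the group.
flat = df.reset_index()
unique_labels = flat[flat['user'] == 1].index
n_unique = len(flat.loc[unique_labels, 'weight'])

print(n_dup, n_unique)
```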

jezrael

ah, I was missing the reset_index(). Let me try on my dataset – Ilya Voytov Feb 22 '23 at 14:13
that's quite an ugly workaround as it relies on the external state of the dataframe (side effect), the provided solution at the end of the question is much cleaner – mozway Feb 22 '23 at 14:14
that might be why i can't get it to work using my separate `weights` and `scores` variables. Once you separate them, the lambda doesn't work. I do other processing on the scores before doing the averaging so it's neater to have them be separate. – Ilya Voytov Feb 22 '23 at 14:24
Then you can achieve it with `scores.groupby('user').apply(lambda g: np.average(g[cols], weights=weights.loc[g.name], axis=0))` without needing to `reset_index`. You then have to convert back to 2D as I showed in my answer – mozway Feb 22 '23 at 14:28

score 1 · Accepted Answer · answered Feb 22 '23 at 14:22

Your issue is that you try to provide the weights externally to numpy and numpy cannot perform index alignment.
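
A minimal way to see this outside of groupby, assuming plain NumPy arrays that stand in for one group's scores and for the full weight column (both names are illustrative):

```python
import numpy as np

# Inside the lambda, x holds only the rows of one user (three values here),
# while the externally supplied weights still cover all six rows.
group_scores = np.array([3, 5, 7])
all_weights = np.array([2, 1, 0.5, 0.5, 1, 0.5])

try:
    np.average(group_scores, weights=all_weights)
    raised = False
except (TypeError, ValueError) as err:
    # NumPy compares shapes only; it knows nothing about pandas labels.
    raised = True
    print(type(err).__name__, err)

# Passing only the weights that belong to the group works.
ok = np.average(group_scores, weights=all_weights[:3])
print(ok)
```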

The solution that you provided at the end of your question is likely the best approach.
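
If the scores and weights are kept in separate variables, that same two-step computation could be wrapped in a small helper. This is a sketch under the assumption that both objects share the identical 'user' index (as they do when sliced from the same dataframe); the function name grouped_weighted_mean is hypothetical.

```python
import pandas as pd

def grouped_weighted_mean(scores, weights, level='user'):
    """Per-group weighted mean: sum(w * score) / sum(w).

    Assumes `scores` (DataFrame) and `weights` (Series) have identical
    indexes, so the row-wise multiplication needs no realignment.
    """
    weighted_sum = scores.mul(weights, axis=0).groupby(level=level).sum()
    weight_total = weights.groupby(level=level).sum()
    return weighted_sum.div(weight_total, axis=0)

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))

out = grouped_weighted_mean(df[['scoreA', 'scoreB']], df['weight'])
print(out)
```

Because everything stays in vectorized pandas operations, no lambda needs to reach back into an outer dataframe.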

For a more generic approach (assuming a function that couldn't be split in two steps), you would need to use groupby.apply to access all the columns of the group at once:

def w_avg(g, cols):
    return pd.Series(np.average(g[cols], weights=g['weight'], axis=0),
                     index=cols)

df.groupby('user').apply(w_avg, cols=['scoreA', 'scoreB'])

Or:

cols = ['scoreA', 'scoreB']
s = (df.groupby('user')
       .apply(lambda g: np.average(g[cols], weights=g['weight'], axis=0))
     )
out = pd.DataFrame(s.to_list(), index=s.index, columns=cols)

Output:

        scoreA    scoreB
user                    
1     4.142857  2.142857
2     8.500000  6.500000
