My question is similar to this and that but neither answer works for me.

I have a dataframe of users and user survey responses. Each survey response is assigned a weight, which is a fractional number (like 1.532342). Each user responds with ~20 scores, shown in this example as scoreA and scoreB.

user  weight  scoreA  scoreB
1     2       3       1
1     1       5       3
1     0.5     7       5
2     0.5     8       6
2     1       9       7
2     0.5     8       6

It's trivial to compute the average unweighted score for each column by way of scores.groupby('user').mean() but I'm struggling to compute the weighted score.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))
scores = df[['scoreA', 'scoreB']]
weights = df.weight

scores.groupby('user').mean()
>>> scoreA  scoreB
user        
1   5.000000    3.000000
2   8.333333    6.333333

scores.groupby('user').agg(lambda x: np.average(x, weights=weights))
>>> TypeError: Axis must be specified when shapes of a and weights differ.

What I want to output is:

df.drop(columns='weight').mul(df.weight,axis=0).groupby('user').sum().div(df.weight.groupby('user').sum(),axis=0)
scoreA  scoreB
user        
1   4.142857    2.142857
2   8.500000    6.500000
Ilya Voytov

2 Answers

This works for me after resetting to a default (unique) index, so that DataFrame.loc can select the correct weight values for each group:

df = df.reset_index()

df = (df.groupby('user')[['scoreA', 'scoreB']]
        .agg(lambda x: np.average(x, weights=df.loc[x.index, "weight"])))
print (df)
        scoreA    scoreB
user                    
1     4.142857  2.142857
2     8.500000  6.500000
jezrael
  • ah, I was missing the reset_index(). Let me try on my dataset – Ilya Voytov Feb 22 '23 at 14:13
  • that's quite an ugly workaround as it relies on the external state of the dataframe (side effect), the provided solution at the end of the question is much cleaner – mozway Feb 22 '23 at 14:14
  • that might be why i can't get it to work using my separate `weights` and `scores` variables. Once you separate them, the lambda doesn't work. I do other processing on the scores before doing the averaging so it's neater to have them be separate. – Ilya Voytov Feb 22 '23 at 14:24
  • Then you can achieve it with `scores.groupby('user').apply(lambda g: np.average(g[cols], weights=weights.loc[g.name], axis=0))` without needing to `reset_index`. You then have to convert back to 2D as I showed in my answer – mozway Feb 22 '23 at 14:28

Your issue is that you try to provide the weights externally to numpy and numpy cannot perform index alignment.
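
To illustrate the mismatch, here is a minimal sketch using the question's data (the names `group_col`, `mismatch_raised` and `user1_avg` are made up for this example). Inside `agg`, the lambda receives one column of one group, three values here, while the external `weights` still holds all six rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))
scores = df[['scoreA', 'scoreB']]
weights = df['weight']

# What the lambda sees for user 1, column scoreA: a 3-element Series.
group_col = scores.loc[1, 'scoreA']
print(len(group_col), len(weights))  # 3 6

# NumPy only compares shapes, so the full-length weights are rejected.
mismatch_raised = False
try:
    np.average(group_col, weights=weights)
except (TypeError, ValueError):
    mismatch_raised = True

# Selecting the matching weights by label makes the shapes agree.
user1_avg = np.average(group_col, weights=weights.loc[1])
```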

The solution that you provided at the end of your question is likely the best workaround.
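
As a sketch of how that multiply, sum, divide approach could be reused with separate `scores` and `weights` objects (the helper name `weighted_group_mean` and its `level` parameter are hypothetical, and it assumes both objects share the same `user` index):

```python
import numpy as np
import pandas as pd

def weighted_group_mean(scores, weights, level='user'):
    """Weighted mean per group; scores and weights must share one index."""
    # Numerator: sum of weight * score within each group.
    weighted_sum = scores.mul(weights, axis=0).groupby(level=level).sum()
    # Denominator: sum of weights within each group.
    weight_sum = weights.groupby(level=level).sum()
    return weighted_sum.div(weight_sum, axis=0)

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))

out = weighted_group_mean(df[['scoreA', 'scoreB']], df['weight'])
print(out)
```

Because the division happens after the group sums, no Python-level function runs per group, which is one reason this form tends to scale better than `apply` on large frames.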

For a more generic approach (assuming a function that couldn't be split in two steps), you would need to use groupby.apply to access all the columns of the group at once:

def w_avg(g, cols):
    return pd.Series(np.average(g[cols], weights=g['weight'], axis=0),
                     index=cols)

df.groupby('user').apply(w_avg, cols=['scoreA', 'scoreB'])

Or:

cols = ['scoreA', 'scoreB']
s = (df.groupby('user')
       .apply(lambda g: np.average(g[cols], weights=g['weight'], axis=0))
     )
out = pd.DataFrame(s.to_list(), index=s.index, columns=cols)

Output:

        scoreA    scoreB
user                    
1     4.142857  2.142857
2     8.500000  6.500000
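
If the scores and weights are kept as separate objects, as in the question, the same `apply` idea can be adapted along the lines suggested in the comments, by looking up each group's weights through `g.name`. This is a sketch that assumes `weights` is indexed by the same `user` labels as `scores`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight': [2, 1, 0.5, 0.5, 1, 0.5],
    'scoreA': [3, 5, 7, 8, 9, 8],
    'scoreB': [1, 3, 5, 6, 7, 6],
}, index=pd.Index([1, 1, 1, 2, 2, 2], name='user'))
scores = df[['scoreA', 'scoreB']]
weights = df['weight']

cols = ['scoreA', 'scoreB']
# g.name is the group key (the user), so weights.loc[g.name] picks
# exactly the rows of weights that belong to the current group.
s = scores.groupby(level='user').apply(
    lambda g: np.average(g[cols], weights=weights.loc[g.name], axis=0)
)
# Each element of s is an array of per-column averages; rebuild a 2D frame.
out = pd.DataFrame(s.to_list(), index=s.index, columns=cols)
```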
mozway