2

I want to calculate means by group, leaving out the value of the row itself.

import pandas as pd

d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)

I know how to return means by group:

df.groupby('col1').agg({'col2': 'mean'})

Which returns:

Out[247]: 
  col1  col2
1    a     4
3    a    -5
5    a     4

But what I want is mean by group, leaving out the row's value. E.g. for the first row:

df.query('col1 == "a"')[1:4].mean()

which returns:

Out[251]: 
col2    1.0
dtype: float64

Edit: Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.

ulima2_
  • 1,276
  • 1
  • 13
  • 23

2 Answers2

1

You could GroupBy col1and transform with the mean. Then subtract the value from a given row from the mean:

df['col2'] = df.groupby('col1').col2.transform('mean').sub(df.col2)
yatu
  • 86,083
  • 12
  • 84
  • 139
  • I think the OP wants the aggregated mean of the group rather than this – EdChum Apr 16 '19 at 13:48
  • But he wants to subtract the value from a given row, that would only make sense if the row itself is included in the solution right? – yatu Apr 16 '19 at 13:49
  • They want to calculate mean for the entire group, but ignore the first row for that group, at least this was my understanding, I could be wrong of course – EdChum Apr 16 '19 at 13:49
  • 1
    Hmm actually I think you're right, I'll delete my answer – EdChum Apr 16 '19 at 13:50
1

Thanks for all your input. I ended up using the approach linked to by @VnC.

Here's how I solved it:

import pandas as pd

d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)

group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on = 'col1')

df['other_sum'] = df['col2'] * df['mean'] - df['col2'] 
df['result'] = df['other_sum'] / (df['count']  - 1)

Check out the final result:

df['result']

Which prints:

Out: 
0    1.000000
1   -0.333333
2    2.666667
3   -0.333333
4    3.000000
5    3.000000
Name: result, dtype: float64

Edit: I previously had some trouble with column names, but I fixed it using this answer.

ulima2_
  • 1,276
  • 1
  • 13
  • 23