I have a DataFrame:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5, 5, 18), 'values2': np.random.uniform(-5, 5, 18)}
df = pd.DataFrame(data=d)

I can take the mean of each fruit group like this:

df.groupby('fruit').mean()

However, for each fruit group, I'd like to take the mean of the N largest values, ranked by absolute value.

So for example, if my values were as follows and N=3:

[ 0.7578507 ,  3.81178045, -4.04810913,  3.08887538,  2.87999752, 4.65670954]

The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 ≈ 1.47

Edit: to clarify, the sign is preserved in the outcome. For example, with -20.04810913 in place of -4.04810913:

(4.65670954 + -20.04810913 + 3.81178045) / 3 ≈ -3.860

Cactus Philosopher

1 Answer

Updating with a new approach that I think is simpler. I was avoiding apply like the plague, but maybe this is one of its more acceptable uses. It also fixes my earlier mistake: you want the mean of the original (signed) values, with the ranking done by absolute value:

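def foo(d):
    # d is one group's 'values' Series: find the labels of the 3 largest
    # absolute values, then average the original (signed) values at those labels
    return d.loc[d.abs().nlargest(3).index].mean()

out = df.groupby('fruit')['values'].apply(foo)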

So you index each group by the positions of its 3 largest absolute values, then take the mean of the original values there.
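
If N needs to vary, or the same calculation is wanted for both value columns, the idea could be generalized roughly as below. This is only a sketch: the helper name mean_of_top_n_by_abs, the parameter n, and the seeded generator are my own illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd

def mean_of_top_n_by_abs(s, n=3):
    """Mean of the n entries of Series s with the largest absolute values (signs kept)."""
    top_labels = s.abs().nlargest(n).index  # labels of the n largest |values|
    return s.loc[top_labels].mean()         # average the original signed values

# Same shape of data as in the question, seeded so the run is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'fruit': ['apple', 'pear', 'peach'] * 6,
    'values': rng.uniform(-5, 5, 18),
    'values2': rng.uniform(-5, 5, 18),
})

# One groupby/apply per column; extra keyword arguments are forwarded to the helper
out = pd.DataFrame({
    col: df.groupby('fruit')[col].apply(mean_of_top_n_by_abs, n=2)
    for col in ['values', 'values2']
})
print(out)
```

The result has one row per fruit and one column per value column. Applying per column keeps the helper working on a plain Series, which is what nlargest expects here.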

And for the record, my original code, which was slower and also incorrect (it averages the absolute values, so the sign is lost), was:

df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()
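
A quick check on the six numbers from the question shows how the two approaches differ. This is a small sketch on a single Series with no grouping; the variable names are just for illustration.

```python
import pandas as pd

s = pd.Series([0.7578507, 3.81178045, -4.04810913,
               3.08887538, 2.87999752, 4.65670954])

# Rank by absolute value, then average the original signed values
signed_mean = s.loc[s.abs().nlargest(3).index].mean()

# Take absolute values first, then average: the sign of -4.048... is lost
abs_mean = s.abs().nlargest(3).mean()

print(round(signed_mean, 3))  # 1.473
print(round(abs_mean, 3))     # 4.172
```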
Tom