1

For a pandas.Series, I know how to remove outliers. With something like this:

x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]

I would like to do thins over the different Series/columns of a DataFrame

import string
import random

df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])

df.set_index('index')

I usually do stuff like

df = df.groupby('index').mean()

However, in this case, it would also average the outliers, which I would like to ignore from averaging.

Notice that the random data makes than the outliers are in different positions in each column. So an outlier should be ignored only in that column/Series

The result should be a DataFrame, with 26 lines (one for each letter of index), and 3 columns, with the values averaged without outliers

I can loop over the columns of df and do the first block of code. But is there a nicer way?

Suggestion are welcome. Any approach is accepted

phollox
  • 323
  • 3
  • 13

1 Answers1

2

Use the following code.

def mean_without_outlier(x): # x: series
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)
Gilseung Ahn
  • 2,598
  • 1
  • 4
  • 11
  • Awesome. Thanks for pointing out the last line, which is what I was missing. I actually did `df.groupby("index").agg(mean_without_outlier)` which computes the filtered mean for ALL columns/Series – phollox Nov 02 '21 at 09:11