groupby operation in pandas.DataFrame without outliers

Question

For a pandas.Series, I know how to remove outliers. With something like this:

x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]

I would like to do thins over the different Series/columns of a DataFrame

import string
import random

df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])

df.set_index('index')

I usually do stuff like

df = df.groupby('index').mean()

However, in this case, it would also average the outliers, which I would like to ignore from averaging.

Notice that the random data makes than the outliers are in different positions in each column. So an outlier should be ignored only in that column/Series

The result should be a DataFrame, with 26 lines (one for each letter of index), and 3 columns, with the values averaged without outliers

I can loop over the columns of df and do the first block of code. But is there a nicer way?

Suggestion are welcome. Any approach is accepted

What you want is to calculate mean of each column without outliers. Am I right? — Gilseung Ahn, Nov 02 '21 at 01:35
Exactly. Why it was so hard for me to word it in a simple way? — phollox, Nov 02 '21 at 08:29

score 2 · Accepted Answer · answered Nov 02 '21 at 01:42

2

Use the following code.

def mean_without_outlier(x): # x: series
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)

answered Nov 02 '21 at 01:42

Gilseung Ahn

2,598
1
4
11

Awesome. Thanks for pointing out the last line, which is what I was missing. I actually did `df.groupby("index").agg(mean_without_outlier)` which computes the filtered mean for ALL columns/Series – phollox Nov 02 '21 at 09:11

groupby operation in pandas.DataFrame without outliers

1 Answers1