Pandas - Replace outliers with groupby mean

Question

I have a pandas dataframe which I would like to split into groups, calculate the mean and standard deviation of each group, and then replace all outliers with the mean of their group. A value counts as an outlier if it is more than 3 standard deviations away from its group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought that the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

but get the following error:

NameError: name 'x' is not defined

I have also tried defining a transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

and then calling it like so:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

but I get a different error:

KeyError: 0

Finally, I resorted to creating a separate column altogether:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]

but this hasn't worked either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advice much appreciated.

elyase · Accepted Answer · 2014-12-24T15:15:35.993

8

Try this:

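def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    # mask() returns a new Series instead of assigning into the group in place
    return group.mask(outliers, mean)        # or "group[~outliers].mean()" to keep outliers out of the mean

# select column 'b' and assign the result back; transform alone does not modify df
df['b'] = df.groupby('a')['b'].transform(replace)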

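Note: With the 3*std threshold, the 100 in your last group is not treated as an outlier and stays in the result. The mean of that group is 27.5 and its standard deviation is 48.33, so 100 lies only 1.5 standard deviations from the mean. If you want to replace it, use a tighter threshold, for example 1*std instead of 3*std.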

edited Dec 24 '14 at 15:15

answered Dec 24 '14 at 15:10

But wouldn't that mean be affected by the outlier? – jayarjo Sep 02 '17 at 01:14

score 6 · Answer 2 · answered Jan 06 '19 at 13:40

It would be more appropriate to first remove the outliers and then calculate the group means used for replacement. If the replacement mean is calculated with the outliers included, it is skewed by those outliers.
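One way this could look in practice is sketched below: outliers are flagged against the ordinary group mean and standard deviation, and the replacement value is then the mean of the remaining values in each group. The helper `replace_outliers`, its parameters, and the threshold of 1.4 are assumptions made for illustration (with the default of 3 nothing in the question's sample data is flagged); none of them come from the original answers.

```python
import pandas as pd

def replace_outliers(df, key, col, n_std=3.0):
    """Return df[col] with outliers replaced by the mean of the
    non-outlier values in the same group (hypothetical helper)."""
    grouped = df.groupby(key)[col]
    mean = grouped.transform('mean')
    std = grouped.transform('std')

    # Flag values more than n_std group standard deviations from the group mean.
    outliers = (df[col] - mean).abs() > n_std * std

    # Blank out the outliers (NaN), then recompute the group means;
    # 'mean' skips NaN, so the outliers no longer influence the result.
    clean_mean = df[col].mask(outliers).groupby(df[key]).transform('mean')

    return df[col].mask(outliers, clean_mean)

# Sample frame from the question.
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

# 1.4 is an arbitrary illustrative threshold that flags only the 100.0 here.
result = replace_outliers(df, 'a', 'b', n_std=1.4)
print(result)
```

Here the 100.0 is replaced by the mean of the other three values in group B rather than by 27.5, the mean that includes the outlier itself.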

Andrius Vabalas

score 0 · Answer 3 · answered Feb 01 '19 at 07:24

Hope this helps:

Step 1, remove outliers (adapted from the question "pandas group by remove outliers"):

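def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

# 'b' is the value column in the question's frame; transform keeps the
# original row index, so the boolean mask aligns with df
df = df[~df.groupby('a')['b'].transform(is_outlier)]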

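Step 2, replace outliers (adapted from elyase's answer):

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    # mask() returns a new Series instead of assigning into the group in place
    return group.mask(outliers, mean)        # or "group[~outliers].mean()" to keep outliers out of the mean

# select column 'b' and assign the result back; transform alone does not modify df
df['b'] = df.groupby('a')['b'].transform(replace)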
