Removing outliers in groups with standard deviation in Pandas?

Question

I have a Pandas dataframe from which I am trying to remove outliers on a group-by-group basis. A row in a group is considered an outlier if its value in a column falls outside the range

[group_mean - (group_std_dev * 3), group_mean + (group_std_dev * 3)]

where group_mean is the average value of the column in the group, and group_std_dev is the standard deviation of the column for the group. I tried the following Pandas chain

df.groupby(by='group').apply(lambda x: x[(x['col'].mean() - (x['col'].std() * 3)) < x['col'] < (x['col'].mean() + (x['col'].std() * 3))])

but it does not appear to work, as Pandas throws the following error for the comparison inside apply

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The error does not appear to make much sense to me because the comparison should convert to a Series of bools, which then is applied to the group x?

However, filtering by just the upper or lower bound does work, for example

df.groupby(by='group').apply(lambda x: x[(x['col'].mean() - (x['col'].std() * 3)) < x['col']])

but I am unsure of how to chain these together.

Does anyone have any ideas on how to simply & cleanly implement this? It doesn't appear very hard to me, but other posts on here have not yielded a satisfactory or working answer.

See this answer for the _The truth value of a Series is ambiguous_ error: https://stackoverflow.com/q/36921951/11301900 — AMC, Jan 22 '20 at 23:36

ansev · Accepted Answer · 2020-01-22T17:43:22.667

Use GroupBy.transform and Series.between; this is faster than apply:

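groups = df.groupby('group')['col']
# transform broadcasts each group's statistic back onto the rows of df
groups_mean = groups.transform('mean')
groups_std = groups.transform('std')
# pandas >= 1.3 expects a string here; older versions used inclusive=False
m = df['col'].between(groups_mean.sub(groups_std.mul(3)),
                      groups_mean.add(groups_std.mul(3)),
                      inclusive='neither')
print(m)
new_df = df.loc[m]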
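To make the transform approach easier to try out, here is a minimal self-contained sketch. The toy data and the helper name drop_group_outliers are assumptions made for illustration and are not part of the question. It also assumes pandas 1.3 or later, where inclusive accepts the string 'neither'.

```python
import pandas as pd

# Hypothetical toy data: group "a" holds one extreme value (100.0),
# group "b" holds evenly spread values with no outliers.
df = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "col": [10.0] * 19 + [100.0] + [float(v) for v in range(1, 21)],
})


def drop_group_outliers(frame, group_col, value_col, n_std=3.0):
    """Keep rows whose value lies strictly within n_std group standard deviations of the group mean."""
    grouped = frame.groupby(group_col)[value_col]
    # transform returns a Series aligned with frame's index
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    mask = frame[value_col].between(
        mean - n_std * std,
        mean + n_std * std,
        inclusive="neither",  # assumes pandas >= 1.3
    )
    return frame.loc[mask]


new_df = drop_group_outliers(df, "group", "col")
```

Two caveats. A group with a single row has an undefined (NaN) standard deviation, so its row fails the comparison and is dropped. With the default sample standard deviation, a group also needs at least 11 rows before any value can lie more than three standard deviations from the group mean.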

When should I want to use apply

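Your code with apply could be written as follows. x.mean() and x.std() return scalars, so plain arithmetic operators are used on them, and group_keys=False keeps the original index so that the mask aligns with df:

m = df.groupby(by='group', group_keys=False)['col'].apply(lambda x: x.lt(x.mean() + x.std() * 3) & x.gt(x.mean() - x.std() * 3))
new_df = df.loc[m]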
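As for why the original attempt raised the error: Python expands a chained comparison such as lo < s < hi into (lo < s) and (s < hi), and the and operator asks the first boolean Series for a single truth value, which pandas refuses to give. Combining the two comparisons with the element-wise & operator avoids this. The sketch below illustrates the difference on hypothetical toy data; the helper name within_n_std is likewise an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "col": [10.0] * 19 + [100.0] + [float(v) for v in range(1, 21)],
})
s = df["col"]

# `0 < s < 50` is evaluated as `(0 < s) and (s < 50)`; `and` calls bool()
# on a Series, which raises ValueError.
try:
    s[0 < s < 50]
    chained_raises = False
except ValueError:
    chained_raises = True

# Element-wise `&` combines the two boolean Series without calling bool().
# The parentheses are required because `&` binds tighter than `<`.
combined = (0 < s) & (s < 50)


def within_n_std(values, n_std=3.0):
    """Boolean mask: True where a value lies strictly within n_std standard deviations of the mean."""
    center, spread = values.mean(), values.std()
    return (values > center - n_std * spread) & (values < center + n_std * spread)


# group_keys=False keeps the original row labels so the mask aligns with df.
mask = df.groupby("group", group_keys=False)["col"].apply(within_n_std)
filtered = df.loc[mask]
```

The same & pattern works inside the lambda from the question; the transform version is usually preferable on large frames because it avoids calling a Python function once per group.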
