pandas dataframe how to replace extreme outliers for all columns

Question

I have a pandas dataframe with some very extreme value - more than 5 std. I want to replace, per column, each value that is more than 5 std with the max other value. For example,

Will become:

What is the best way to do it without a for loop over the columns?

does this help: https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe — piterbarg, Jan 12 '22 at 13:39
In this particular case, the last row is "only" ~2.3 the std of the column, you should better explain your criterion for exclusion ;) — mozway, Jan 12 '22 at 13:41
@piterbarg in the link they find the outlier but how I replace its value, instead of filter it? — Cranjis, Jan 12 '22 at 13:44
@mozway not sure what u meant? it is more than 2.3std? but anyway I changed — Cranjis, Jan 12 '22 at 13:45
doesn't change much, the issue is that a single extreme value can bias the deviation, please provide the calculations for the exclusion, you'll realize what I mean — mozway, Jan 12 '22 at 13:48
@mozway got it, good point. So what should be the best way to detect extreme outlier w/o be biased by the outlier itself? — Cranjis, Jan 12 '22 at 13:56
that's the thing, there is no "best way", there are [dozen of ways](https://en.wikipedia.org/wiki/Anomaly_detection), this is an entire field so you need to test existing methods and see if it meets your expectations with your use case — mozway, Jan 12 '22 at 14:00

wwnde · Accepted Answer · 2022-01-12T14:10:47.740

1

s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame

    A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0

edited Jan 12 '22 at 14:10

answered Jan 12 '22 at 13:59

wwnde

26,119
6
18
32

1

I do not think the condition is quite right, it says x - standard_deviation(x) > 5 is an outlier? works in this case but as a general rule for outlier detection it is quite poor. Eg for a column [100,...,100] it will mark all elements as outliers – piterbarg Jan 12 '22 at 14:26
@piterbarg and you may be right. However, I assume OP knows want he wants. I do not not what dataset he has. I answered the question literally. I didn't think I needed to read the question beyond the info he has provided. With your comment, he can edit the question and I can re do my answer if needed – wwnde Jan 12 '22 at 14:41
1

I agree that your solution implements OPs request taken literally, but probably not what he really wants – piterbarg Jan 12 '22 at 14:54

Grinjero · Answer 2 · 2022-01-12T14:02:55.850

Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range

def calc_zscore(col):
   return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5

After that it's up to you to fill the values marked with the boolean mask.

df[outlier_mask] = something

score 0 · Answer 3 · answered Jan 12 '22 at 13:57

Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do

q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df

this fixes column A:


    A   B
0   1   2
1   1   6
2   2   8
3   1   115
4   2   1

do the same for B

pandas dataframe how to replace extreme outliers for all columns

3 Answers3