1

I have a pandas dataframe with some very extreme value - more than 5 std. I want to replace, per column, each value that is more than 5 std with the max other value. For example,

df = A B
     1 2
     1 6
     2 8
     1 115
     191 1

Will become:

df = A B
     1 2
     1 6
     2 8
     1 8
     2 1

What is the best way to do it without a for loop over the columns?

Cranjis
  • 1,590
  • 8
  • 31
  • 64
  • does this help: https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe – piterbarg Jan 12 '22 at 13:39
  • In this particular case, the last row is "only" ~2.3 the std of the column, you should better explain your criterion for exclusion ;) – mozway Jan 12 '22 at 13:41
  • @piterbarg in the link they find the outlier but how I replace its value, instead of filter it? – Cranjis Jan 12 '22 at 13:44
  • @mozway not sure what u meant? it is more than 2.3std? but anyway I changed – Cranjis Jan 12 '22 at 13:45
  • doesn't change much, the issue is that a single extreme value can bias the deviation, please provide the calculations for the exclusion, you'll realize what I mean – mozway Jan 12 '22 at 13:48
  • @mozway got it, good point. So what should be the best way to detect extreme outlier w/o be biased by the outlier itself? – Cranjis Jan 12 '22 at 13:56
  • that's the thing, there is no "best way", there are [dozen of ways](https://en.wikipedia.org/wiki/Anomaly_detection), this is an entire field so you need to test existing methods and see if it meets your expectations with your use case – mozway Jan 12 '22 at 14:00

3 Answers3

1
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame

    A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • 1
    I do not think the condition is quite right, it says x - standard_deviation(x) > 5 is an outlier? works in this case but as a general rule for outlier detection it is quite poor. Eg for a column [100,...,100] it will mark all elements as outliers – piterbarg Jan 12 '22 at 14:26
  • @piterbarg and you may be right. However, I assume OP knows want he wants. I do not not what dataset he has. I answered the question literally. I didn't think I needed to read the question beyond the info he has provided. With your comment, he can edit the question and I can re do my answer if needed – wwnde Jan 12 '22 at 14:41
  • 1
    I agree that your solution implements OPs request taken literally, but probably not what he really wants – piterbarg Jan 12 '22 at 14:54
0

Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range

def calc_zscore(col):
   return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5

After that it's up to you to fill the values marked with the boolean mask.

df[outlier_mask] = something
Grinjero
  • 436
  • 2
  • 7
0

Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do

q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df

this fixes column A:


    A   B
0   1   2
1   1   6
2   2   8
3   1   115
4   2   1

do the same for B

piterbarg
  • 8,089
  • 2
  • 6
  • 22