How to replace the outliers with the 95th and 5th percentile in Python?

Question

I am trying to apply outlier treatment to my time series data: I want to replace values above the 95th percentile with the 95th percentile value, and values below the 5th percentile with the 5th percentile value. I have written some code, but I am unable to get the desired result.

I am trying to create an OutlierTreatment function that uses a helper function called Cut. The code is given below:

def outliertreatment(df,high_limit,low_limit):
    df_temp=df['y'].apply(cut,high_limit,low_limit, extra_kw=1)
    return df_temp
def cut(column,high_limit,low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
            np.percentile(column, low_limit)]
    return np.select(conds,choices,column)

I want to pass the dataframe to the OutlierTreatment function, with 95 as high_limit and 5 as low_limit. How can I achieve the desired result?

Be careful with how you set your 95th and 5th percentile values: if you are iterating, these limits will change whenever the values that exceed the 95th percentile change. Other than that, simply define a function that replaces a value with the fixed 95th percentile if it is higher, and with the fixed 5th percentile if it is lower. — Celius Stingher, Aug 21 '19 at 13:43
Possible solution: https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list — Anthony R, Aug 21 '19 at 13:43

score 11 · Accepted Answer · edited Aug 21 '19 at 16:24

I'm not sure whether this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It sets values outside the boundaries to the boundary values. You can read more in the documentation.

import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))  # returns a new Series; data is unchanged
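
To connect this to the function signature in the question, here is a minimal sketch of an outlier treatment function built on clip. The `column` parameter, its default of `'y'`, and the sample data are assumptions made for illustration; the limits are given on a 0 to 100 scale as in the question and converted to the 0 to 1 scale that quantile expects. Both limits are computed once, before any value is replaced, so they stay fixed.

```python
import numpy as np
import pandas as pd


def outlier_treatment(df, high_limit=95, low_limit=5, column="y"):
    """Return df[column] clipped to its low/high percentiles (0-100 scale)."""
    # Compute both limits once, before any value is replaced.
    lower = df[column].quantile(low_limit / 100)
    upper = df[column].quantile(high_limit / 100)
    return df[column].clip(lower=lower, upper=upper)


# Hypothetical data: a random series with two injected outliers.
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=200)})
df.loc[[10, 50], "y"] = [25.0, -30.0]

treated = outlier_treatment(df, high_limit=95, low_limit=5)
```

The function returns a new Series and leaves `df` untouched, so the result can be assigned back with `df['y'] = treated` if that is what you want.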

edited Aug 21 '19 at 16:24

Jaroslav Bezděk

answered Aug 21 '19 at 13:43

Mark Wang

Suhas_Pote · Answer 2 · 2021-07-17T18:28:52.130

If your data contains multiple columns

For an individual column:

p_05 = df['sales'].quantile(0.05) # 5th percentile
p_95 = df['sales'].quantile(0.95) # 95th percentile

# assign the result back; inplace=True on a selected column may not update df
df['sales'] = df['sales'].clip(p_05, p_95)

For more than one numerical column:

num_col = df.select_dtypes(include=['int64','float64']).columns.tolist()

# or you can create a custom list of numerical columns

df[num_col] = df[num_col].apply(lambda x: x.clip(*x.quantile([0.05, 0.95])))
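
As an alternative to the per-column lambda, the same result can probably be reached with a single DataFrame-level clip call. The sketch below assumes a hypothetical frame with two float columns and one text column; the column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two numeric columns and one text column.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sales": rng.normal(100, 20, size=300),
    "returns": rng.normal(5, 2, size=300),
    "region": ["north", "south", "east"] * 100,
})

num_col = df.select_dtypes(include="number").columns.tolist()

# One 5th and one 95th percentile per column, computed before clipping.
lower = df[num_col].quantile(0.05)
upper = df[num_col].quantile(0.95)

# axis=1 aligns the per-column limits with the column labels.
clipped = df[num_col].clip(lower=lower, upper=upper, axis=1)
```

Here `lower` and `upper` are Series indexed by column name, so each column is clipped to its own limits. Non-numeric columns such as `region` are left out of `clipped`.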

Bonus:

To check for outliers using box plots:

import matplotlib.pyplot as plt

for x in num_col:
    plt.figure()           # start a new figure for each column
    df.boxplot(column=x)   # draw that column's box plot on it
plt.show()
