
I am trying to apply an outlier treatment to my time series data: values above the 95th percentile should be replaced with the 95th percentile value, and values below the 5th percentile with the 5th percentile value. I have written some code, but it does not produce the desired result.

I am trying to create an outliertreatment function that uses a helper function called cut. The code is given below:

def outliertreatment(df,high_limit,low_limit):
    df_temp=df['y'].apply(cut,high_limit,low_limit, extra_kw=1)
    return df_temp
def cut(column,high_limit,low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
            np.percentile(column, low_limit)]
    return np.select(conds,choices,column)  

I expect to pass the dataframe, 95 as high_limit, and 5 as low_limit to the outliertreatment function. How can I achieve the desired result?

Anwarvic
Ashesh Das
  • Be careful with how you set your 95th and 5th percentile values, because if you are iterating, these limits will change whenever the values that surpass the 95th change. Other than that, why not simply define a function that replaces a value with the fixed 95th percentile if it is higher than that, and with the 5th percentile if it is lower? – Celius Stingher Aug 21 '19 at 13:43
  • Possible solution: https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list – Anthony R Aug 21 '19 at 13:43

2 Answers


I'm not sure whether this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It sets values outside the boundaries to the boundary values. You can read more in the documentation.

import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))
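To tie this back to the question, here is a minimal sketch of how the clip call could be wrapped in a function with the signature the asker wanted. Series.apply calls its function on each element rather than on the whole column, so the percentiles are computed on the Series directly instead. The name outlier_treatment, the column parameter, and the sample data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def outlier_treatment(df, high_limit, low_limit, column="y"):
    """Clip one column to its [low_limit, high_limit] percentiles (0-100 scale)."""
    series = df[column]
    # Compute both thresholds once, from the untouched data.
    lower = series.quantile(low_limit / 100)
    upper = series.quantile(high_limit / 100)
    return series.clip(lower=lower, upper=upper)


# Hypothetical sample data: the values 0..100 in a column named "y".
df = pd.DataFrame({"y": np.arange(101, dtype=float)})
treated = outlier_treatment(df, high_limit=95, low_limit=5)
```

Because both thresholds are taken before any value is replaced, they stay fixed, which addresses the concern raised in the comments about limits shifting during iteration.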
Jaroslav Bezděk
Mark Wang

If your data contains multiple columns, you can clip a single column or several at once.

For an individual column:

p_05 = df['sales'].quantile(0.05) # 5th percentile
p_95 = df['sales'].quantile(0.95) # 95th percentile

# assign the result back; inplace=True on a column selection may not update df
df['sales'] = df['sales'].clip(lower=p_05, upper=p_95)

For more than one numerical column:

num_col = df.select_dtypes(include=['int64','float64']).columns.tolist()

# or you can create a custom list of numerical columns

df[num_col] = df[num_col].apply(lambda x: x.clip(*x.quantile([0.05, 0.95])))
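As an alternative to the lambda, the same per-column clipping can probably be done without apply by passing a Series of thresholds to DataFrame.clip with axis=1. The sketch below assumes a hypothetical frame with two numeric columns and one text column; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame with two numeric columns and one text column.
df = pd.DataFrame({
    "sales": np.arange(101, dtype=float),
    "units": np.arange(0, 202, 2, dtype=float),
    "region": ["north"] * 101,
})

num_col = df.select_dtypes(include="number").columns.tolist()

# One threshold per column, computed before any value is changed.
lower = df[num_col].quantile(0.05)
upper = df[num_col].quantile(0.95)

# axis=1 aligns each Series of thresholds with the column labels.
df[num_col] = df[num_col].clip(lower=lower, upper=upper, axis=1)
```

Non-numeric columns such as region are left untouched because only the columns in num_col are selected and reassigned.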

Bonus:

To check for outliers using box plots:

import matplotlib.pyplot as plt

for x in num_col:
    plt.figure()          # start a new figure for each column
    df.boxplot(column=x)  # draw the box plot for this column only
plt.show()
Suhas_Pote