Domain: Python & Pandas

I have a time series data frame which has the total number of customers for each day for the last 10 years.

The columns are:

  • date
  • total customers

There are outliers in my total customers column.

I want to reset every outlier that lies more than 3 standard deviations above the mean to the value defined by the formula below.

Replacement value for an outlier above 3 SD = mean + 3 * SD

zosh

1 Answer

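You could use the .clip(upper=...) method to cap values in the total customers column at mean + 3*sd. (This answer originally used .clip_upper(), which was deprecated in pandas 0.24 and removed in pandas 1.0.)

m = df['total customers'].mean()
sd = df['total customers'].std()
# Cap anything above mean + 3 standard deviations at that threshold
df['total customers'] = df['total customers'].clip(upper=m + 3*sd)

See the pandas documentation for Series.clip.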
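
For anyone who wants to try the capping step in isolation, here is a minimal sketch. The data is made up for illustration (30 ordinary days plus one spike), and only the column names come from the question. It assumes a current pandas release, where Series.clip(upper=...) is the available call.

```python
import pandas as pd

# Hypothetical data: 30 ordinary days and one extreme spike on the last day.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=31, freq="D"),
    "total customers": [100] * 30 + [1000],
})

# Compute the threshold once, from the original (unclipped) values.
m = df["total customers"].mean()
sd = df["total customers"].std()
upper = m + 3 * sd

# Values above the threshold are replaced by the threshold; no rows are dropped.
df["total customers"] = df["total customers"].clip(upper=upper)

print(df.tail(3))
```

Note that the mean and standard deviation are themselves pulled upward by the outliers, so the threshold depends on the very values being capped.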

Craig
  • Thank you so much for your reply – zosh Nov 21 '18 at 21:52
  • This function does exactly what you are asking for. It replaces any values that exceed the 'clip' value with the 'clip' value. It does not remove anything. – Craig Nov 21 '18 at 21:54
  • Hey Craig, sorry to bother you again: What if I wanted to completely remove all the rows with outliers? – zosh Nov 21 '18 at 22:13
  • @zosh - That's a new question, but the answer is to use boolean indexing as described in https://stackoverflow.com/a/23200666/7517724 – Craig Nov 22 '18 at 00:13
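
For the follow-up about dropping the outlier rows entirely, the boolean indexing approach mentioned in the last comment might look like the sketch below. The sample data and the name `filtered` are hypothetical, and the threshold is the same mean + 3*sd used in the answer.

```python
import pandas as pd

# Hypothetical data: 30 ordinary days and one extreme spike on the last day.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=31, freq="D"),
    "total customers": [100] * 30 + [1000],
})

m = df["total customers"].mean()
sd = df["total customers"].std()
upper = m + 3 * sd

# Boolean mask: True for rows at or below the threshold, False for outliers.
mask = df["total customers"] <= upper

# Indexing with the mask keeps only the True rows, so outlier rows are removed.
filtered = df[mask]

print(len(df), len(filtered))
```

Unlike clipping, this leaves gaps in the daily series, which may matter if later steps expect one row per date.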