I have a dataset and need to remove outliers that lie more than 3 standard deviations from the mean in each numerical column. Any row that contains such an outlier should then be dropped.

Vish10
  • What have you tried so far based on your own research, and what went wrong with your attempts? Please edit your question to include a minimal, complete, and verifiable example so that we can provide _specific_ help – G. Anderson Mar 11 '22 at 17:04
  • You can use Boolean indexing: `df[np.abs(df['col_name']-df['col_name'].mean())<=(3*df['col_name'].std())]`. Also have a look at this [question](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe) – Yolao_21 Mar 11 '22 at 17:07

1 Answer

Here is an example that filters a single column (column `0`) the way your question asks:

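import pandas as pd

# 110 values clustered around 100, plus 10 values spread far from the center
df = pd.DataFrame([95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] * 10 + [1, 11, 21, 31, 41, 161, 171, 181, 191, 201])
print(len(df[0]))      # number of rows before filtering
print(df.std()[0])     # sample standard deviation of column 0
print(df[0].mean())    # mean of column 0
# Keep only the rows whose value is within 3 standard deviations of the mean
df_filtered = df[(df[0] - df[0].mean()).abs() < 3 * df.std()[0]]
print(len(df_filtered[0]))  # number of rows after filtering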

Output:

120
23.74747517170641
100.08333333333333
114

The filtered dataframe has 6 fewer rows than the original because 6 values (1, 11, 21, 181, 191 and 201) lie more than 3 standard deviations (about 71.2) from the mean. The value 171 is just inside the threshold, so it is kept.
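
Since the question asks about every numerical column, the same test can be applied to all of them at once. The following is only a minimal sketch under some assumptions: the helper name `drop_outlier_rows`, its `n_std` parameter and the column names `a`, `b` and `label` are made up for illustration, missing values are treated as non-outliers, and the comparison uses `<=` so that a constant column (standard deviation 0) does not cause every row to be dropped.

```python
import pandas as pd


def drop_outlier_rows(df, n_std=3.0):
    """Drop rows where any numeric column is more than n_std
    standard deviations away from that column's mean.
    (Hypothetical helper, not part of pandas.)"""
    numeric = df.select_dtypes(include="number")
    # Absolute distance of every value from its own column mean
    deviations = (numeric - numeric.mean()).abs()
    # Column-wise comparison; the Series of stds aligns on column labels
    within = deviations <= n_std * numeric.std()
    # Assumption: a missing value does not count as an outlier
    keep = (within | numeric.isna()).all(axis=1)
    return df[keep]


# Column "a" reuses the data from the example above; "b" has no outliers;
# "label" is non-numeric and is ignored by the test.
a = [95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] * 10 \
    + [1, 11, 21, 31, 41, 161, 171, 181, 191, 201]
data = pd.DataFrame({
    "a": a,
    "b": [i % 7 for i in range(len(a))],
    "label": ["row"] * len(a),
})

filtered = drop_outlier_rows(data)
print(len(data), len(filtered))  # 120 114
```

A row is dropped as soon as any one of its numeric columns fails the test, which is what `.all(axis=1)` expresses. Non-numeric columns are left out of the test but are still present in the result.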

constantstranger