I have a dataset and need to remove outliers that lie more than 3 standard deviations from the mean in each numerical column. Any row that contains such an outlier should then be dropped.
What have you tried so far based on your own research, and what went wrong with your attempts? Please [edit] your question to include a [mcve] so that we can provide _specific_ help – G. Anderson Mar 11 '22 at 17:04
You can use Boolean indexing: `df[np.abs(df['col_name']-df['col_name'].mean())<=(3*df['col_name'].std())]`. Also have a look at this [question](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe) – Yolao_21 Mar 11 '22 at 17:07
1 Answer
Here is an example that does this for a single column:
```python
import pandas as pd

# 110 values clustered around 100, plus 10 values far from the mean
df = pd.DataFrame([95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] * 10 + [1, 11, 21, 31, 41, 161, 171, 181, 191, 201])
print(len(df[0]))
print(df[0].std())
print(df[0].mean())
# Keep only the rows within 3 standard deviations of the column mean
df_filtered = df[(df[0] - df[0].mean()).abs() < 3 * df[0].std()]
print(len(df_filtered[0]))
```
Output:
```
120
23.74747517170641
100.08333333333333
114
```
The filtered DataFrame has 6 fewer rows than the original because 6 values (1, 11, 21, 181, 191 and 201) lie more than 3 standard deviations from the mean.
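Since the question asks about every numerical column, the same idea can probably be extended along the following lines. This is only a sketch: the column names `a`, `b` and `label`, the random data, and the two planted outliers are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two numeric columns and one text column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(100, 5, 200),
    "b": rng.normal(0, 1, 200),
    "label": ["x"] * 200,
})
# Plant one obvious outlier in each numeric column
df.loc[0, "a"] = 1000.0
df.loc[1, "b"] = -500.0

# Restrict the outlier test to numeric columns only
numeric = df.select_dtypes(include="number")
# Per-column z-scores; pandas std() defaults to the sample standard deviation (ddof=1)
z = (numeric - numeric.mean()) / numeric.std()
# Keep a row only if every numeric value is within 3 standard deviations
mask = (z.abs() <= 3).all(axis=1)
df_filtered = df[mask]
print(len(df), len(df_filtered))
```

Non-numeric columns such as `label` are ignored by the test but kept in the result. Note that a row with a missing value in any numeric column is also dropped by this mask, because a comparison against `NaN` evaluates to `False`; if that is not what you want, fill or handle missing values first.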

constantstranger