How to identify and remove outliers from a dataframe that contains both numerical and categorical values?

Question

I have a dataset and need to remove the outliers that lie more than 3 standard deviations from the mean in each numerical column. Any row that contains such an outlier should then be dropped.

What have you tried so far based on your own research, and what went wrong with your attempts? Please [edit] your question to include a [mcve] so that we can provide _specific_ help — G. Anderson, Mar 11 '22 at 17:04
You can use Boolean indexing: `df[np.abs(df['col_name']-df['col_name'].mean())<=(3*df['col_name'].std())]` . Also have a look at this [question](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe) — Yolao_21, Mar 11 '22 at 17:07

score 1 · Answer 1 · answered Mar 11 '22 at 17:19

Here is an example of code that will do what your question asks:

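import pandas as pd

# 110 values clustered around 100, plus 10 values spread further from the mean
df = pd.DataFrame([95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] * 10 + [1, 11, 21, 31, 41, 161, 171, 181, 191, 201])
col = df[0]
print(len(col))    # number of rows before filtering
print(col.std())   # sample standard deviation of the column
print(col.mean())  # mean of the column
# Keep only the rows within 3 standard deviations of the column mean
df_filtered = df[(col - col.mean()).abs() < 3 * col.std()]
print(len(df_filtered))  # number of rows after filtering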

Output:

120
23.74747517170641
100.08333333333333
114

The filtered dataframe has 6 fewer rows than the original, because 6 values (1, 11, 21, 181, 191 and 201) lie more than 3 standard deviations from the mean.
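The example above filters a single column. For a dataframe that mixes numerical and categorical columns, as in the question, one possible approach is to compute the check on the numeric columns only and drop any row that fails it in at least one of them. The following is a minimal sketch under that assumption; the column names `height`, `weight` and `group` and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: two numeric columns and one categorical column.
# The last row holds an extreme "height" value.
df = pd.DataFrame({
    "height": [170.0] * 30 + [1000.0],
    "weight": [60.0, 70.0, 80.0] * 10 + [70.0],
    "group": ["a", "b"] * 15 + ["a"],
})

# Select only the numeric columns; the categorical column is left out of the check.
numeric = df.select_dtypes(include="number")

# Distance of every value from its column mean, in units of that column's
# sample standard deviation.
z = (numeric - numeric.mean()) / numeric.std()

# Keep a row only if all of its numeric values are within 3 standard deviations.
# Comparisons with NaN are False, so rows with missing numeric values are dropped too.
mask = (z.abs() <= 3).all(axis=1)
df_filtered = df[mask]

print(len(df), len(df_filtered))
```

Because the mask is built from the numeric columns but applied to the whole dataframe, the categorical columns are carried through unchanged for the rows that are kept. Note that a numeric column with zero variance would produce NaN scores and remove every row, so such columns may need to be excluded first.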
