
What is an efficient way to remove outliers from a pandas DataFrame? I have a pandas DataFrame from which I need to remove outlier points.

 X1       X2              X3              X4
228.0   4474.91836735   3507.15151515   6625.0
77.0    468.0           582.0           549.0
160.0   9.0             3507.15151515   6625.0
36.0    250.0           3507.15151515   6625.0
52.0    3.0             3.0             223.0
78.0    998.0           3507.15151515   6625.0

I tried the solution in the link, but no points were removed. A sklearn implementation would also be useful.

kashf34Kashf

1 Answer


There are really two problems here: 1) outlier detection, and 2) removing them from a dataframe.

Problem #2 is fairly straightforward. You can use something like this once you have detected outliers in your columns:

# Keep only rows whose value lies strictly between the two thresholds
df = df[df['column_name'] < high_threshold]
df = df[df['column_name'] > low_threshold]
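As a runnable sketch of that filtering step on the data from the question: the column choice (X2) and the two threshold values are hypothetical, picked only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "X1": [228.0, 77.0, 160.0, 36.0, 52.0, 78.0],
    "X2": [4474.91836735, 468.0, 9.0, 250.0, 3.0, 998.0],
    "X3": [3507.15151515, 582.0, 3507.15151515, 3507.15151515, 3.0, 3507.15151515],
    "X4": [6625.0, 549.0, 6625.0, 6625.0, 223.0, 6625.0],
})

# Hypothetical thresholds for column X2; choose values that suit your data.
low_threshold, high_threshold = 5.0, 1000.0

# Combine both conditions into one boolean mask and filter once.
mask = (df["X2"] > low_threshold) & (df["X2"] < high_threshold)
filtered = df[mask]
print(filtered)
```

Building a single mask avoids reassigning df twice and keeps the original DataFrame available for comparison.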

Now for #1: outlier detection methods vary widely. If you have just these 4 dimensions and not much data, a Median Absolute Deviation (MAD) approach might be sufficient, with no need for sklearn.
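A minimal sketch of a MAD-based filter, under a few assumptions of mine: the function name remove_outliers_mad is hypothetical, the 0.6745 scaling and the 3.5 cutoff follow the commonly used "modified z-score" convention, and columns whose MAD is zero (here X3 and X4, where most values are identical) are skipped rather than flagged.

```python
import pandas as pd

def remove_outliers_mad(df, columns=None, threshold=3.5):
    """Drop rows that have a MAD-based outlier in any of the given columns."""
    cols = list(df.columns) if columns is None else list(columns)
    data = df[cols]
    median = data.median()
    # Median absolute deviation, computed per column.
    mad = (data - median).abs().median()
    # Assumption: a column with zero MAD cannot be scored, so it is skipped.
    mad = mad.where(mad > 0)
    # Modified z-score; 0.6745 makes MAD comparable to a standard deviation
    # for normally distributed data.
    modified_z = 0.6745 * (data - median) / mad
    # NaN scores (skipped columns) compare as False, so they never flag a row.
    is_outlier = modified_z.abs() > threshold
    return df[~is_outlier.any(axis=1)]

# Sample data copied from the question.
df = pd.DataFrame({
    "X1": [228.0, 77.0, 160.0, 36.0, 52.0, 78.0],
    "X2": [4474.91836735, 468.0, 9.0, 250.0, 3.0, 998.0],
    "X3": [3507.15151515, 582.0, 3507.15151515, 3507.15151515, 3.0, 3507.15151515],
    "X4": [6625.0, 549.0, 6625.0, 6625.0, 223.0, 6625.0],
})

cleaned = remove_outliers_mad(df)
print(cleaned)
```

On this sample only the first row is dropped, because its X2 value (4474.9) is far from the column median. Whether 3.5 is the right cutoff depends on your application.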

Since I don't know your application, I'll point you to this documentation on outlier detection in sklearn.
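If you do want to go the sklearn route, here is one possible sketch using IsolationForest, which is just one of several detectors covered in that documentation. The contamination value (the expected fraction of outliers) and random_state are assumptions for illustration, and with only six rows the result should not be taken too seriously.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample data copied from the question.
df = pd.DataFrame({
    "X1": [228.0, 77.0, 160.0, 36.0, 52.0, 78.0],
    "X2": [4474.91836735, 468.0, 9.0, 250.0, 3.0, 998.0],
    "X3": [3507.15151515, 582.0, 3507.15151515, 3507.15151515, 3.0, 3507.15151515],
    "X4": [6625.0, 549.0, 6625.0, 6625.0, 223.0, 6625.0],
})

# Assumed settings: contamination is a guess at the share of outliers.
model = IsolationForest(contamination=0.2, random_state=0)

# fit_predict returns 1 for inliers and -1 for outliers, one label per row.
labels = model.fit_predict(df)

# Keep only the rows labelled as inliers.
inliers = df[labels == 1]
print(inliers)
```

Unlike the per-column threshold and MAD approaches, this scores each row using all four columns together.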

kdd