I have a DataFrame df with no null values and about 100 columns. I want to remove outliers from each column, so I am doing the following:

df1 = df[np.abs(df - df.mean()) <= (3*df.std())]

I would expect df1 to contain fewer records than df, but with the above method the shape remains the same. In addition, it creates a lot of null values.

My understanding is that it is removing the outliers, but in place of the outliers I now have nulls. Is my understanding correct?

iprof0214
  • Try `df[(np.abs(df - df.mean()) <= (3*df.std())).all()]`. – Graipher Mar 18 '19 at 17:53
  • @Graipher - Thanks. Got the following error : IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match – iprof0214 Mar 18 '19 at 17:56
  • you aren't subsetting/slicing your original dataframe anywhere. https://stackoverflow.com/questions/19237878/subsetting-a-python-dataframe – dasvootz Mar 18 '19 at 18:39

1 Answer

Your understanding is correct. Indexing a DataFrame with a boolean DataFrame keeps the original shape and replaces every entry where the mask is False (here, the outliers) with NaN:

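import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(0, 1, (100, 10)))

# True where a value lies within 3 standard deviations of its column mean
idx = np.abs(df - df.mean()) <= (3 * df.std())
outlier_locations = np.where(~idx)  # (row positions, column positions)
df1 = df[idx]  # same shape as df; outliers become NaN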

print(outlier_locations)

    (array([58]), array([9]))

If you expect df1 to contain fewer records than df, then you probably want to drop the rows or columns that contain outliers, or remove only the individual outlying entries, which leaves you with ragged arrays.
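As a minimal sketch of the first two options, assuming the same random example data as above (the names mask, rows_kept and cols_kept are only illustrative), the boolean mask can be reduced along one axis before indexing:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(0, 1, (100, 10)))

# True where a value lies within 3 standard deviations of its column mean.
mask = np.abs(df - df.mean()) <= 3 * df.std()

# Keep only the rows in which every column passes the test.
# axis=1 reduces across columns, so the result is indexed by row label.
rows_kept = df[mask.all(axis=1)]

# Alternatively, keep only the columns that contain no outliers at all.
cols_kept = df.loc[:, mask.all(axis=0)]

print(df.shape, rows_kept.shape, cols_kept.shape)
```

Note that calling .all() without axis=1 reduces down the rows and returns a Series indexed by column label. That Series generally cannot be aligned with the row index, which is likely why the suggestion in the comments raised an IndexingError.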

Nathaniel