I have a pandas data frame:

data = pd.read_csv(path)

I'm looking for a good way to remove outlier rows that have an extreme value in any of the features (I have 400 features in the data frame) before I run some prediction algorithms.

I tried a few approaches, but they don't seem to solve the issue:

  • data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

  • using scikit-learn's StandardScaler

jezrael
Menkes
  • Can you add a sample of the data and the desired output? It seems your solution is [nice](http://stackoverflow.com/a/31502974/2901002). – jezrael Jun 09 '16 at 06:47
  • Unfortunately, I cannot share the data, but is there a built-in way in pandas to do it? – Menkes Jun 09 '16 at 06:48

1 Answer

I think your solution works very well. You can check its output by comparing the indexes of the original and filtered frames with Index.difference, which returns the labels of the rows that were removed:

import pandas as pd
import numpy as np

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print(df)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
21 -0.208499  1.033801 -2.400454
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
27 -3.563517  1.321106  0.152631
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print(df1)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
30  0.079842 -0.399965 -1.027851
31 -0.584718  0.816594 -0.081947
idx = df.index.difference(df1.index)
print(idx)
Int64Index([21, 27], dtype='int64')

print(df.loc[idx])
           A         B         C
21 -0.208499  1.033801 -2.400454
27 -3.563517  1.321106  0.152631
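
With 400 features it may be handier to wrap the filter and the index check in one helper. The following is only a sketch under assumptions: the function name split_outliers, its threshold parameter and the small demo frame are made up for illustration, and the rule is the same "within 3 standard deviations of the column mean" test from the question, written without apply.

```python
import numpy as np
import pandas as pd


def split_outliers(df, threshold=3.0):
    """Split df into (kept, dropped) rows using a per-column z-score rule.

    A row is kept only if every column value lies within `threshold`
    sample standard deviations of that column's mean.
    (Hypothetical helper, not a pandas built-in.)
    """
    # Column-wise z-scores; DataFrame.std() defaults to ddof=1, like Series.std().
    z = (df - df.mean()) / df.std()
    keep = (z.abs() < threshold).all(axis=1)
    kept = df[keep]
    # Same check as above: labels present in df but missing from the filtered frame.
    dropped = df.loc[df.index.difference(kept.index)]
    return kept, dropped


# Made-up demo data: row 5 has an extreme value in column "A".
demo = pd.DataFrame({
    "A": [0.0] * 5 + [100.0] + [0.0] * 14,
    "B": np.arange(20, dtype=float),
})
kept, dropped = split_outliers(demo)
print(dropped.index.tolist())
```

One thing to watch for with many columns: a constant column has a standard deviation of 0, so its z-scores become NaN, the comparison is False for every row, and the filter drops everything. Rows containing NaN values are dropped for the same reason. If the real data has such columns, it is probably worth excluding them before filtering.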
jezrael