I have a dataframe with 50 numerical columns and 10 categorical columns.

df = C1 C2 .. C10 N1 N2 ... N50
     a  b      c   2 3      1

I want to remove all outliers, but only based on columns N1, N2, N6, N8 and N10. That is, I want to keep all rows that are not outliers in any of these columns. What is the best way to do it?

Cranjis

1 Answer

Try any of these:

1) Select and drop rows in a loop, one column at a time:

test_cols = ['N1','N2','N6','N8','N10']

for c in test_cols:
    # rows whose absolute z-score in column c exceeds 3 are treated as outliers
    drop_rows = df[((df[c] - df[c].mean()) / df[c].std()).abs() > 3].index
    df = df.drop(drop_rows)

2) Collect the outlier indices from every column and drop all of them at once:

drop_set = set()
for c in test_cols:
    # index labels of rows with |z-score| > 3 in column c
    drop_ind = df[((df[c] - df[c].mean()) / df[c].std()).abs() > 3].index
    drop_set.update(drop_ind)
df = df.drop(list(drop_set))

3) Build one combined selection condition, joining the per-column tests with | (logical OR), and drop the selected rows at once:

# a row is selected if it is an outlier (|z-score| > 3) in at least one column
drop_rows = df[(((df['N1'] - df['N1'].mean()) / df['N1'].std()).abs() > 3) |
               (((df['N2'] - df['N2'].mean()) / df['N2'].std()).abs() > 3) |
               (((df['N6'] - df['N6'].mean()) / df['N6'].std()).abs() > 3) |
               (((df['N8'] - df['N8'].mean()) / df['N8'].std()).abs() > 3) |
               (((df['N10'] - df['N10'].mean()) / df['N10'].std()).abs() > 3)].index
df = df.drop(drop_rows)

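2) and 3) should be faster than 1). They can also give slightly different results: 1) recomputes the mean and standard deviation on the already-trimmed frame for each subsequent column, while 2) and 3) use the statistics of the original frame.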
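
The same idea can probably be written without an explicit loop or a hand-typed condition per column. The following is only a minimal sketch: the helper name `drop_zscore_outliers`, its `threshold` parameter and the small `demo` frame are made up for illustration and are not from the question. The cutoff of 3 standard deviations is the same one used above.

```python
import pandas as pd


def drop_zscore_outliers(df, cols, threshold=3.0):
    """Hypothetical helper: drop rows that are outliers in any of `cols`."""
    sub = df[cols]
    # column-wise z-scores, computed once on the original frame (as in 2 and 3)
    z = (sub - sub.mean()) / sub.std()
    # True for a row if at least one of the checked columns exceeds the threshold
    is_outlier = (z.abs() > threshold).any(axis=1)
    return df[~is_outlier]


# Illustrative data only (assumed, not taken from the question)
demo = pd.DataFrame({
    "C1": list("ab") * 10,
    "N1": [1.0] * 19 + [100.0],    # last row is extreme in N1
    "N2": [-50.0] + [2.0] * 19,    # first row is extreme in N2
    "N3": [500.0 if i == 5 else 0.0 for i in range(20)],  # extreme, but not checked
})

cleaned = drop_zscore_outliers(demo, ["N1", "N2"])
print(cleaned.shape)  # (18, 4)
```

Rows 0 and 19 are removed because they are extreme in one of the checked columns, while row 5 stays because N3 is not in the list. For the original question the call would be `drop_zscore_outliers(df, ['N1','N2','N6','N8','N10'])`.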

Poe Dator
  • @Rusian Isn't there a way to do it without a loop? Wouldn't that be more elegant? – Cranjis Apr 14 '20 at 09:35
  • If 'N1','N2','N6','N8','N10' were booleans, a clean solution would be possible with the `any()` function. To avoid the loop, you can use a combined condition; see my updated answer. Accept? – Poe Dator Apr 15 '20 at 12:53