pandas dataframe remove outliers from subgroup of the columns

Question

I have a dataframe with 50 numerical columns and 10 categorical columns.

df = C1 C2 .. C10 N1 N2 ... N50
     a  b      c   2 3      1

I want to remove all outliers, but only from columns N1, N2, N6, N8, N10. Meaning I want to keep all rows that are not outliers in any of these columns. What is the best way to do it?

does this help? https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame — M-Wi, Apr 13 '20 at 20:20
@M-Wi no because I only want to filter by sub-group of the columns — Cranjis, Apr 13 '20 at 20:26
At least a few of the answers on that thread involve specifying the columns for filtration. — M-Wi, Apr 13 '20 at 20:29
@M-Wi I saw an answer for a specific column. I want it for several columns, and I want to avoid a loop — Cranjis, Apr 13 '20 at 20:37
I think it would be clearer if you edited those specific requirements into your question. As it is, you're asking for the "best" way, which may be a loop. — M-Wi, Apr 13 '20 at 20:41

Poe Dator · Accepted Answer · 2020-04-15T12:52:45.800

Try any of these:

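1) selecting and dropping rows in a loop:

test_cols = ['N1','N2','N6','N8','N10']

for c in test_cols:
    # rows with |z-score| > 3 in column c are treated as outliers
    # note: mean and std are recomputed on the already-filtered df in each pass
    drop_rows = df[(((df[c] - df[c].mean()) / df[c].std()).abs() > 3)].index
    df = df.drop(drop_rows)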

2) combine the drop_rows indices and drop all of them at once:

drop_set = set()
for c in test_cols:
    # index labels of the rows that are outliers (|z-score| > 3) in column c
    drop_ind = df[(((df[c] - df[c].mean()) / df[c].std()).abs() > 3)].index
    drop_set.update(drop_ind)
df = df.drop(list(drop_set))

3) build one compound selection condition, joining the per-column tests with | (OR), and drop the selected rows at once:

drop_rows = df[(((df['N1'] - df['N1'].mean()) / df['N1'].std()).abs() > 3) |
               (((df['N2'] - df['N2'].mean()) / df['N2'].std()).abs() > 3) |
               (((df['N6'] - df['N6'].mean()) / df['N6'].std()).abs() > 3) |
               (((df['N8'] - df['N8'].mean()) / df['N8'].std()).abs() > 3) |
               (((df['N10'] - df['N10'].mean()) / df['N10'].std()).abs() > 3)].index
df = df.drop(drop_rows)

2) and 3) should be faster than 1)
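
For a version without an explicit loop, the same z-score test can be applied to all selected columns at once and reduced with any(axis=1). This is a minimal sketch, assuming the same cutoff of 3 standard deviations used in the options above. The helper name drop_outlier_rows and the demo data are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_outlier_rows(df, cols, threshold=3.0):
    """Return df without rows whose |z-score| exceeds threshold in any of cols."""
    sub = df[cols]
    # Column-wise z-scores; mean and std are computed once, on the full data.
    z = (sub - sub.mean()) / sub.std()
    # True for rows that are an outlier in at least one of the selected columns.
    is_outlier = (z.abs() > threshold).any(axis=1)
    return df[~is_outlier]


# Hypothetical demo data: one categorical and two numeric columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "C1": ["a"] * 200,
    "N1": rng.normal(size=200),
    "N2": rng.normal(size=200),
})
demo.loc[5, "N1"] = 50.0    # planted outlier in N1
demo.loc[9, "N2"] = -50.0   # planted outlier in N2

cleaned = drop_outlier_rows(demo, ["N1", "N2"])
```

With the question's data this would be called as drop_outlier_rows(df, ['N1','N2','N6','N8','N10']). Like options 2) and 3), it computes the statistics on the unfiltered data, so it can give a different result from option 1), which recomputes them after each drop.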

@Rusian Isn't there a way to do it without a loop? Wouldn't that be more elegant? — Cranjis, Apr 14 '20 at 09:35
If 'N1','N2','N6','N8','N10' were booleans, a nice clean solution would be possible with the `any()` function. To avoid the loop, you can use a compound condition. See my updated answer. Accept? — Poe Dator, Apr 15 '20 at 12:53
