Python remove outliers from data

Question

I have a data frame as following:

ID Value
A   70
A   80
B   75
C   10
B   50
A   1000
C   60
B   2000
..  ..

I would like to group this data by ID, remove the outliers from the grouped data (the ones we see from the boxplot) and then calculate mean.

So far

grouped = df.groupby('ID')

statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})

How can I find outliers, remove them and get the statistics.

This is essentially a duplicated of https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list/45215127#45215127, with a few answers already — Chris, Jul 20 '17 at 22:16
Possible duplicate of [Is there a numpy builtin to reject outliers from a list](https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list) — Herpes Free Engineer, Jul 14 '18 at 20:28

Sam · Accepted Answer · 2016-04-26T13:54:11.493

11

I believe the method you're referring to is to remove values > 1.5 * the interquartile range away from the median. So first, calculate your initial statistics:

statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' : grouped['Value'].quantile(.75)})

And then determine whether values in the original DF are outliers:

def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + (1.5* iq_range)) or row.Value < (median - (1.5* iq_range)):
        return True
    else:
        return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis = 1)
#filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]

edited Apr 26 '16 at 13:54

answered Apr 26 '16 at 13:45

Sam

4,000
20
27

Thanks Sam, this is what I was looking for. Can you please suggest a good tutorial on Pandas. I am a beginner and have not much experience. Cheers! – chintan s Apr 26 '16 at 14:06
can you please explain this part of code: statBefore.loc[row.ID]['q3'] – user1584253 Aug 25 '16 at 15:18
@user1584253 This function returns the respective quantile with regard to individual features. – yuqli Apr 06 '18 at 19:15

score 1 · Answer 2 · answered Sep 10 '20 at 11:35

1

Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) |(df['Value'] > (Q3 + 1.5 * 
IQR))).any(axis=1)]

answered Sep 10 '20 at 11:35

Yogesh Kawadkar

11
2

score 0 · Answer 3 · answered Apr 26 '16 at 12:37

0

just do :

In [187]: df[df<100].groupby('ID').agg(['mean','median','std'])
Out[187]: 
   Value                  
    mean median        std
ID                        
A   75.0   75.0   7.071068
B   62.5   62.5  17.677670
C   35.0   35.0  35.355339

answered Apr 26 '16 at 12:37

B. M.

18,243
2
35
54

Thanks but I would like to remove outliers first, before doing mean, std calculations. The outlier method I would like to use is the one used by Boxplot calculations – chintan s Apr 26 '16 at 12:44
I don't know this method. have you references ? – B. M. Apr 26 '16 at 12:56

Python remove outliers from data

3 Answers3

Linked