31

I have a DataFrame df with 40 columns and many records.

df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39

For each column except the User_id column, I want to check for outliers and remove the whole record if an outlier appears.

For outlier detection in each column I decided to simply use the 5th and 95th percentiles (I know it's not the best statistical way):

The code I have so far:

import numpy as np

P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]

Question: How can I apply this approach to all columns (except User_id) without doing this by hand? My goal is to get a dataframe without records that had outliers.

Thank you!

matrixanomaly
Mi Funk

5 Answers

100

Use this code and don't waste your time:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows where every column lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

In case you only want to filter specific columns:

cols = ['col_1', 'col_2'] # one or more

Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1

df = df[~((df[cols] < (Q1 - 1.5 * IQR)) | (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
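
In case the expression looks opaque: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are the usual Tukey fences, the comparison builds a Boolean frame that is True wherever a value falls outside its own column's fences, .any(axis=1) collapses that to one flag per row, and ~ keeps the rows with no flagged column. A minimal sketch on a small made-up frame (not the question's data):

import pandas as pd

# Made-up demo frame: row 3 has an extreme value in column 'a'
demo = pd.DataFrame({'a': [1, 2, 3, 100], 'b': [10, 11, 12, 13]})

Q1 = demo.quantile(0.25)
Q3 = demo.quantile(0.75)
IQR = Q3 - Q1

# True wherever a value is outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
outliers = (demo < (Q1 - 1.5 * IQR)) | (demo > (Q3 + 1.5 * IQR))

# One flag per row; ~ keeps the rows with no outlier in any column
print(demo[~outliers.any(axis=1)])

Row 3 is dropped because 100 is far above the upper fence for column 'a', even though its value in 'b' is unremarkable.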
smerllo
E.Zolduoarrati
    There aren't enough upvotes to express my thanks for you posting this. – exmatelote Apr 23 '20 at 07:47
  • Thanks this was very helpful – Bawantha May 29 '20 at 22:16
    Why + and - 1.5* IQR? – Carl Dec 13 '20 at 00:18
  • @Carl outliers are the data points that fall outside 1.5 times the interquartile range (Q3 - Q1), so + and - 1.5*IQR means we only keep data within those bounds – stuckoverflow Mar 08 '21 at 17:45
  • Brilliant solution, thanks! – Lorenzo Bassetti Sep 27 '21 at 11:37
  • Can you explain the .any() portion of this? I generally get what .any() does (or so I think). I get some weird results when I exclude it, and my thinking was it isn't necessary. Clearly it is, so I was just wondering if you have a moment to clarify your use here please. Thanks! – OptionAndrew Dec 15 '21 at 16:20
  • Is there a way to implement groupby(), to remove outliers separately for each group of items? – KayEss Apr 11 '22 at 11:36
    is there an explanation for this? – Themasterhimself May 31 '22 at 17:41
  • @OptionAndrew It seems like the purpose of the any(1) is to discard rows when the condition is true for any column present in the dataframe. The interquartile range is computed separately for all columns, but if any column in a row is further than 1.5 * IQR from Q1 or Q3, then that row is dropped. Note that any(1) is inside the brackets ([]), so it applies to the condition – MattSt Sep 01 '22 at 13:10
  • Couldn't this be simplified to `df = df[~((df < df.quantile(0.13))) |(df > (df.quantile(1-0.13))).any(axis=1)]` considering the properties of the normal distribution? – Ignacio Aug 16 '23 at 20:48
36

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First removing the User_id column

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filtering values based on the computed percentiles. To do that, I use an apply by columns, and that's it!

# Values that fall outside a column's [5th, 95th] percentile range become NaN
filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Checking the result: values that fall outside the percentile bounds have been replaced by NaN.

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

Last, the rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

How to generate the test dataset

import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)
Romain
  • Worked! But in my case, I had to change (lambda x: x[(x>quant_df.loc[low,x.name]) & (x < quant_df.loc[high,x.name])], axis=0) to (lambda x: x[(x >= quant_df.loc[low,x.name]) & (x <= quant_df.loc[high,x.name])], axis=0). Otherwise all records will be dropped. I have medians which are very close to zero, like 0,00001, maybe that's why. – Mi Funk Mar 06 '16 at 17:05
  • Great ! I don't see the difference between the two lambda except the line feed. – Romain Mar 06 '16 at 17:11
  • 1
    I included the upper and lower boundaries with ">=" and "<=" instead using ">" and "<". – Mi Funk Mar 06 '16 at 17:20
  • After that, why are we getting a lot of "NaN" values which are not there in the original dataset? How can we handle that? – DreamerP Mar 27 '18 at 12:59
  • But won't the rows be scrambled for different columns? – Rohan Bhale Sep 15 '20 at 16:12
7

What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.

Here's an example:

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

Original data

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()

Winsorized data
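
If you want to apply it to the question's DataFrame rather than a single Series, here is a minimal sketch (assuming a df with a User_id column, as in the question) that winsorizes every other column in place. Unlike the percentile filter, this keeps every record and only clips the extreme values:

import pandas as pd
from scipy.stats import mstats

# Sketch under the question's assumptions: clip every column except User_id
# at its own 5th/95th percentiles; no rows are removed, extreme values are
# replaced by the boundary values.
cols = df.columns.drop('User_id')
df[cols] = df[cols].apply(
    lambda s: pd.Series(mstats.winsorize(s, limits=[0.05, 0.05]), index=s.index)
)

Whether clipping or dropping records is more appropriate depends on what you do with the data afterwards.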

mgoldwasser
2

Use an inner join. Something like this should work:

import numpy as np

cols = df.columns.tolist()
cols.remove('User_id')  # remove User_id from the list of columns to check

# Rows that pass the percentile check for the first column
P = np.percentile(df[cols[0]], [5, 95])
new_df = df[(df[cols[0]] > P[0]) & (df[cols[0]] < P[1])]

# Inner-join on the index with the rows that pass the check for each remaining
# column, so only records that pass for every column survive
for col in cols[1:]:
    P = np.percentile(df[col], [5, 95])
    passed = df[(df[col] > P[0]) & (df[col] < P[1])]
    new_df = new_df.join(passed[[]], how='inner')  # [[]]: join on the index only, no duplicated columns
-1

To trim the entire DataFrame based on a single column, here is an easier way: remove n rows from the top and bottom after sorting on that column.

nb_to_trim = round(len(df.index) * 0.05)  # 5% of the rows at each end
df = df.sort_values('Col1').iloc[nb_to_trim:-nb_to_trim, :]
ChaimG