3

I have a Pandas DataFrame containing 3 categorical grouping variables and 1 numerical outcome variable. Within each group, there is an n = 6, where one of these values may be an outlier (as defined by the distribution within each group: an outlier can either exceed quartile 3 by 1.5 times the inter-quartile range, or be less than quartile 1 by 1.5 times the inter-quartile range).

An example of the DataFrame is shown below:

# Making the df without our outcome variable

import numpy as np
import pandas as pd

G1 = np.repeat(['E', 'F'], 24)
G2 = np.tile(np.repeat(['C', 'D'], 6), 4)
G3 = np.tile(np.repeat(['A', 'B'], 12), 2)

dummy_data = pd.DataFrame({'G1' : G1, 'G2' : G2, 'G3': G3})

# Defining a function to generate a numpy array with n = 6, where one of these values is an outlier # by our previous definition

np.random.seed(0)

def outlier_arr(low, high):
    norm_arr = np.random.randint(low, high, 5)

    IQR = np.percentile(norm_arr, 75) - np.percentile(norm_arr, 25)
    upper_fence = np.percentile(norm_arr, 75) + (IQR * 1.5)
    lower_fence = np.percentile(norm_arr, 25) - (IQR * 1.5)
    rand_decision = np.random.randint(0, 2, 1)[0]

    if rand_decision == 1:
        high_outlier = np.round(upper_fence * 3, decimals = 0)
        final_arr = np.hstack([norm_arr, high_outlier])

    else:
        low_outlier = np.round(lower_fence * (1/3), decimals = 0)
        final_arr = np.hstack([norm_arr, low_outlier])

    return final_arr.astype(int)

# Making a list to add into the dataframe to represent our values

abund_arr = []

for i in range(0, 8):
    abund_arr = abund_arr + outlier_arr(700, 800).tolist()

abund_arr = np.array(abund_arr)

# Appending this list as a new row

dummy_data['V1'] = abund_arr

This should generate a DataFrame with 3 grouping variables G1, G2, and G3, and a single outcome variable V1 where each group should have one outlier that needs to be removed. We can look at the first 6 rows (a single group) with dummy_data.head(6) below to see that one of these values (the last row) is an outlier that we would like to filter out.


    G1  G2  G3  V1
0   E   C   A   744
1   E   C   A   747
2   E   C   A   764
3   E   C   A   767
4   E   C   A   767
5   E   C   A   2391 <--- outlier

From what I understand, a good approach may be to use df.groupby().filter(), and to group by variables G1, G2, and G3 and implement a user-defined function to filter() that returns T/F based on the outlier criteria discusses above.

I have tried this, where the function for detecting outliers (returns array of True or False) within an array is found below:

def is_outlier(x): 

    IQR = np.percentile(x, 75) - np.percentile(x, 25)
    upper_fence = np.percentile(x, 75) + (IQR * 1.5)
    lower_fence = np.percentile(x, 25) - (IQR * 1.5)

    return (x > upper_fence) | (x < lower_fence)

Which correctly detects an outlier as shown below:

test_arr = outlier_arr(300, 500)

is_outlier(test_arr)

# returns an array of [False, False, False, False, False,  True]

However, when using the method described above on a pandas object, the following code throws no errors, but also does not filter any of the outliers:

dummy_data.groupby(['G1', 'G2', 'G3']).filter(lambda x: (is_outlier(x['V1'])).any())

NOTE: I actually found a way to do this here, where you use apply() instead of filter().

Running dummy_data[~dummy_data.groupby(['G1', 'G2', 'G3'])['V1'].apply(is_outlier)] produced the desired result.

However, just for the sake of doing it with this method, what needs to be tweaked to get this to work using filter()? If it's possible, which of the two ways is correct/preferred?

Thanks in advance.

1010shane
  • 75
  • 7
  • 1
    You can see: https://stackoverflow.com/questions/58731702/pandas-use-dataframegroupby-filter-method-to-select-dataframes-rows-with-a-v/58734451#58734451. The manipulation is slightly different (i.e. exclude above mean), but the core issue is the same. `GroupBy.filter` excludes **entire** groups. It **cannot** be used to exclude individual rows within groups. (IMO, the documentation isn't very clear and leads to this sort of confusion often) – ALollz Feb 27 '20 at 17:31
  • 1
    `groupby().transform` mean and std, then use those to filter with `between`. – Quang Hoang Feb 27 '20 at 17:37

0 Answers0