I have a Pandas DataFrame containing 3 categorical grouping variables and 1 numerical outcome variable. Within each group, there is an n = 6, where one of these values may be an outlier (as defined by the distribution within each group: an outlier can either exceed quartile 3 by 1.5 times the inter-quartile range, or be less than quartile 1 by 1.5 times the inter-quartile range).
An example of the DataFrame is shown below:
# Making the df without our outcome variable
import numpy as np
import pandas as pd
G1 = np.repeat(['E', 'F'], 24)
G2 = np.tile(np.repeat(['C', 'D'], 6), 4)
G3 = np.tile(np.repeat(['A', 'B'], 12), 2)
dummy_data = pd.DataFrame({'G1' : G1, 'G2' : G2, 'G3': G3})
# Defining a function to generate a numpy array with n = 6, where one of these values is an outlier # by our previous definition
np.random.seed(0)
def outlier_arr(low, high):
norm_arr = np.random.randint(low, high, 5)
IQR = np.percentile(norm_arr, 75) - np.percentile(norm_arr, 25)
upper_fence = np.percentile(norm_arr, 75) + (IQR * 1.5)
lower_fence = np.percentile(norm_arr, 25) - (IQR * 1.5)
rand_decision = np.random.randint(0, 2, 1)[0]
if rand_decision == 1:
high_outlier = np.round(upper_fence * 3, decimals = 0)
final_arr = np.hstack([norm_arr, high_outlier])
else:
low_outlier = np.round(lower_fence * (1/3), decimals = 0)
final_arr = np.hstack([norm_arr, low_outlier])
return final_arr.astype(int)
# Making a list to add into the dataframe to represent our values
abund_arr = []
for i in range(0, 8):
abund_arr = abund_arr + outlier_arr(700, 800).tolist()
abund_arr = np.array(abund_arr)
# Appending this list as a new row
dummy_data['V1'] = abund_arr
This should generate a DataFrame with 3 grouping variables G1
, G2
, and G3
, and a single outcome variable V1
where each group should have one outlier that needs to be removed. We can look at the first 6 rows (a single group) with dummy_data.head(6)
below to see that one of these values (the last row) is an outlier that we would like to filter out.
G1 G2 G3 V1
0 E C A 744
1 E C A 747
2 E C A 764
3 E C A 767
4 E C A 767
5 E C A 2391 <--- outlier
From what I understand, a good approach may be to use df.groupby().filter(), and to group by variables G1
, G2
, and G3
and implement a user-defined function to filter()
that returns T/F based on the outlier criteria discusses above.
I have tried this, where the function for detecting outliers (returns array of True
or False
) within an array is found below:
def is_outlier(x):
IQR = np.percentile(x, 75) - np.percentile(x, 25)
upper_fence = np.percentile(x, 75) + (IQR * 1.5)
lower_fence = np.percentile(x, 25) - (IQR * 1.5)
return (x > upper_fence) | (x < lower_fence)
Which correctly detects an outlier as shown below:
test_arr = outlier_arr(300, 500)
is_outlier(test_arr)
# returns an array of [False, False, False, False, False, True]
However, when using the method described above on a pandas object, the following code throws no errors, but also does not filter any of the outliers:
dummy_data.groupby(['G1', 'G2', 'G3']).filter(lambda x: (is_outlier(x['V1'])).any())
NOTE: I actually found a way to do this here, where you use apply()
instead of filter()
.
Running dummy_data[~dummy_data.groupby(['G1', 'G2', 'G3'])['V1'].apply(is_outlier)]
produced the desired result.
However, just for the sake of doing it with this method, what needs to be tweaked to get this to work using filter()
? If it's possible, which of the two ways is correct/preferred?
Thanks in advance.