Filtering outliers in a big dataset with columns of both numerical and string values

Question

I have a large dataset with mostly numeric columns and some object type (string) columns. I am trying to remove the outliers using quantiles for the numeric columns, but I am having trouble skipping over the string columns. I want to iterate over each column, check if it is an object type and if not, calculate the IQR for that column, find the outliers, remove the entire row that outlier is in and move on to the next column. I tried different approaches but the one below makes the most sense out of the ones I came up with. The issue is that it filters more than 90% of the rows which I know isn't correct because I made another dataset with just the numeric columns, filtered those, and got a reasonable amount removed (<10%). I just don't know how to implement this and would appreciate any help.

def filter_outliers(df):
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    df_filtered = df.copy()
    
    for column in numeric_columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        df_filtered = df_filtered.drop(outliers.index, errors='ignore')
    
    return df_filtered

Would it be possible to provide a sample of the data you are trying to work with? I created a sample dataset based on your description and am attempting to run the above, but I am getting the error, "NameError: name 'numeric_columns' is not defined". — Michael Grogan, Jun 15 '23 at 23:14

score 0 · Answer 1 · answered Jun 16 '23 at 00:07

Let us consider the following pandas dataframe (df) of four columns - three are numeric.

a1      a2      a3     a4
22      51      X      100
78      41      X      52
19      35      X      67
31      15      X      93
29      26      X      47
71      94      X      77
17      52      X      7
74      40      X      32
10023   5454    X      45454

We can see that the last row shows clear outliers for a1, a2, and a4.

Firstly, we can filter the dataframe to only include numeric values:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
newdf

Here is what our dataframe now looks like:

a1      a2      a4
22      51      100
78      41      52
19      35      67
31      15      93
29      26      47
71      94      77
17      52      7
74      40      32
10023   5454    45454

Using this function, we can assign NA to those values that lie outside the interquartile range:

for x in newdf:
    q3,q1 = np.percentile(df.loc[:,x],[75,25])
    iqr = q3-q1
 
    max = q3+1.5*iqr
    min = q1-1.5*iqr
 
    newdf.loc[newdf[x] < min,x] = np.nan
    newdf.loc[newdf[x] > max,x] = np.nan

Then, these NA entries can be dropped from the dataframe:

newdf.dropna(axis='rows')

Now, we have an array where the outliers have been excluded:

a1      a2      a4
22.0    51.0    100.0
78.0    41.0    52.0
19.0    35.0    67.0
31.0    15.0    93.0
29.0    26.0    47.0
17.0    52.0    7.0
74.0    40.0    32.0

Please refer to the following references for further guidance:

Filtering outliers in a big dataset with columns of both numerical and string values

1 Answers1