1

I'm trying to pull the indices from each column where a value has been flagged as an outlier. I then want to combine all those indices and remove the corresponding rows from my dataframe. I have a starting point below. I'm not sure whether the function should take the full dataset and detect outliers in each column internally, or whether I should call it per column in a for loop and append the bad indexes to a list.

def find_outliers(df):
    q1 = df[i].quantile(.25)
    q3 = df[i].quantile(.75)
    IQR = q3 - q1
    ll = q1 - (1.5*IQR)
    ul = q3 + (1.5*IQR)
    upper_outliers = df[df[i] > ul].index.tolist()
    lower_outliers = df[df[i] < ll].index.tolist()
    bad_indices = list(set(upper_outliers + lower_outliers))
    return(bad_indices)

bad_indexes = []
for col in df.columns:
    if df[col].dtype in ["int64","float64"]:
        bad_indexes.append(find_outliers(df[col]))
casanoan
  • Does this answer your question? [Detect and exclude outliers in Pandas data frame](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame) – Chris Sep 20 '21 at 00:55
  • Hi Chris. I was able to solve my issue. It looks like I just needed to fix my function input before iterating across all columns. Thanks – casanoan Sep 20 '21 at 01:05

2 Answers

0

It looks like I just had to change my function's input and iterate over each column of the dataframe:

def find_outliers(col):
    q1 = col.quantile(.25)
    q3 = col.quantile(.75)
    IQR = q3 - q1
    ll = q1 - (1.5*IQR)
    ul = q3 + (1.5*IQR)
    upper_outliers = col[col > ul].index.tolist()
    lower_outliers = col[col < ll].index.tolist()
    bad_indices = list(set(upper_outliers + lower_outliers))
    return(bad_indices)

# Collect outlier index labels from every numeric column into one set.
# A set avoids duplicates and also works when a column has no outliers.
bad_indexes = set()
for col in df.columns:
    if df[col].dtype in ["int64","float64"]:
        bad_indexes.update(find_outliers(df[col]))

print(len(bad_indexes))
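
The question also asks about removing the flagged rows, which the snippet above stops short of. Here is a minimal sketch of that last step, assuming a small made-up dataframe (the column names and values are illustrative, not from the question). It collects the labels in a set and passes them to `DataFrame.drop`; `select_dtypes` is swapped in for the explicit dtype check.

```python
import pandas as pd

def find_outliers(col):
    """Return index labels of values outside the 1.5*IQR fences."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    ll = q1 - 1.5 * iqr
    ul = q3 + 1.5 * iqr
    return col[(col < ll) | (col > ul)].index.tolist()

# Hypothetical example data
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10.0, 11.0, -50.0, 12.0, 13.0],
    "label": ["v", "w", "x", "y", "z"],
})

# Gather labels flagged in any numeric column
bad_indexes = set()
for col in df.select_dtypes(include="number").columns:
    bad_indexes.update(find_outliers(df[col]))

# Drop the flagged rows by label
df_clean = df.drop(index=list(bad_indexes))
```

`drop` returns a new dataframe, so `df` itself is left unchanged.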
casanoan
0

This will work for you

def find_outliers(df_in, col_name):
    Q1 = df_in[col_name].quantile(0.25)
    Q3 = df_in[col_name].quantile(0.75)
    IQR = Q3 - Q1
    fence_low = Q1 - 1.5 * IQR
    fence_high = Q3 + 1.5 * IQR
    # Boolean flag per row: True where the value lies on or beyond a fence
    outlier_list = ((df_in[col_name] <= fence_low) | (df_in[col_name] >= fence_high)).tolist()
    # These are 0-based row positions, not index labels
    outlier_indexes = [i for i, x in enumerate(outlier_list) if x]
    return outlier_indexes
#----------------------
bad_indexes=[]
for col in df.columns:
    if df[col].dtype in ["int64", "float64"]:
        outlierindexes= find_outliers(df, col)
        bad_indexes.extend(outlierindexes)
# A row flagged in several columns appears more than once, so deduplicate
print(f"All bad indexes: {sorted(set(bad_indexes))}")
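
Note that `enumerate` yields row positions rather than index labels, and the two only coincide when the dataframe has the default `RangeIndex`. One way to sidestep the distinction is to build a boolean mask per column and filter with it. The following is a sketch under assumptions: the helper `outlier_mask` and the example data are hypothetical, and it keeps the same inclusive fences as the function above.

```python
import pandas as pd

def outlier_mask(col):
    """Boolean Series, True where a value lies on or beyond the 1.5*IQR fences."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    return (col <= q1 - 1.5 * iqr) | (col >= q3 + 1.5 * iqr)

# Hypothetical data with string index labels
df = pd.DataFrame(
    {"x": [5, 6, 7, 8, 90], "y": [1.0, 1.5, 2.0, 2.5, 3.0]},
    index=["r1", "r2", "r3", "r4", "r5"],
)

numeric = df.select_dtypes(include="number")
# True for any row flagged in at least one numeric column
is_outlier = numeric.apply(outlier_mask).any(axis=1)
# Keep only the rows that were never flagged
df_clean = df[~is_outlier]
```

Because the mask is aligned to the dataframe's own index, it works the same whatever the index looks like.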
FEldin