1

I am trying to automate removing outliers from a Pandas dataframe using IQR as the parameter and putting the variables in a list.

This code works - (where dummy_df is the dataframe and 'pdays' is the first variable I want to remove outliers for).

q1 = np.percentile(dummy_df['pdays'], 25, interpolation = 'midpoint')
 
q3 = np.percentile(dummy_df['pdays'], 75, interpolation = 'midpoint') 

iqr = q3 - q1

upper = np.where(dummy_df['pdays'] >= (q3+1.5*iqr))

lower = np.where(dummy_df['pdays'] <= (q1-1.5*iqr))

dummy_df.drop(upper[0], inplace = True)

dummy_df.drop(lower[0], inplace = True)

print("New Shape: ", dummy_df.shape)

However, this doesn't -

remove_outliers = ['pdays','poutcome', 'campaign', 'previous']

for outlier in remove_outliers:

    q1 = np.percentile(dummy_df[outlier], 25, interpolation = 'midpoint')
 
    q3 = np.percentile(dummy_df[outlier], 75, interpolation = 'midpoint') 

    iqr = q3 - q1 

    upper = np.where(dummy_df[outlier] >= (q3+1.5*iqr))

    lower = np.where(dummy_df[outlier] <= (q1-1.5*iqr))

    dummy_df.drop(upper[0], inplace = True)

    dummy_df.drop(lower[0], inplace = True)

print("New Shape: ", dummy_df.shape) 

The error I am getting is about different datatypes. But why? Isn't it the same thing? What am I missing?

I want to be able to run a for loop since I am going to be doing trial and error on the decision tree for the best accuracy. I don't want to rewrite the code every time I drop or add a variable for which I want to remove outliers.

I have tried putting dummy_df['pdays'] etc. in the remove_outliers list, as well as dummy_df.pdays, etc. I have tried using loc and iloc, though I don't think that's applicable. I'm not sure what to do next. The important thing is that I need to understand the difference: what am I missing?

desertnaut
RikkiS
  • Please post text as text, not as screenshots. – deceze Jul 12 '21 at 13:54
  • @RikkiS Are there any `NaN` values in your columns? – filiabel Jul 12 '21 at 14:05
  • @deceze - Apologies. My first question on Stackoverflow. I thought I had followed all protocols- my questions description is in text, code is segregated and the error message (or correct output) is as images. Please do correct me if not correct. Thanks. – RikkiS Jul 12 '21 at 15:03
  • @filiabel = Thanks. No NaN values. In fact, as mentioned, when I do it outside the loop - it works perfectly fine. But as soon as I put it in a for loop - I get the error. There has to be some syntax mistake. And I purposely put the first variable that worked as the first item on the list so that if there is any mistake it would start from the 2nd item on list. But first variable itself doesn't work though it works outside the loop. – RikkiS Jul 12 '21 at 15:05
  • @RikkiS dtype ` – filiabel Jul 12 '21 at 16:18
  • @filiabel - Hi! Follows the results of dummy_df.info() RangeIndex: 11162 entries, 0 to 11161 Data columns (total 17 columns): # Column Non-Null Count Dtype 0 age 11162 non-null int64 1 job 11162 non-null object 2 marital 11162 non-null object 12 campaign 11162 non-null int64 13 pdays 11162 non-null int64 14 previous 11162 non-null int64 15 poutcome 11162 non-null object 16 deposit 11162 non-null object I deleted rows 3 to 11 since short of space. They follow in next comment – RikkiS Jul 14 '21 at 12:47
  • @filiabel Rest of the rows. 3 education 11162 non-null object 4 default 11162 non-null object 5 balance 11162 non-null int64 6 housing 11162 non-null object 7 loan 11162 non-null object 8 contact 11162 non-null object 9 day 11162 non-null int64 10 month 11162 non-null object 11 duration 11162 non-null int64 – RikkiS Jul 14 '21 at 12:48
  • @RikkiS `poutcome` is of dtype `object` and not `int64` as you have said. This might explain why it is not working, as it should be consisting of numbers to calculate `np.percentile`. – filiabel Jul 14 '21 at 14:46
  • @filiabel Correct. I figured that and removed it from the list, yet wasn't working. Forgot to add- when I made the code into a function - it worked - except of course 'poutcome' variable. – RikkiS Jul 14 '21 at 15:21
  • @RikkiS made an answer now you can check out :) – filiabel Jul 14 '21 at 15:55

2 Answers

6

Based on comments on the original post, I suggest you do the following and revamp your solution.

I believe this answer provides a quick solution to your problem, so remember to search on SO before posting. This will remove all rows where one (or more) of the wanted column values is an outlier.

cols = ['pdays', 'campaign', 'previous'] # The columns you want to search for outliers in

# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25) # Like np.percentile, but takes a fraction in [0, 1] instead of a percentage in [0, 100]
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1

# Boolean Series: True for rows where none of the selected columns holds an outlier
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)

# Filter our dataframe based on condition
filtered_df = dummy_df[condition]
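
Since the goal is to rerun this for different column lists, the same filter can be wrapped in a function. The following is only a minimal sketch: `remove_iqr_outliers`, its `k` parameter and the toy data are hypothetical names and values chosen for illustration, not part of pandas or of the original dataset. Filtering with a boolean mask also sidesteps one likely pitfall of the loop in the question: `np.where` returns row positions while `drop` expects index labels, and the two stop matching once rows have been dropped in place without resetting the index.

```python
import pandas as pd


def remove_iqr_outliers(df, cols, k=1.5):
    """Return a copy of df without rows that hold an IQR outlier in any of cols.

    Hypothetical helper for illustration; cols must be numeric columns.
    """
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    # True where a value lies outside [q1 - k*iqr, q3 + k*iqr]
    is_outlier = (df[cols] < q1 - k * iqr) | (df[cols] > q3 + k * iqr)
    # The mask is aligned on index labels, so row positions never come into play
    return df[~is_outlier.any(axis=1)].copy()


# Toy data standing in for dummy_df (values are made up)
toy_df = pd.DataFrame({
    "pdays": [1, 2, 3, 4, 100],
    "campaign": [1, 1, 2, 2, 3],
    "poutcome": ["a", "b", "a", "b", "a"],  # object column, left out of cols
})

cleaned = remove_iqr_outliers(toy_df, ["pdays", "campaign"])
print("New Shape: ", cleaned.shape)
```

Adding or dropping a variable then only means changing the list passed as `cols`.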
filiabel
  • Hey. Thanks. This worked. Thanks a million. Meanwhile, I tried to upvote but it tells me I need 15 reputation, though my feedback has been recorded. Just so you know. I express my thanks once again. – RikkiS Jul 16 '21 at 15:25
  • Happy to help, @RikkiS. You can [accept the answer](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) :-) – filiabel Jul 16 '21 at 17:30
0

Before removing outliers, check that each feature you are cleaning is numeric (int or float). IQR-based outlier detection only works on numerical features, so it will fail on a column of type object. To check the data types of the DataFrame's columns:

dummy_df.dtypes

If every column you are removing outliers from is of type int64 or float64, there will be no error. If a column is of object type, you have to convert it to a numeric type first.
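
As a rough illustration of that check, the sketch below keeps only the numeric columns from a candidate list and converts a column of numeric strings. The toy frame and the `candidates` list are assumptions made up for this example; `is_numeric_dtype` and `pd.to_numeric` are existing pandas functions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data standing in for dummy_df (values are made up)
toy_df = pd.DataFrame({
    "pdays": [1, 2, 3],
    "previous": ["0", "4", "x"],  # numbers stored as strings, plus one bad entry
    "poutcome": ["success", "failure", "unknown"],
})

# Keep only the candidate columns that are already numeric
candidates = ["pdays", "previous", "poutcome"]
numeric_cols = [c for c in candidates if is_numeric_dtype(toy_df[c])]
print(numeric_cols)

# Convert a column of numeric strings; entries that cannot be parsed become NaN
toy_df["previous"] = pd.to_numeric(toy_df["previous"], errors="coerce")
print(toy_df.dtypes)
```

A genuinely categorical column such as `poutcome` has no meaningful IQR, so leaving it out of the list is usually more sensible than converting it.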

Also, before this, remove all NaN values from the dataset with:

dummy_df = dummy_df.dropna()
desertnaut
  • @desertnut @ mukul Kirti Verma - I can't make out who answered the question. Datatypes are int64. And again, when I run the same code outside a for loop, it works perfectly fine. – RikkiS Jul 12 '21 at 15:08