I am trying to automate removing outliers from a Pandas dataframe using the IQR as the criterion, with the variables to process kept in a list.
This code works (dummy_df is the dataframe and 'pdays' is the first variable I want to remove outliers for):
q1 = np.percentile(dummy_df['pdays'], 25, interpolation = 'midpoint')
q3 = np.percentile(dummy_df['pdays'], 75, interpolation = 'midpoint')
iqr = q3 - q1
upper = np.where(dummy_df['pdays'] >= (q3+1.5*iqr))
lower = np.where(dummy_df['pdays'] <= (q1-1.5*iqr))
dummy_df.drop(upper[0], inplace = True)
dummy_df.drop(lower[0], inplace = True)
print("New Shape: ", dummy_df.shape)
However, this doesn't:
remove_outliers = ['pdays','poutcome', 'campaign', 'previous']
for outlier in remove_outliers:
    q1 = np.percentile(dummy_df[outlier], 25, interpolation = 'midpoint')
    q3 = np.percentile(dummy_df[outlier], 75, interpolation = 'midpoint')
    iqr = q3 - q1
    upper = np.where(dummy_df[outlier] >= (q3+1.5*iqr))
    lower = np.where(dummy_df[outlier] <= (q1-1.5*iqr))
    dummy_df.drop(upper[0], inplace = True)
    dummy_df.drop(lower[0], inplace = True)
print("New Shape: ", dummy_df.shape)
The error I am getting is about different datatypes. But why? Isn't it the same thing? What am I missing?
I want to be able to run a for loop since I am going to be doing trial and error on the decision tree for the best accuracy. I don't want to rewrite the code every time I add or drop a variable for which I want to remove outliers.
I have tried putting dummy_df['pdays'] etc. in the remove_outliers list, as well as dummy_df.pdays, etc. I have also tried using loc and iloc, though I don't think that's applicable. I'm not sure what to do next. The important thing is that I need to understand what the difference is. What am I missing?
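For reference, here is a minimal, self-contained sketch of the per-column IQR filter the loop is meant to perform, written with boolean masks rather than np.where plus drop. The toy dataframe, the helper name iqr_mask, and the numeric-dtype check are assumptions made up for illustration; they are not part of my real code or data. Two things it sidesteps, either of which might be relevant here: np.where returns row positions while drop expects index labels (the two stop matching once rows have been removed), and percentiles cannot be computed on a non-numeric column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def iqr_mask(series, k=1.5):
    """Return a boolean mask that is True for rows inside the IQR fences."""
    q1 = series.quantile(0.25, interpolation="midpoint")
    q3 = series.quantile(0.75, interpolation="midpoint")
    iqr = q3 - q1
    # Strict inequalities mirror the >= / <= outlier tests in the question.
    return (series > q1 - k * iqr) & (series < q3 + k * iqr)


# Toy data, invented for illustration only (not the real dataset).
toy_df = pd.DataFrame({
    "pdays": [1, 2, 3, 4, 5, 100],
    "campaign": [1, 1, 2, 2, 3, 3],
    "poutcome": ["a", "b", "a", "b", "a", "b"],
})

remove_outliers = ["pdays", "poutcome", "campaign"]
for col in remove_outliers:
    # Assumption: skip columns that are not numeric instead of failing.
    if not is_numeric_dtype(toy_df[col]):
        continue
    # Filtering by mask works on labels, so gaps in the index do not matter.
    toy_df = toy_df[iqr_mask(toy_df[col])]

print("New Shape: ", toy_df.shape)
```

Because each pass reassigns the filtered dataframe, the quartiles for later columns are computed on the rows that survived the earlier columns, which is the same order dependence the original loop would have.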