1

How can I detect the outliers of all data set (all continuous columns) based on a categorical variable and replace them with NA. I want to use Tukey technique but focusing on each level of a categorical variable. For example replace the outliers of mtcars[, -c(8,9)] with NA based on the each level of mtcars$am OR How can I modify this code to work for all variables in each level of am.

lapply(mtcars, function(x){sort(outlier_values<- boxplot.stats(x)$out)})
david
  • 13
  • 4

1 Answers1

0

EDIT: outliers are now 1.5*IQR, as specified in comment.

This replaces the outliers in the qsec column per group in the am column with NA's. It does so by first constructin a dataframe called limits, which contains lower- and upperbounds per am group. Then, that dataframe is joined with the original dataframe, and outliers are filtered.

library(dplyr)


limits = data.frame(am = unique(mtcars$am))
limits$lower = lapply(limits$am, function(x) quantile(mtcars$qsec[mtcars$am==x],0.25) - 1.5 * (quantile(mtcars$qsec[mtcars$am==x],0.75)- quantile(mtcars$qsec[mtcars$am==x],0.25)) )
limits$upper = lapply(limits$am, function(x) quantile(mtcars$qsec[mtcars$am==x],0.75) + 1.5 * (quantile(mtcars$qsec[mtcars$am==x],0.75)- quantile(mtcars$qsec[mtcars$am==x],0.25)) )

df = mtcars %>% left_join(limits)
df$qsec = ifelse(df$qsec< df$lower | df$qsec>df$upper,NA,df$qsec) 
df = df %>% select(-upper,-lower)

The a parameter can be used to determine what proportion is considered an outlier.

Florian
  • 24,425
  • 4
  • 49
  • 80
  • thanks so much, I have a question, I want to find the outliers based on Tukey 1.5 * IQR for each column vs each level of a categorical variable. 1. How can I modify your code to change it to 1.5 * IQR. Also How can modify your code to find the Tukey outliers of each column and replace it with NA. In my real data set, I have 19 columns and I want to replace the Tukey outliers of each column with NA. – david Jul 16 '17 at 15:33
  • Modified the queestion to work with 1.5*IQR. You could easily write a for loop that goes over all columns. Let me know if you are able to do so. If you found my answer helpful could you please accept it? Thanks! – Florian Jul 16 '17 at 16:15
  • thanks so much, of course, I will accept your solution. Just two questions. I am new in R so I was wondering how can I modify your code to work with missing observation Because I have so many missing observation. Also any advice regarding having loop function to work for all columns. – david Jul 16 '17 at 17:45
  • See here: https://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame for handling missing values ( the complete cases statement will help). A for loop can be constructed like for (col in colnames(df)){function here} everywhere where I refer to $qsec, you should point to [col]. Hope this helps! – Florian Jul 16 '17 at 20:19
  • @ Florian Maas, you mean, (col in colnames(mtcars)){ limits = data.frame(am = unique(mtcars$am)) limits$lower = lapply(limits$am, function(x) quantile(mtcars$col[mtcars$am==x],0.25) - 1.5 * (quantile(mtcars$col[mtcars$am==x],0.75)- quantile(mtcars$col[mtcars$am==x],0.25)) ) limits$upper = lapply(limits$am, function(x) quantile(mtcars$col[mtcars$am==x],0.75) + 1.5 * (quantile(mtcars$col[mtcars$am==x],0.75)- quantile(mtcars$col[mtcars$am==x],0.25)) ) df = mtcars %>% left_join(limits) df$col = ifelse(df$col< df$lower | df$col>df$upper,NA,df$col) df = df %>% select(-upper,-lower)} – david Jul 16 '17 at 21:50