-2

I am working on a project with a dataset of 8545 × 52. Every variable contains outliers, and unfortunately I can't remove them. I know how to cap outliers by checking the IQR of each column, but with 52 columns, doing this one column at a time will take a lot of time. Can anyone suggest a quick method to treat the outliers?

Peter
  • Welcome to Stack Overflow. It's easier to help if you make your question reproducible: include a minimal dataset as an object, for example a data frame built with df <- data.frame(…), where … stands for your variables and values, or the output of dput(head(df)). Include the code you have tried and set out your expected answer. These links should help: [mre] and [ask] – Peter Jul 11 '20 at 10:13

1 Answer

0

A very quick (and dirty) way to check for and identify outliers is this:

Data:

set.seed(123)
df <- data.frame(
  v1 = c(sample(1:10, 9), 1000),
  v2 = c(runif(9), 2000),
  v3 = c(11111, rnorm(8), 23450))

By default, boxplot() flags values that lie more than 1.5 × IQR beyond the box as outliers; they can be retrieved via $out:

boxplot(df)$out
[1]  1000  2000 11111 23450
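
Because $out pools the flagged values from all columns, it can help to see which column each one came from. The following is a minimal sketch, assuming the df defined above: the object returned by boxplot() also carries a group component (the column index of each outlier) and a names component, and plot = FALSE skips the drawing. The name out_by_col is only illustrative.

```r
set.seed(123)
df <- data.frame(
  v1 = c(sample(1:10, 9), 1000),
  v2 = c(runif(9), 2000),
  v3 = c(11111, rnorm(8), 23450))

# Compute the boxplot statistics without drawing anything
b <- boxplot(df, plot = FALSE)

# b$group holds the column index of each value in b$out;
# splitting by the matching column name gives one vector of outliers per column
out_by_col <- split(b$out, b$names[b$group])
out_by_col
```

With 52 columns, lengths(out_by_col) then gives a quick count of flagged values per variable.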

To detect these values in your dataframe, you can use sapply:

# Test each column against its own outliers; boxplot.stats() does not draw a plot
sapply(df, function(x) x %in% boxplot.stats(x)$out)
         v1    v2    v3
 [1,] FALSE FALSE  TRUE
 [2,] FALSE FALSE FALSE
 [3,] FALSE FALSE FALSE
 [4,] FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE
 [6,] FALSE FALSE FALSE
 [7,] FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE
[10,]  TRUE  TRUE  TRUE
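
Since the question asks about capping rather than only detecting, the same idea can be applied to every column in one pass. What follows is a rough sketch under stated assumptions, not a definitive treatment: cap_iqr is a hypothetical helper (not from any package), the multiplier k = 1.5 is the usual boxplot convention, and quantile() with its default settings can give slightly different quartiles than the hinges boxplot() uses.

```r
# Hypothetical helper: pull values outside the k * IQR fences back to the fences
cap_iqr <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  # pmax() raises low values to the lower fence, pmin() lowers high values
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

set.seed(123)
df <- data.frame(
  v1 = c(sample(1:10, 9), 1000),
  v2 = c(runif(9), 2000),
  v3 = c(11111, rnorm(8), 23450))

# Apply the helper to all numeric columns at once, leaving other columns untouched
num_cols <- vapply(df, is.numeric, logical(1))
df_capped <- df
df_capped[num_cols] <- lapply(df[num_cols], cap_iqr)
```

The lapply() call is what removes the per-column work: it behaves the same for 3 columns or 52, and missing values stay missing because pmin() and pmax() propagate NA by default.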
Chris Ruehlemann