I am working on a project with a dataset of 8545 x 52. Every variable contains outliers, and unfortunately I can't remove them. I know how to cap outliers by checking the IQR of each column, but with 52 columns that would take a lot of time. Can anyone suggest a quick method to treat the outliers?
Welcome to Stack Overflow. It's easier to help if you make your question reproducible: include a minimal dataset as an object, for example a data frame built with df <- data.frame(…), where … stands for your variables and values, or use dput(head(df)). Include the code you have tried and set out your expected answer. These links should be of help: [mre] and [ask] – Peter Jul 11 '20 at 10:13
1 Answer
A very quick (and dirty) way to check for and identify outliers is this:
Data:
set.seed(123)
df <- data.frame(
v1 = c(sample(1:10, 9), 1000),
v2 = c(runif(9), 2000),
v3 = c(11111, rnorm(8), 23450))
Boxplots identify outliers by themselves; they can be retrieved via $out:
boxplot(df)$out
[1]  1000  2000 11111 23450
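To flag these values in your data frame, you can use sapply, testing each column against its own outliers (boxplot.stats() computes the same statistics as boxplot() without drawing a plot):
sapply(df, function(x) x %in% boxplot.stats(x)$out)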
v1 v2 v3
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
[10,] TRUE TRUE TRUE
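
The question asks about capping rather than only detecting outliers, so here is a minimal sketch of how the per-column idea could be extended to all columns at once. The helper name cap_iqr, the multiplier argument k, and the small example data frame are assumptions made for illustration. Note also that quantile() computes quartiles slightly differently from the hinges used by boxplot(), so the fences may not match the boxplot whiskers exactly.

```r
# Hypothetical helper: cap one numeric vector at the IQR fences
# (Q1 - k * IQR and Q3 + k * IQR); NA values are left as NA.
cap_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

# Made-up example data: two numeric columns with one extreme value each,
# plus a non-numeric column that must stay untouched.
df <- data.frame(
  v1 = c(1:9, 1000),
  v2 = c(seq(0.1, 0.9, by = 0.1), 2000),
  label = letters[1:10]
)

# Apply the helper to every numeric column in one step
num_cols <- vapply(df, is.numeric, logical(1))
df_capped <- df
df_capped[num_cols] <- lapply(df[num_cols], cap_iqr)
```

Because lapply runs over all selected columns, the same three lines should work unchanged for 52 columns. Whether 1.5 is a sensible multiplier for every variable is a judgment call that depends on the data.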

Chris Ruehlemann