I am new to R and I am trying to remove the outliers shown in each boxplot, but I have too many columns to do it by hand. For each column I make a boxplot. This is the code:

Library(car)
Boxplot(xf$V1, id.method="y") 
# it draws the boxplot of the first column and prints this boxplot's outliers in the console
# for example, the console output is: 2427 536

# to remove the rows with these outliers I do:
xf = xf[-c(2427,536),]

So I need to go through a huge number of columns and remove the outliers in every column. Can I automate this?

  • But how do I automatically delete the outliers that appear in the console? – Ilya Sleptsov May 02 '17 at 21:17
  • Welcome to Stack Overflow! Your question is not asked in a way that makes it easy to answer. First of all R is case sensitive, but you have erroneously capitalized `library` and `boxplot`. This makes it so we cannot just copy your code and run it. But more importantly, you refer to the data.frame xf. We do not have access to that and so cannot reproduce your results. Please read [How to make a reproducible example](http://stackoverflow.com/q/5963269/4752675) and edit your question so that we can answer. – G5W May 03 '17 at 00:24
  • @G5W actually `Boxplot` *is* a function in `car`. Presumably that's to avoid conflict with `stats::boxplot` ... I'm not sure whether perhaps `Library` is a function in still some other package. Your basic points still stand of course -- the example isn't reproducible. – Glen_b May 03 '17 at 00:41
  • For example, see what happens with this data set: `Boxplot(c(1,2,3,4,8,18,29,56,69,136,152))`. The last value is an outlier by the boxplot rule. But if you remove it, then the next largest one is an outlier. If you remove that, the *next* largest one is an outlier... and so on down. – Glen_b May 03 '17 at 00:57
  • @Ilya what behavior do you expect in those circumstances? – Glen_b May 03 '17 at 01:04
  • I expect the behavior of the `car` package. The code I use is `Boxplot(xf$V1, id.method="y")`; **id.method** is an argument of `Boxplot` in `car`. – Ilya Sleptsov May 03 '17 at 22:29
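
The cascade described in the comments above can be reproduced with `boxplot.stats`, which applies the same 1.5 × IQR rule that base `boxplot` uses. This is only a minimal sketch: the helper name `strip_outliers_repeatedly` is hypothetical and not part of any package, and whether repeated stripping is what you actually want is a separate question.

```r
# Hypothetical helper: keep removing boxplot-rule outliers until none are left.
strip_outliers_repeatedly <- function(x) {
  passes <- 0
  repeat {
    out <- boxplot.stats(x)$out      # values beyond 1.5 * IQR from the hinges
    if (length(out) == 0) break      # stop once nothing is flagged
    x <- x[!x %in% out]              # drop the flagged values
    passes <- passes + 1
  }
  list(remaining = x, passes = passes)
}

res <- strip_outliers_repeatedly(c(1, 2, 3, 4, 8, 18, 29, 56, 69, 136, 152))
res$remaining   # what is left after the rule stops flagging anything
res$passes      # how many rounds of removal were needed
```

A single pass removes far fewer points than the repeated version, so it is worth deciding up front which behavior you expect.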

1 Answer

Are your outliers corrupting your variables so much that you have to alter your data to interpret the distributions? Why not leave the data as it is and check the documentation of the boxplot function for a way to draw everything except the outliers, which are the dots on R's boxplot? I could see outliers corrupting the mean. But the black line in a boxplot is the median, and it is not so easily corrupted by outliers.
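
To check how much the flagged points move the mean compared with the median, a rough sketch on the built-in `airquality` data could look like this. It assumes the base boxplot rule (via `boxplot.stats`) is the outlier criterion; the variable names are only illustrative.

```r
ozone_all <- airquality$Ozone

# boxplot.stats() applies the same 1.5 * IQR rule as base boxplot()
flagged <- boxplot.stats(ozone_all)$out
ozone_trimmed <- ozone_all[!ozone_all %in% flagged]

# Compare both summaries with and without the flagged values
c(mean_all = mean(ozone_all, na.rm = TRUE),
  mean_trimmed = mean(ozone_trimmed, na.rm = TRUE))
c(median_all = median(ozone_all, na.rm = TRUE),
  median_trimmed = median(ozone_trimmed, na.rm = TRUE))
```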

You can see a few outliers here:

boxplot(airquality$Ozone ~ airquality$Month)

I wonder how I make a boxplot without outliers? How about I look at the documentation?

?boxplot

boxplot(airquality$Ozone ~ airquality$Month, outline = FALSE)

What do you know? The outliers aren't there anymore. By default outline is TRUE, so the outliers are drawn. Change it to FALSE and they are not drawn.

If you want to do the same for your data, just call base boxplot, which has no id.method argument:

boxplot(xf$V1, outline = FALSE)

Now suppose I want to remove the outliers from a column of the airquality data frame. First, have a look at the data:

View(airquality)

Then I can get the outliers of the Ozone column like so ...

ozone <- boxplot(airquality$Ozone, outline = FALSE, plot = FALSE)
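
Before pulling anything out of that object, it can help to look at what `boxplot` returns when `plot = FALSE`. A small sketch follows; `outline` is left out because it only affects drawing.

```r
ozone <- boxplot(airquality$Ozone, plot = FALSE)

names(ozone)   # components of the returned list
ozone$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
ozone$n        # number of non-missing observations used
ozone$out      # values flagged as outliers
```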

Let's see what we can take from this object. The outlier points of the Ozone column of the airquality data frame are stored in $out.

To show the outliers in ozone, just do this.

intersect(airquality$Ozone, ozone$out)

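To show everything else in ozone, just do this. Note that setdiff returns only the distinct values, so repeated measurements are collapsed into one.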

setdiff(airquality$Ozone, ozone$out)

I can pass this right to the boxplot function without specifying outline = FALSE, and I get the boxplot without the two outlier points.

boxplot(setdiff(airquality$Ozone, ozone$out))
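
If you need the remaining observations themselves, duplicates included, or the whole rows of the data frame, logical indexing with `%in%` is one option. This is a sketch under the assumption that the base boxplot rule is your criterion; `airquality_no_ozone_outliers` is just an illustrative name.

```r
ozone_out <- boxplot.stats(airquality$Ozone)$out

# TRUE for rows to keep; NA is never matched, so rows with missing Ozone stay
keep <- !airquality$Ozone %in% ozone_out

airquality_no_ozone_outliers <- airquality[keep, ]

nrow(airquality)                      # rows before
nrow(airquality_no_ozone_outliers)    # rows after dropping the flagged ones
```

Unlike `setdiff`, this keeps repeated values and the original row order, so the result can be passed on to other functions as a regular data frame.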

If you want to adjust all your data at once, I would run boxplot on every column and store the results. In my case the data frame is airquality, and I store the results in a variable called tamper.

tamper <- apply(airquality, 2, FUN = boxplot)

See all the things you can work with.

names(tamper)
tamper$Ozone
tamper$Ozone$out
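
The same per-column information can also be collected without drawing six plots. A minimal sketch, assuming every column is numeric as in `airquality`; `outliers_by_column` is an illustrative name:

```r
# One element per column, each holding that column's flagged values
outliers_by_column <- lapply(airquality,
                             function(col) boxplot.stats(col)$out)

outliers_by_column$Ozone      # flagged values in the Ozone column
lengths(outliers_by_column)   # how many values were flagged per column
```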

It takes a for loop to pull the outliers (out) out of every element, but they are all stored in one variable.

Now you can see the outliers in all 6 columns of airquality. Only two columns, 1 (Ozone) and 3 (Wind), have outliers, and the loop prints them.

for(i in 1:length(tamper)){print(tamper[[i]]$out)}
[1] 135 168
numeric(0)
[1] 20.1 18.4 20.7
numeric(0)
numeric(0)
numeric(0)
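
To go one step further and actually drop the affected rows for every column at once, something along these lines might work. It is a sketch, not a definitive implementation: the helper names `outlier_rows` and `remove_outlier_rows` are hypothetical, it assumes the base boxplot rule (1.5 × IQR from the hinges) is the criterion you want, and it makes a single pass only.

```r
# Hypothetical helper: row positions whose value is flagged in one column.
outlier_rows <- function(x) {
  if (!is.numeric(x)) return(integer(0))   # skip non-numeric columns
  which(x %in% boxplot.stats(x)$out)       # NA values are never matched
}

# Hypothetical helper: drop every row flagged in at least one column.
remove_outlier_rows <- function(df) {
  rows <- unique(unlist(lapply(df, outlier_rows)))
  if (length(rows) == 0) df else df[-rows, , drop = FALSE]
}

cleaned <- remove_outlier_rows(airquality)
nrow(airquality)   # rows before
nrow(cleaned)      # rows after removing rows flagged in any column
```

Dropping a whole row also discards its values in the other columns. If that loses too much data, setting only the flagged cells to `NA` is an alternative to consider.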
xyz123