
I'm curious about how the Boxplot() function from the car package decides which outliers to return (see for example "How to show the id of outliers on a boxplot").

I assumed that the detected outliers would be the same whichever function is used, but that turned out not to be the case, particularly for long vectors. For some reason, car::Boxplot() seems to return only the most extreme outliers.

Here is a demonstration using simulated data (simulation method taken from "simulation of normal distribution data contaminated with outliers"):

my.rnorm <- function(N, num.out, mean=0, sd=1){
  x <- rnorm(N, mean = mean, sd = sd)
  ind <- sample(1:N, num.out, replace=FALSE )
  x[ind] <- (abs(x[ind]) + 3*sd) * sign(x[ind])
  x
}


vector<-my.rnorm(1200,20)

First, the base boxplot() function gives me 32 outliers:

outliers1 <- boxplot(vector)$out
sort(outliers1)

[1] -4.124101 -3.869423 -3.768973 -3.768571 -3.639510 -3.536848 -3.469979 -3.422215 -3.240268 -3.141479 -3.107837
[12] -2.822105 -2.723802 2.685210 2.712847 2.726344 2.726544 2.751796 2.762394 3.008180 3.030209 3.116131
[23] 3.146028 3.198794 3.353337 3.423981 3.605032 3.607052 3.944753 3.950593 4.012654 4.623255

The car::Boxplot() function, in contrast, gives me only the 20 most extreme values:

id_outliers<-car::Boxplot(vector)
outliers2<-vector[id_outliers]
sort(outliers2)

[1] -4.124101 -3.869423 -3.768973 -3.768571 -3.639510 -3.536848 -3.469979 -3.422215 -3.240268 -3.141479 3.146028
[12] 3.198794 3.353337 3.423981 3.605032 3.607052 3.944753 3.950593 4.012654 4.623255

It seems that car::Boxplot() does not return the 12 least extreme outliers (32 found by boxplot() versus 20 returned here). The difference is clearer when comparing the two boxplots:

plot from boxplot()

plot from car::Boxplot()

My question: why does car::Boxplot() not return all the outliers?

Kyabdro

1 Answer


OK, I explored the code of car::Boxplot() and found that, by default, the function returns only the 10 most extreme low values and the 10 most extreme high values.

I guess I now need to ask the developers what statistical reasons (if any) motivated this choice when the function was written.
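To make the capping described above concrete, here is a minimal base-R sketch of that behaviour. It is not car's actual source: the helper `cap_outliers` and its argument `n` are hypothetical names chosen for illustration, and the toy data are made up so that the counts are deterministic.

```r
# Illustrative sketch of "keep at most n extreme points per side".
# NOT car's real implementation; cap_outliers() and n are hypothetical names.
cap_outliers <- function(x, n = 10) {
  s <- boxplot.stats(x)$stats            # s[1] and s[5] are the whisker ends
  low  <- which(x < s[1])                # indices below the lower whisker
  high <- which(x > s[5])                # indices above the upper whisker
  # keep at most n of the most extreme points on each side
  low  <- low[order(x[low])][seq_len(min(length(low), n))]
  high <- high[order(x[high], decreasing = TRUE)][seq_len(min(length(high), n))]
  c(low, high)
}

# Deterministic toy data: 201 central values, 15 low and 12 high outliers
x <- c(seq(-1, 1, length.out = 201), -(21:35), 21:32)

n_all    <- length(boxplot.stats(x)$out)      # every point beyond the whiskers
n_capped <- length(cap_outliers(x))           # at most 10 per side
n_nocap  <- length(cap_outliers(x, n = Inf))  # cap removed
```

On this toy vector, 27 points lie beyond the whiskers, but the capped version keeps only 20 of them, which mirrors the 32 versus 20 discrepancy in the question. In recent versions of car the limit appears to be controlled through the `n` element of the `id` argument (something like `car::Boxplot(vector, id = list(n = Inf))`), but this is an assumption to check against `?car::Boxplot` for the installed version.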

Kyabdro