I believe that "outlier" is a dangerous and misleading term. In many cases it means a data point that should be excluded from the analysis for a specific reason. Such a reason could be that a value lies beyond physical boundaries because of a measurement error, but not that "it does not fit the other points around it".
Here, you specify a statistical criterion based on the distribution of the actual data. Leaving aside that I don't find that approach appropriate here (because these data are presumably precisely measured for a given car): when you apply `remove_outliers` to the data, the function determines the outlier limits and sets the data points beyond these limits to NA.
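For reference, the `remove_outliers` function referred to here is commonly defined along these lines (a sketch of the usual 1.5 * IQR version; the exact definition used in the question may differ slightly):

```r
## Sketch of a typical remove_outliers: values outside the
## 1.5 * IQR fences around the quartiles are set to NA.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA  ## below the lower fence
  y[x > (qnt[2] + H)] <- NA  ## above the upper fence
  y
}
```

Note that the function does not drop the flagged points; it keeps the vector's length and replaces them with NA.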
## Using only column horsepower
dat <- read.csv("./cars.csv")
hp <- dat$Horsepower
## Calculates the boundaries like remove_outliers
calc.limits <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  lwr <- qnt[1] - H
  upr <- qnt[2] + H
  c(lwr, upr)
}
> calc.limits(hp)
25% 75%
-1.5 202.5
This results in a new data set with NA values. When you apply `remove_outliers` to the already reduced data set, the statistics will differ, and so will the limits. Thus, you will get "new" outliers (see Roland's comment).
hp2 <- remove_outliers(hp)
> calc.limits(hp2)
25% 75%
9 185
You can visualize this fact:
plot(hp, ylim = c(0, 250), las = 1)
abline(h = calc.limits(hp))
abline(h = calc.limits(hp2), lty = 3)

The solid lines indicate the limits of the original data, the dotted lines those of the already reduced data. First you lose 10 data points, and then another 7.
> sum(is.na(hp2))
[1] 10
> sum(is.na(remove_outliers(hp2)))
[1] 17
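If you kept reapplying the function until nothing new is flagged, the fences would keep shrinking until the remaining data contain no points beyond their own limits. A sketch of that iteration, assuming the usual 1.5 * IQR definition of `remove_outliers` (repeated here so the snippet is self-contained) and skewed synthetic data in place of `cars.csv`:

```r
## Assumed 1.5 * IQR definition of remove_outliers.
remove_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H) | x > (qnt[2] + H)] <- NA
  y
}

## Reapply remove_outliers until no further points are flagged.
strip.until.stable <- function(x) {
  repeat {
    y <- remove_outliers(x)
    if (sum(is.na(y)) == sum(is.na(x))) return(x)
    x <- y
  }
}

## Skewed synthetic data: each pass can flag additional points.
set.seed(1)
x <- rexp(200, rate = 1/50)
sum(is.na(remove_outliers(x)))   ## NAs after the first pass
sum(is.na(strip.until.stable(x)))  ## NAs once the limits stabilize
```

For a skewed distribution this can eat surprisingly far into the data, which is exactly why a purely statistical removal rule is questionable.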
In conclusion, if you don't have a good reason to remove a data point, just don't do it.