
I have a dataset in which there are some outliers due to input errors.

I have written a function to remove these outliers from my data frame (source):

remove_outliers <- function(x, na.rm = TRUE, ...) {
  ## Tukey's rule: set points beyond 1.5 * IQR from the quartiles to NA
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
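Applied column-wise, it looks something like this (a sketch; df stands in for the actual data frame):

## Run remove_outliers over every numeric column of a data frame
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], remove_outliers)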

Once I remove these outliers, the data set changes. When I check the modified data again, a new set of outliers shows up in some cases.

Is there a one-stage method that removes all possible outliers at once?

  • Please specify "some cases" and provide some data to reproduce the problem. – Jens Tierling May 18 '15 at 12:22
  • It is an extremely bad idea to remove outliers in such a recursive way. Normally, you shouldn't remove any outliers (use robust statistical methods instead), but if you have to, you must do the outlier test only once with your original data and not with data after outlier removal. – Roland May 18 '15 at 12:42
  • @JensTierling I used this data (http://web.pdx.edu/~gerbing/data/cars.csv). In this case, after outlier removal using the function, the columns Horsepower and Accelerate are still showing some outliers. – Mash May 19 '15 at 07:14
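A minimal sketch of the one-pass test Roland describes, using the cars.csv data linked above (dat is assumed to be read.csv("cars.csv")): compute the limits once from the original data and flag against those fixed limits, never re-testing the reduced data.

## Limits computed once, from the original data only; re-checking
## against these fixed limits can never produce "new" outliers
qnt <- quantile(dat$Horsepower, probs = c(.25, .75), na.rm = TRUE)
H   <- 1.5 * IQR(dat$Horsepower, na.rm = TRUE)
out <- dat$Horsepower < qnt[1] - H | dat$Horsepower > qnt[2] + H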

2 Answers


I believe that "outlier" is a very dangerous and misleading term. In many cases it means a data point which should be excluded from analysis for a specific reason. Such a reason could be that a value is beyond physical boundaries because of a measurement error, but not that "it does not fit the other points around it".

Here, you specify a statistical criterion based on the distribution of the actual data. Leaving aside that I don't find that approach appropriate here (because these data are presumably precise measurements for a given car): when you apply remove_outliers to the data, the function determines the outlier limits and sets the data points beyond these limits to NA.

## Using only the column Horsepower
dat <- read.csv("./cars.csv")

hp <- dat$Horsepower

## Calculates the boundaries like remove_outliers
calc.limits <- function(x, na.rm = TRUE) {
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    lwr <- qnt[1] - H
    upr <- qnt[2] + H
    c(lwr, upr)
}

> calc.limits(hp)
  25%   75% 
 -1.5 202.5 

This results in a new data set with NA values. When you apply remove_outliers to the already reduced data set, the statistics will differ, and so will the limits. Thus, you will get "new" outliers (see Roland's comment).

hp2 <- remove_outliers(hp)

> calc.limits(hp2)
25% 75% 
  9 185 

You can visualize this fact:

plot(hp, ylim = c(0, 250), las = 1)
abline(h = calc.limits(hp))
abline(h = calc.limits(hp2), lty = 3)

The solid lines indicate the limits of the original data, the dotted lines those of the already reduced data. First, you lose 10 data points, and then another 7.

> sum(is.na(hp2))
[1] 10

> sum(is.na(remove_outliers(hp2)))
[1] 17

In conclusion, if you don't have a good reason to remove a data point, just don't do it.

Jens Tierling

I would generally advise against removing outliers. Look into using robust procedures instead: they will down-weight the points that are far from the main trend but will not remove them from the analysis. You can also apply a robust transformation to your data and then use the transformed values in your analysis. If you still want to identify your outliers, a good method is the Median-MAD test. It works a lot better because it uses the median rather than the mean, which makes it more robust. I can post my code for the Med-MAD test here if you are interested; a sketch of the idea is below.
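For reference, a minimal sketch of such a Median-MAD test, assuming base R and the conventional cutoff of 3 (this is not the author's actual code, which was never posted):

## Flag points more than cutoff scaled median absolute deviations
## from the median; base R's mad() already rescales by 1.4826, so the
## cutoff is comparable to standard deviations under normality
mad_outliers <- function(x, cutoff = 3, na.rm = TRUE) {
    dev <- abs(x - median(x, na.rm = na.rm)) / mad(x, na.rm = na.rm)
    dev > cutoff
}

## Example with the cars.csv data from the comments above
dat <- read.csv("./cars.csv")
which(mad_outliers(dat$Horsepower))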

Davit Sargsyan