3

I have looked at a set of data and decided it would be good to remove outliers, defining an outlier as a value more than 2 standard deviations (SD) from the mean.

If I have a set of data, say 500 rows with 15 different attributes, how can I remove every row that has one or more attributes more than 2 standard deviations from the mean?

Is there an easy way to do this using R? Thanks,

ThePerson
  • 2
    If you do a StackOverflow search for `[R] remove outlier` you'll get many topical previous questions, such as: http://stackoverflow.com/q/1444306/602276 or http://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset – Andrie May 13 '12 at 05:54

2 Answers

3

There are probably lots of ways, and probably add-on packages, to deal with this. I'd suggest you try this first:

library(sos); findFn("outlier")

Here's a way you could do what you're asking for using the scale function, which standardizes vectors.

#create a data set with outliers
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i) 
    sample(c(1:50, 100:101), 200, replace=TRUE)))

#standardize each column (we use it in the outdet function)
scale(dat)

#create function that flags values 2 or more sd from the mean (in either direction)
outdet <- function(x) abs(scale(x)) >= 2
#index with the function to remove those values
dat[!apply(sapply(dat, outdet), 1, any), ]

So, in answer to your question: yes, there is an easy way, in that the code can be boiled down to one line:

dat[!apply(sapply(dat, function(x) abs(scale(x)) >= 2), 1, any), ]

And I'm guessing there's a package that may do this and more. The sos package is terrific (IMHO) for finding functions to do what you want.
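
One caveat with the one-liner: if a column contains NA, the comparison yields NA and the row index then returns NA rows. A minimal sketch that treats missing values as non-outliers is below. The helper name flag_outlier_rows and its parameter k are hypothetical, chosen here for illustration, not taken from any package.

```r
# Hypothetical helper: TRUE for rows where any numeric column lies
# k or more standard deviations from that column's mean.
flag_outlier_rows <- function(df, k = 2) {
  # keep only numeric columns; scale() cannot standardize factors or text
  num <- df[vapply(df, is.numeric, logical(1))]
  # absolute z-scores per column (scale() ignores NAs when centering/scaling)
  z <- abs(scale(as.matrix(num)))
  # na.rm = TRUE means a missing value never counts as an outlier
  rowSums(z >= k, na.rm = TRUE) > 0
}

# same example data as above
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
    sample(c(1:50, 100:101), 200, replace = TRUE)))

keep <- !flag_outlier_rows(dat)
clean <- dat[keep, ]
nrow(dat) - nrow(clean)  # number of rows dropped
```

On data without missing values this should select the same rows as the one-liner; the threshold k is exposed so that 2 can be swapped for 3 or another cutoff.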

Tyler Rinker
2
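# NOTE: the opening of this definition was cut off in the source; the name
# remove_outliers and the argument x are assumed from how the body uses them.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  # Tukey's rule: fences at 1.5 * IQR beyond the first and third quartiles
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # values outside the fences become NA (rows are not dropped here)
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}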
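
The function above marks values outside the 1.5 * IQR fences as NA in a single vector; it uses a different outlier rule than the 2 SD one in the question and does not remove rows by itself. A hedged sketch of applying the same rule to every column and then dropping the affected rows follows. The name mark_iqr_outliers is hypothetical, and the example data is borrowed from the other answer.

```r
# Hypothetical helper: replace values beyond the 1.5 * IQR fences with NA.
mark_iqr_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < qnt[1] - H | x > qnt[2] + H] <- NA
  x
}

set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
    sample(c(1:50, 100:101), 200, replace = TRUE)))

# mark outliers column by column, then keep only rows with no marked value
marked <- as.data.frame(lapply(dat, mark_iqr_outliers))
clean <- dat[complete.cases(marked), ]
```

Note that complete.cases() also drops rows that already had missing values before marking, which may or may not be what you want.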
KLDavenport