I want to find the index of the outlier flagged by the grubbs.test function from the outliers package (I adapted the function below from another SO answer here):

where = function(x) which(x==as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3]))

It works by extracting the number from the text of the grubbs.test result. It's kind of a hack, but it works well for, let's say, round numbers:

df=c(0, 3, rnorm(10))
where(df) #[1] 2

With decimal numbers, however, the digits in the text don't always match the actual number exactly:

df=c(0, sqrt(10), rnorm(10))
where(df) # integer(0)

Does anyone have an idea how to fix this? Or is there another way to find the index of the biggest outlier according to the Grubbs test? I'm trying to use this in a loop.

agenis

1 Answer

The problem is that strsplit returns strings rather than numbers. In your second example I get:

[1] "highest"          "value"            "3.16227766016838" "is"               "an"               "outlier"   

but the third element is not an exact character representation of the number: the text keeps only 15 significant digits. The actual value in your vector has more digits than that, which is why the == operator does not 'catch' it as an equality. This can be seen clearly here:

> a <- sqrt(10)
> a == as.numeric(as.character(a))
[1] FALSE
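
To see the mismatch directly, one can print the double with 17 significant digits next to what as.character() produces. This is only a small sketch in base R, and the variable names are illustrative:

```r
a <- sqrt(10)

# as.character() keeps at most 15 significant digits
s <- as.character(a)

# 17 significant digits are enough to identify a double uniquely
full <- sprintf("%.17g", a)

print(s)
print(full)

# The 15-digit string parses to a nearby, but different, double
round_trip <- as.numeric(s)
print(a == round_trip)              # exact comparison fails
print(abs(a - round_trip) < 1e-12)  # yet the difference is tiny
```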

Is there a solution to this?

YES there is.

In order to tackle this problem just use the almost.equal function that I took the liberty to copy from this R-help post:

almost.equal <- function (x, y, tolerance=.Machine$double.eps^0.5,
                          na.value=TRUE)
{
  answer <- rep(na.value, length(x))
  test <- !is.na(x)
  answer[test] <- abs(x[test] - y) < tolerance
  answer
}

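The above function is an element-wise counterpart to all.equal: it compares each element of x with y and treats them as equal when their absolute difference is below the tolerance (by default sqrt(.Machine$double.eps), roughly 1.5e-8), so it captures cases like yours.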
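
As a quick check of how the helper behaves on a vector, here is a small sketch (the function is repeated so the snippet runs on its own, and the input values are made up). Note that with the default na.value = TRUE a missing value counts as a match, so which() would also return the positions of NAs; passing na.value = FALSE avoids that.

```r
almost.equal <- function(x, y, tolerance = .Machine$double.eps^0.5,
                         na.value = TRUE) {
  answer <- rep(na.value, length(x))
  test <- !is.na(x)
  answer[test] <- abs(x[test] - y) < tolerance
  answer
}

# Made-up data: one value that matches only approximately, one NA
x <- c(0, sqrt(10), NA, 1.5)
target <- as.numeric("3.16227766016838")

which(x == target)                                # exact match finds nothing
which(almost.equal(x, target))                    # positions 2 and 3 (NA counts as TRUE)
which(almost.equal(x, target, na.value = FALSE))  # position 2 only
```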

Let's convert your function to:

where = function(x) {
  which(almost.equal(x, as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3])))
}

And let's check it now:

> df=c(0, 3, rnorm(10))
> where(df)
[1] 2

And:

> df=c(0, sqrt(10), rnorm(10))
> where(df)
[1] 2

And you have a solution that works well with decimal numbers too!!
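
If parsing the printed text feels fragile, the index can also be found without it. The sketch below assumes the default one-outlier test (type = 10) which, as far as I know, examines the observation lying farthest from the sample mean. The name where2 is made up for this example, and the function does not call grubbs.test at all:

```r
# Hypothetical helper: index of the value farthest from the mean,
# i.e. the candidate that the one-outlier Grubbs test examines (assumption).
where2 <- function(x) which.max(abs(x - mean(x, na.rm = TRUE)))

set.seed(1)  # fixed seed so the example is reproducible
df <- c(0, sqrt(10), rnorm(10))
where2(df)
```

Since it returns a single index, this fits easily in a loop. Whether that point is actually significant still has to be checked separately, for example with grubbs.test(df)$p.value.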

LyzandeR