2

I'm trying to find a good way to test whether or not the correlation between two vectors is perfect (or NA.) I've tried lots of different methods, but I'm having a similar problem with all of them, namely that the result of correlation doesn't evaluate in the way I'd expect.

This is my latest example:

foo1 <- c(4, NA, 6, NA)   
foo2 <- c(1, 2, 3, 4)
set  <- c(-1, 1, NA)
correlation <- cor(foo1, foo2, use = "na.or.complete")  # Result: 1
correlation %in% set  # Should be TRUE, is FALSE
correlation == 1      # Should also be TRUE, but is FALSE

is.numeric(correlation) is TRUE. The only value I can see in it or around it is 1. Then whyyy in the world does this not work?

set  <- c('-1', '1', NA)

This works, but I'm not sure why, and I'm worried there are ways that it might fail because I clearly don't understand what's going on with the returned value.

Any insight at all would be helpful!

Danielle McCool
  • 146
  • 2
  • 11
  • `all.equal(correlation,1)` is `TRUE` ... taking a look at R FAQ 7.31 might be useful. – Ben Bolker May 20 '13 at 17:19
  • Thanks! I knew about all.equal, but I couldn't manage to find a way to get it working with NAs in the mix, or the possibility of it being -1, which is why I was hoping to manage it differently. – Danielle McCool May 20 '13 at 17:30
  • I think if you've developed a solution to your problem you should post it as an answer rather than as a part of your question. (You are allowed and in fact encouraged to answer your own questions.) – Ben Bolker May 20 '13 at 18:19
  • 1
    if (reputation <= 10){print("We don't trust you enough to answer your own questions within 8 hours.")} Let's be honest, when my code makes the world explode sometime around iteration 5000, I think we'll all be greatful for that rule. – Danielle McCool May 20 '13 at 18:29
  • now your reputation is high enough ... – Ben Bolker May 20 '13 at 22:49

3 Answers3

2

Instead, you can make sure that they are close enough to your set:

mytol <- 1e-10
set <- set[1:2]
any(abs(correlation - set) <= mytol)|is.na(correlation) # TRUE

It's to do with floats and tolerance, I guess. I'm not sure what the standard reference is, but here's one of them: Numeric comparison difficulty in R

1 is not really 1; if you wanted an integer (and here I guess you shouldn't), you could use 1L. Vectors created like 1:3 and seq(1,3) are also integers. Look at ?integer and ?numeric for more information. Strangely, I can't find a doc page that covers the differences between the classes.

EDIT: I split off the check for NA because, as the OP pointed out, it didn't work.

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • That's a really good thing to know -- I hadn't considered floating point issues, really. In this specific circumstance, it doesn't handle the comparison against NA, though, since it does that whole R thing where NA is like the plague and if anything touches NA it suddenly becomes NA. :) I guess for 1 and -1, I could test it in that way, and have a secondary test for NA, but I don't really know how that might affect efficiency -- this is error checking for a simulation, so I'll be running it 10k times or so. – Danielle McCool May 20 '13 at 17:10
  • Ah, you're right, I forgot about the NA weirdness. I've edited the answer so that it should do the right check now. I guess for your simulation, you can put the correlations into a vector and just run this at the end. I don't think it should be too costly... – Frank May 20 '13 at 17:14
2

I think this is basically FAQ 7.31, but "how do I see if a value is within a set (within tolerance)" requires a slight extension of the standard "just use all.equal()" answer ...

test <- function(x) {
          isTRUE(all.equal(x,-1)) || 
            isTRUE(all.equal(x,1)) || is.na(x)
}
test(1-1e-14)  ## TRUE
test(NA)       ## TRUE
test(0.88)      ## FALSE

is the most precise solution to your question, although I can see that it would get difficult if you had a much longer list of candidates ... since isTRUE(all.equal(1,NA)) is FALSE (as is convenient) perhaps

test <- function(x,candidates=c(-1,1,NA), ...) {
    any(sapply(lapply(candidates,all.equal,target=x,...),
               isTRUE))
}
test(1-1e-6)    ## FALSE
test(1-1e-6,tolerance=1e-4)  ## TRUE

The one wrinkle here is that isTRUE(all.equal(NA,NaN)) is not TRUE, so one might want to build is.na() (which tests for either NA or NaN) in here somewhere, or include NaN in the list of candidates.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
1

You people were super helpful. For the interested, here's what the final result is, although I clearly have to play with my tolerances a little bit more.

corrIsOkay <- function(x, y){
  correlation  <- cor(x, y, use = "na.or.complete")
  if ((1 - abs(correlation)) <= .01 | is.na(correlation)){
    print(correlation)
    return(FALSE)
  }
  return(TRUE) 
}

It's a lot wordier than I would have liked, since my original "solution" fit inside an if statement by itself, but now I just call this function in the if statement.

makeMvMissing <- function(data) {
  repeat {
      x <- makeMissing(data)[, 1]
      y <- makeMissing(data, variable.missing = "y")[, 2]
      if (corrIsOkay(x, y)){
        break
      }
    }
  return(data.frame(x, y))
}
Danielle McCool
  • 146
  • 2
  • 11