I have a matrix with a lot of missing values and I am trying to compute correlations between the columns.

To deal with the missing values, I use

cor(matrix,use="complete")

This gives a matrix with no NA values, as desired. However, if I compute the correlation between just two of the columns, A and B,

cor(matrix[,A],matrix[,B],use="complete")

I get a different result than the one in the [A,B] entry in the matrix.

Looking at a plot of the two variables, the second result seems more reasonable.

Where could this discrepancy come from?

Zheyuan Li
Misha V
  • Welcome to SO. To help people provide answers, it is generally expected to add your data to the question to make a reproducible example. have a read of http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – dww Aug 25 '16 at 01:49

1 Answer

You are asking about the difference between "complete.obs" and "pairwise.complete.obs". (The use = "complete" in your calls is partially matched to "complete.obs".)

## example matrix
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2,1] <- NA
X[3:4,2] <- NA
X[5:6,3] <- NA

#              [,1]       [,2]        [,3]
# [1,]           NA  0.7635935 -0.22426789
# [2,]           NA -0.7990092  0.37739565
# [3,]  1.329799263         NA  0.13333636
# [4,]  1.272429321         NA  0.80418951
# [5,]  0.414641434 -0.2992151          NA
# [6,] -1.539950042 -0.4115108          NA
# [7,] -0.928567035  0.2522234  1.08576936
# [8,] -0.294720447 -0.8919211 -0.69095384
# [9,] -0.005767173  0.4356833 -1.28459935
#[10,]  2.404653389 -1.2375384  0.04672617

## complete
cor(X, use = "complete.obs")
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000

## pairwise
cor(X, use = "pairwise.complete.obs")
#            [,1]       [,2]        [,3]
#[1,]  1.00000000 -0.5531396  0.08229729
#[2,] -0.55313958  1.0000000 -0.10786401
#[3,]  0.08229729 -0.1078640  1.00000000

With use = "complete.obs", every row containing at least one NA is dropped before any correlation is computed. So it essentially does

X1 <- X[7:10, ]  ## only the last 4 rows have no `NA`
cor(X1)
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000
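
Rather than hard-coding the row indices, the same rows can be picked out with `complete.cases()`. The following is a minimal sketch on the example matrix above; `keep` is just an illustrative name, not something `cor()` uses internally.

```r
## rebuild the example matrix so this block runs on its own
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## TRUE for rows that have no NA in any column
keep <- complete.cases(X)
which(keep)   # rows 7, 8, 9, 10 in this example

## correlating only those rows should reproduce use = "complete.obs"
all.equal(cor(X[keep, ]), cor(X, use = "complete.obs"))
```

With a real matrix that has many missing values, `sum(complete.cases(X))` is a quick way to see how few rows "complete.obs" is actually working with.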

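In the "complete.obs" result, the (1,2) entry (equivalently, the (2,1) entry), -0.69629279, is computed from only 4 observations. With pairwise deletion, it is computed from the 6 rows in which both column 1 and column 2 are observed: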

cor(X[5:10, 1], X[5:10, 2])
# [1] -0.5531396
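
This is also where the discrepancy in the question comes from. When `cor()` is given two vectors, only those two vectors are checked for missing values, so "complete.obs" in that call keeps the same rows as the pairwise entry of the matrix call. The sketch below checks this on the example matrix and counts how many rows each pair of columns shares; `obs`, `n_pair`, `r_vec` and `r_pair` are names introduced here for illustration.

```r
## rebuild the example matrix so this block runs on its own
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## 1 where a value is observed, 0 where it is missing
obs <- 1 - is.na(X)

## entry (i, j) = number of rows where columns i and j are both observed;
## here every off-diagonal entry is 6 and every diagonal entry is 8
n_pair <- crossprod(obs)
n_pair

## two-vector call: a row is dropped only if column 1 or column 2 is NA
r_vec  <- cor(X[, 1], X[, 2], use = "complete.obs")

## matrix call with pairwise deletion, (1,2) entry
r_pair <- cor(X, use = "pairwise.complete.obs")[1, 2]

all.equal(r_vec, r_pair)
```

So to make the matrix call agree with the two-column call, use "pairwise.complete.obs". Each entry can then be based on a different set of rows, and `n_pair` shows how many.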
Zheyuan Li