`cor()` gives inconsistent results when given the whole matrix and when given just a pair of columns

Question

I have a matrix with a lot of missing values and I am trying to compute correlations between the columns.

To deal with the missing values, I use

cor(matrix,use="complete")

This gives a matrix with no NA values as desired. However, if I do a pairwise correlation between two of the columns A and B

cor(matrix[,A],matrix[,B],use="complete")

I get a different result than the one in the [A,B] entry in the matrix.

Looking at a plot of the two variables, the second result seems more reasonable.

Where could this discrepancy come from?

Welcome to SO. To help people provide answers, it is generally expected to add your data to the question to make a reproducible example. have a read of http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — dww, Aug 25 '16 at 01:49

Zheyuan Li · Accepted Answer · 2016-08-25T02:27:34.073

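You are asking about the difference between "complete.obs" and "pairwise.complete.obs". When cor() is given the whole matrix with "complete.obs", a row is dropped if it has an NA in any column. When it is given only two columns, only rows with an NA in one of those two columns are dropped, which is what "pairwise.complete.obs" does for every pair of columns.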

## example matrix
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2,1] <- NA
X[3:4,2] <- NA
X[5:6,3] <- NA

#              [,1]       [,2]        [,3]
# [1,]           NA  0.7635935 -0.22426789
# [2,]           NA -0.7990092  0.37739565
# [3,]  1.329799263         NA  0.13333636
# [4,]  1.272429321         NA  0.80418951
# [5,]  0.414641434 -0.2992151          NA
# [6,] -1.539950042 -0.4115108          NA
# [7,] -0.928567035  0.2522234  1.08576936
# [8,] -0.294720447 -0.8919211 -0.69095384
# [9,] -0.005767173  0.4356833 -1.28459935
#[10,]  2.404653389 -1.2375384  0.04672617

## complete
cor(X, use = "complete.obs")
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000

## pairwise
cor(X, use = "pairwise.complete.obs")
#            [,1]       [,2]        [,3]
#[1,]  1.00000000 -0.5531396  0.08229729
#[2,] -0.55313958  1.0000000 -0.10786401
#[3,]  0.08229729 -0.1078640  1.00000000

For use = "complete.obs", any rows with at least one NA will be dropped. So it essentially does

X1 <- X[7:10, ]  ## only the last 4 rows have no `NA`
cor(X1)
#            [,1]        [,2]        [,3]
#[1,]  1.00000000 -0.69629279 -0.09773585
#[2,] -0.69629279  1.00000000 -0.01228196
#[3,] -0.09773585 -0.01228196  1.00000000
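
Rather than hard-coding rows 7:10, the same subset can be selected with `complete.cases()`. This is only a minimal sketch that rebuilds the example matrix so it runs on its own; the names `keep` and `X1` are just illustrative.

```r
## rebuild the example matrix from above
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## TRUE for rows that have no NA in any column
keep <- complete.cases(X)
which(keep)
# [1]  7  8  9 10

## listwise deletion by hand, then an ordinary correlation
X1 <- X[keep, , drop = FALSE]
all.equal(cor(X1), cor(X, use = "complete.obs"))
# [1] TRUE
```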

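Here, the (1,2) or (2,1) entry -0.69629279 is computed from only 4 observations (rows 7 to 10). With pairwise deletion, however, columns 1 and 2 share 6 complete observations (rows 5 to 10), so the correlation is computed from those: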

cor(X[5:10, 1], X[5:10, 2])
# [1] -0.5531396
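
To tie this back to the question, the sketch below checks that calling `cor()` on two columns with `use = "complete.obs"` reproduces the pairwise entry rather than the listwise one, and counts how many observations each pair of columns actually shares. It rebuilds the example matrix so it is self-contained; `r_pair`, `r_listwise`, `r_pairwise`, `obs` and `n_pairs` are illustrative names, not part of any API.

```r
## rebuild the example matrix from above
set.seed(0)
X <- matrix(rnorm(10 * 3), ncol = 3)
X[1:2, 1] <- NA
X[3:4, 2] <- NA
X[5:6, 3] <- NA

## two-column call: only NAs in columns 1 and 2 cause a row to be dropped
r_pair     <- cor(X[, 1], X[, 2], use = "complete.obs")
## whole-matrix calls: listwise vs pairwise deletion
r_listwise <- cor(X, use = "complete.obs")[1, 2]
r_pairwise <- cor(X, use = "pairwise.complete.obs")[1, 2]

all.equal(r_pair, r_pairwise)   # the two-column result matches pairwise
# [1] TRUE
isTRUE(all.equal(r_pair, r_listwise))   # but not listwise
# [1] FALSE

## number of jointly observed rows for every pair of columns
obs     <- !is.na(X)          # logical indicator of observed cells
n_pairs <- crossprod(obs * 1) # entry (i, j) = rows observed in both i and j
n_pairs
#      [,1] [,2] [,3]
# [1,]    8    6    6
# [2,]    6    8    6
# [3,]    6    6    8
```

If the pairwise counts differ a lot between column pairs, the entries of the pairwise correlation matrix are based on different subsets of the data, which is worth keeping in mind when comparing them.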
