
Running:

cor(x, use = "pairwise.complete.obs")

versus running:

c <- cov(x, use = "pairwise.complete.obs")
cov2cor(c)

gives different results. Does anyone know why, and which one gives the correct results? Both functions call C code, which I haven't figured out how to parse.

Reproducible data:

set.seed(1) # fix the seed so the example is reproducible
x <- data.frame(a1 = rnorm(10), a2 = rnorm(10), a3 = rnorm(10))
x$a1[c(1, 3)] <- NA

c <- cov(x, use = "pairwise.complete.obs")
cov2cor(c)
cor(x, use = "pairwise.complete.obs")
– Jaeoc

3 Answers


Summary

It turns out the difference lies in how the variances are computed. If we have only the variables x and y, where x has some NAs, then the cov function computes y_mean and var_y using all observations in y. The cor function, on the other hand, computes y_mean and var_y using only the values in y that correspond to non-missing values in x.
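The whole difference fits in two lines (a minimal, self-contained sketch; the data frame dat2 is made up for illustration):

set.seed(1)
dat2 <- data.frame(x = rnorm(10), y = rnorm(10))
dat2$x[c(1, 3)] <- NA

var(dat2$y)                   # var_y as cov() computes it: all 10 values of y
var(dat2$y[!is.na(dat2$x)])   # var_y as cor() uses it: only the 8 paired values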

Since all variables involved in an analysis should have the same number of observations, it doesn't make sense, when computing the correlation statistic, for one of the inputs (the covariance) to be based on fewer observations than the other inputs (the variances). This is perhaps a matter of opinion, and it might not matter much as long as the proportion of missing values is low, but it is clearly undesirable that these two functions silently provide different results.

Motivation

The following Stack Overflow question explains how to look at the underlying C code: How can I view the source code for a function? And here is a link to the code for the C_cov function: https://svn.r-project.org/R/trunk/src/library/stats/src/cov.c
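For instance, printing cor at the R console shows that, after the R-level argument handling, it dispatches to the compiled routine via .Call (output abbreviated here; the exact lines vary across R versions):

stats::cor
# function (x, y = NULL, use = "everything", method = c("pearson", ...
#     ...
#     .Call(C_cor, x, y, na.method, FALSE)
#     ...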

Below is some R code that demonstrates the difference in computation. First, the automatic computation:

dat <- data.frame(x = rnorm(10), y = rnorm(10))
dat$x[c(1,3)] <- NA

d <- cov(dat, use = "pairwise.complete.obs")
cov_version <- cov2cor(d)
cor_version <- cor(dat, use = "pairwise.complete.obs")

Compute manually


#Computing the covariance is the same for both cor and cov functions
n <- length(dat$x[!is.na(dat$x)]) #n of non-missing values
x_bar <- mean(dat$x, na.rm = TRUE) 
y_bar <- mean(dat$y[!is.na(dat$x)]) #mean of y values where x is not NA
n1 <- n - 1 #degrees of freedom (the denominator)

s_x <- dat$x[!is.na(dat$x)] - x_bar
s_y <- dat$y[!is.na(dat$x)] - y_bar
ss_xy <- sum(s_x*s_y) #sums of squares (x, y)
cov_xy <- ss_xy / n1 #same results as cov(dat, use = "pairwise.complete.obs")
all.equal(cov_xy, d[1,2])
[1] TRUE

#The cor approach to computing variances
ss_x <- sum(s_x*s_x)
ss_y <- sum(s_y*s_y)
var_x <- ss_x / n1
var_y <- ss_y / n1

cor_xy <- cov_xy / (sqrt(var_x) * sqrt(var_y)) #same as cor(dat, use = "pairwise.complete.obs")
all.equal(cor_xy, cor_version[1,2])
[1] TRUE

#The cov approach to computing variances, different for the variable without missing values
y_bar2 <- mean(dat$y) #all values of y
s_y2 <- dat$y - y_bar2 
ss_y2 <- sum(s_y2*s_y2)
var_y2 <- ss_y2 / (length(dat$y) - 1) #n comes from all values of y
cov_cor <- cov_xy / (sqrt(var_x) * sqrt(var_y2))

all.equal(cov_cor, cov_version[1,2])
[1] TRUE
– Jaeoc

I'm not entirely sure, but my first guess is that it depends on the number of data points used for the calculation. Your data:

x <- data.frame(a1 = rnorm(10), a2 = rnorm(10), a3 = rnorm(10))
x
           a1          a2           a3
1          NA -0.13838924 -0.321692757
2  -0.4508542  0.94765320  1.951455501
3          NA  2.31819262 -1.411309267
4   0.7828306 -0.53437879 -1.073991019
5   0.2298984 -0.46396636 -0.008471517
6   0.6543559 -1.67425582  0.163433255
7   0.1454043  0.37971651  1.197296537
8   0.2055262  0.67526617 -0.913544378
9   0.3360491  1.84172088  0.366159132
10  0.2507107  0.08284055  1.819004908

Here you can see that a1 has 2 missing values. Thus the correlations between a1 and a2 and between a1 and a3 are based on only 8 data points, while the correlation between a2 and a3 is based on all 10 data points.

From the help of ?cov2cor:

cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently.
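In other words, cov2cor only ever sees the matrix it is given. A rough equivalent of its scaling (the names V and d here are just for illustration):

V <- cov(x, use = "pairwise.complete.obs")
d <- sqrt(diag(V))                      # standard deviations taken from the diagonal
all.equal(V / outer(d, d), cov2cor(V))  # TRUE: the scaling uses only diag(V)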

When cov2cor scales your covariance matrix, it implicitly assumes that every entry was computed from the same number of data points. When you use only complete observations, the results are indeed the same:

> c <- cov(x, use = "complete.obs")
> cov2cor(c)
           a1         a2        a3
a1  1.0000000 -0.2230619 0.2580796
a2 -0.2230619  1.0000000 0.6152059
a3  0.2580796  0.6152059 1.0000000
> cor(x, use = "complete.obs")
           a1         a2        a3
a1  1.0000000 -0.2230619 0.2580796
a2 -0.2230619  1.0000000 0.6152059
a3  0.2580796  0.6152059 1.0000000

So you need to ask yourself what information you need. If you use pairwise complete observations, you should not use cov2cor, in my opinion.

– dasds
  • Hmm, could you clarify what you mean by `cov2cor` assuming the same number of data points? Perhaps I'm missing something, but to my understanding `cov2cor` makes no such assumption; rather, it scales the covariances simply by using the diagonal of the computed covariance matrix itself (according to the `cov2cor` code). – Jaeoc Oct 29 '19 at 14:01
  • That is what I meant. But the covariance of a1 with both a2 and a3 is calculated from only 8 data points, while the covariance between a2 and a3 is calculated from 10 data points. When `cov2cor` then scales the covariance matrix, it does not know about the missing values in the original dataset. This is why the result is the same with the option `use = "complete.obs"`. – dasds Oct 29 '19 at 14:05
  • I have now managed, with some trial and error and by looking at the C code, to figure out the problem. You were right that it has to do with the number of observations, but the problem is not any assumption in the `cov2cor` function; rather, it is a difference in how the `cor` and `cov` functions compute the variances. I will leave a complete answer in a bit. – Jaeoc Oct 30 '19 at 13:16

@Jaeoc gave a great explanation; here is a slightly different intuition:

Intuitively, when you have missing values you end up with a different number of observations for each pair, so you would need to store a different version of the variance for each variable (say, var(y) using all observations, var(y) using the same observations as z, and so on). The problem is that a covariance matrix can only store one version of the variance for each variable (the diagonal elements). So using cov2cor(), or even computing manually from the covariance matrix, will not give you the correct result.
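A quick way to verify this (a self-contained sketch; the object names d and ok are just illustrative): the pairwise entry that cor() reports is an ordinary correlation computed on the pairwise-complete rows, which requires a variance of y computed on those same rows:

set.seed(42)
d <- data.frame(x = rnorm(20), y = rnorm(20))
d$x[1:5] <- NA

ok <- complete.cases(d)   # rows where both x and y are observed
all.equal(cor(d$x[ok], d$y[ok]),
          cor(d, use = "pairwise.complete.obs")["x", "y"])   # TRUE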

Note that in the extreme case where you have fewer observations than variables, cov2cor() might even give you correlations greater than 1 in absolute value!

set.seed(123)
dat <- data.frame(x = rnorm(4), y = rnorm(4), z = rnorm(4))
dat[1,3] <- NA
dat
#>             x          y          z
#> 1 -0.56047565  0.1292877         NA
#> 2 -0.23017749  1.7150650 -0.4456620
#> 3  1.55870831  0.4609162  1.2240818
#> 4  0.07050839 -1.2650612  0.3598138

## Count number of pairwise observations:
crossprod(as.matrix(!is.na(dat)))
#>   x y z
#> x 4 4 3
#> y 4 4 3
#> z 3 3 3

## Compare results
cov2cor(cov(dat, use = "pairwise.complete.obs"))
#>             x           y          z
#> x  1.00000000 -0.01630914  0.9632927
#> y -0.01630914  1.00000000 -0.4893276
#> z  0.96329271 -0.48932759  1.0000000
cor(dat, use = "pairwise.complete.obs")
#>             x           y          z
#> x  1.00000000 -0.01630914  0.9408492
#> y -0.01630914  1.00000000 -0.4005502
#> z  0.94084924 -0.40055017  1.0000000

## Extreme case: N<p
dat_S <- dat[1:3,]

cov2cor(cov(dat_S, use = "pairwise.complete.obs"))
#>            x          y         z
#> x  1.0000000 -0.1777287  1.109409
#> y -0.1777287  1.0000000 -1.060258
#> z  1.1094092 -1.0602577  1.000000
cor(dat_S, use = "pairwise.complete.obs")
#>            x          y  z
#> x  1.0000000 -0.1777287  1
#> y -0.1777287  1.0000000 -1
#> z  1.0000000 -1.0000000  1

Created on 2020-08-26 by the reprex package (v0.3.0)
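To see where the out-of-range value comes from, divide the stored entries by hand (a manual check, reusing dat_S from the reprex above): the covariance of x and z is based on only 2 pairs, while the diagonal variances come from different row subsets (var(x) from all 3 rows, var(z) from 2):

V <- cov(dat_S, use = "pairwise.complete.obs")
V["x", "z"] / sqrt(V["x", "x"] * V["z", "z"])   # reproduces the 1.109409 entry above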

– Matifou