Summary
It turns out the difference is in how the variances are computed. If we have only the variables x
and y
where x
has some NAs, then the cov
function will compute the y_mean
and var_y
using all observations in y
. The cor
function on the other hand, will compute y_mean
and var_y
using only the values in y
that correspond to non-missing values in x
.
Since all variable involved in an analysis should have the same number of observations, it doesn't make sense when computing the correlation statistics that one of the inputs (covariances) consist of fewer observations than the other input (variances). Perhaps a matter of opinion, and might not matter so much as long as the proportion missing is low, but clearly undesirable that these two functions should silently provide different results.
Motivation
The following StackoverFlow question explains how to look at the C++ code: How can I view the source code for a function? and here's a link to the code for the C_cov
function: https://svn.r-project.org/R/trunk/src/library/stats/src/cov.c
Below some R-code that demonstrates the differences in computation. First the automatic computation:
dat <- data.frame(x = rnorm(10), y = rnorm(10))
dat$x[c(1,3)] <- NA
d <- cov(dat, use = "pairwise.complete.obs")
cov_version <- cov2cor(d)
cor_version <- cor(dat, use = "pairwise.complete.obs")
Compute manually
#Computing the covariance is the same for both cor and cov functions
n <- length(dat$x[!is.na(dat$x)]) #n of non-missing values
x_bar <- mean(dat$x, na.rm = TRUE)
y_bar <- mean(dat$y[!is.na(dat$x)]) #mean of y values where x is not NA
n1 <- n -1
s_x <- dat$x[!is.na(dat$x)] - x_bar
s_y <- dat$y[!is.na(dat$x)] - y_bar
ss_xy <- sum(s_x*s_y) #sums of squares (x, y)
cov_xy <- ss_xy / n1 #same results as cov(dat, use = "pairwise.complete.obs")
all.equal(cov_xy, d[1,2])
[1] TRUE
#The cor approach to computing variances
ss_x <- sum(s_x*s_x)
ss_y <- sum(s_y*s_y)
var_x <- ss_x / n1
var_y <- ss_y / n1
cor_xy <- cov_xy / (sqrt(var_x) *sqrt(var_y)) #same as cor(dat, use = "pairwise.complete.obs")
all.equal(cor_xy, cor_version[1,2])
[1] TRUE
#The cov approach to computing variances, different for the variable without missing values
y_bar2 <- mean(dat$y) #all values of y
s_y2 <- dat$y - y_bar2
ss_y2 <- sum(s_y2*s_y2)
var_y2 <- ss_y2 / (length(dat$y) - 1) #n comes from all values of y
cov_cor <- cov_xy / (sqrt(var_x) * sqrt(var_y2))
all.equal(cov_cor, cov_version[1,2])
[1] TRUE