Getting different cor() output on same data?

Question

I have a problem when calculating correlations. I have two datasets: d3: One with 1259 observations and 264 variables. d4: One with 1196 observations and 185 variables - these are only blood tests. Both data have same unique ID for participants, so that they can be merged.

When merging (called: d) I have 1150 observations (because them without bloodtests are out, and the bloodtest data had many control samples and so on. When remowing participants without information x (since this is inclusion criteria) we end up with 1144 observations.

In d4 I had two samples with missing values in all the variables. They are included in the 1144 participants.

So after sum datamanagement I want to calculate correlations between informations x (we call it Edu) and all the blood samples (184).

I make a new dataset by subsetting from d:

dbio <- d %>%
  select(Edu, BMP_6:CCL16)

I use these codes: First two functions:

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

Then I start making correlations:

kor_masterlist_dbio <- flattenSquareMatrix (cor.prob(dbio))

kor_masterlist_dbio_ordnet <- kor_masterlist_dbio[order(-abs(kor_masterlist_dbio$cor)),]

kor_masterlist_dbio$threshold <- as.factor(kor_masterlist_dbio$p < 0.05)

kor_masterlist_dbio_Edu <- subset(kor_masterlist_dbio_ordnet, i == "Edu" & j != "ID")

kor_masterlist_dbio_Edu$threshold <- as.factor(kor_masterlist_dbio_Edu$p < 0.05)


kor_masterlist_dbio_Edu[kor_masterlist_dbio_Edu$threshold==T,]

kor_masterlist_dbio_Edu

Now the question: I use "complete.obs" in cor(). The two participants with all blood test missing(NA), if i take them out of the data by following code:

d <- d[rowSums(is.na(d[,3:6]))!=4,]
And end up with d: n = 1142.

I then get different results from the correlations. I end up with two less blood test as signifcant, as in p-value above 0.05. I dont understand the difference when the values are missing for those two participants.

When using "pariwise" ind cor() i get same results with 1142 or 1144 participants. But the last 4 blood tests are different from them I get with "complete.obs"

And the rest with slightly different correlationcoefficients and p-values. But stille same ranking.

I hope someone can help me. I should have same results wether the to participants are included or not.

I've tried to make following small reproducible example that shows my problem. You will get different results if you run it with/without:

d <- d[rowSums(is.na(d[,3:6]))!=4,]

ID <- c(1,2,3,4,5,6,7,8,9,10)
Edu <- c(4,7,9,13,15,18,11,10,12,8)
CLL <- c(NA,0.33,0.45,NA,0.76,0.34,0.49,0.86,0.43,0.26)
BLL <- c(NA,0.93,0.15,NA,0.86,0.14,0.29,0.36,0.93,0.76)
ALL <- c(NA,0.53,0.65,NA,0.26,0.54,0.99,0.76,0.63,NA)
DLL <- c(NA,0.13,0.95,NA,0.36,0.74,0.19,NA,0.83,0.56)

d <- data.frame(ID, Edu, CLL, BLL, ALL, DLL)
d <- d[rowSums(is.na(d[,3:6]))!=4,]

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

kor_masterlist_d <- flattenSquareMatrix (cor.prob(d))
kor_masterlist_d_order <- kor_masterlist_d[order(-abs(kor_masterlist_d$cor)),]
kor_masterlist_d_Edu <- subset(kor_masterlist_d_order, i == "Edu" & j != "ID")
kor_masterlist_d_Edu
#>      i   j        cor         p
#> 8  Edu ALL -0.3319602 0.4217894
#> 3  Edu CLL  0.2646661 0.5264383
#> 12 Edu DLL  0.2609108 0.5325405
#> 5  Edu BLL -0.2492912 0.5515835

^{Created on 2021-07-09 by the reprex package (v2.0.0)}

ID <- c(1,2,3,4,5,6,7,8,9,10)
Edu <- c(4,7,9,13,15,18,11,10,12,8)
CLL <- c(NA,0.33,0.45,NA,0.76,0.34,0.49,0.86,0.43,0.26)
BLL <- c(NA,0.93,0.15,NA,0.86,0.14,0.29,0.36,0.93,0.76)
ALL <- c(NA,0.53,0.65,NA,0.26,0.54,0.99,0.76,0.63,NA)
DLL <- c(NA,0.13,0.95,NA,0.36,0.74,0.19,NA,0.83,0.56)

d <- data.frame(ID, Edu, CLL, BLL, ALL, DLL)

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="complete.obs", method = c("pearson"))
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

flattenSquareMatrix <- function(m) {
  if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") 
  if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor=t(m)[ut],
             p=m[ut])
}

kor_masterlist_d <- flattenSquareMatrix (cor.prob(d))
kor_masterlist_d_order <- kor_masterlist_d[order(-abs(kor_masterlist_d$cor)),]
kor_masterlist_d_Edu <- subset(kor_masterlist_d_order, i == "Edu" & j != "ID")
kor_masterlist_d_Edu
#>      i   j        cor         p
#> 8  Edu ALL -0.3319602 0.3487063
#> 3  Edu CLL  0.2646661 0.4599218
#> 12 Edu DLL  0.2609108 0.4665494
#> 5  Edu BLL -0.2492912 0.4873203

^{Created on 2021-07-09 by the reprex package (v2.0.0)}

It would be easier to help if you create a small reproducible example along with expected output. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, Jul 09 '21 at 10:21
Thanks for your comment. I've just added a small reproducible example that demonstrates my issue. — BHS, Jul 09 '21 at 11:29

Armali · Accepted Answer · 2021-07-12T15:04:26.653

1

The difference is introduced with

  Fstat <- r2 * dfr/(1 - r2)

due to different values of dfr (8 and 6).

> r2
 [1] 0.234375000 0.011456074 0.070048129 0.002401998 0.062146106 0.062751663
 [7] 0.096834764 0.110197604 0.165400138 0.216547595 0.057255529 0.068074453
[13] 0.030140179 0.136955494 0.005027238
> r2*8/(1-r2)
 [1] 2.44897959 0.09271069 0.60259574 0.01926226 0.53011332 0.53562464
 [7] 0.85773686 0.99076023 1.58543173 2.21121379 0.48586255 0.58437675
[13] 0.24861473 1.26951037 0.04042111
> r2*6/(1-r2)
 [1] 1.83673469 0.06953302 0.45194680 0.01444669 0.39758499 0.40171848
 [7] 0.64330264 0.74307017 1.18907380 1.65841034 0.36439691 0.43828256
[13] 0.18646105 0.95213278 0.03031583

Neither of 8 or 6 is correct, since it's based on nrow(X), while for use="complete.obs" it has to be based on the number of complete observations. This can be accomplished by changing the function definition to cor.prob <- function (X, dfr = sum(complete.cases(X)) - 2) { …. Therewith, the same results are produced with and without the prior
d <- d[rowSums(is.na(d[,3:6]))!=4,]

But if I choose to use pairwise.complete.obs instead of complete.obs, then i'll keep the code as it was? And further, since i have NA values different places in my data then i will benefite from "pairwise" rather than "complete.obs"?

Indeed if we use pairwise.complete.obs, we gain the observations where only part of the columns are NA. But, since we then have different numbers of observations for the individual columns, a single dfr value is not appropriate; instead, we can use a dfr matrix:

library(psych)
cor.prob <- function (X, dfr = pairwiseCount(X) - 2)
{
  …
  Fstat <- r2 * dfr[above]/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr[above])
  …

edited Jul 12 '21 at 15:04

answered Jul 10 '21 at 08:55

Armali

18,255
14
57
171

1

Thanks. That's really useful. Do you have a solution so I could still calculate correlations and make matrix with coeff and p values as above? But with right results? – BHS Jul 10 '21 at 20:33
Maybe, but I'd first have to make myself familiar with the theory behind this computation. Perhaps you can point to a short introduction? – Armali Jul 10 '21 at 21:01
1

I have this, it might help (the linked mail in one of the answers is quite helpful). https://stackoverflow.com/questions/13486000/pairwise-correlation-table – BHS Jul 11 '21 at 11:38
I may be wrong, but I think I see more clearly now. The documentation says about `use="complete.obs"`: _missing values are handled by casewise deletion_. The key word likely is _casewise_ and means that whole rows of `X` are deleted from the input if any column is `NA`. So, not only rows 1 and 4 with four `NA` are excluded from the computation, but as well rows 8 and 10 with only one `NA`. So neither 8 nor 6 is the correct `dfr`, but rather 4 (number of complete observations 6, minus 2). I'll add to the answer above a way to take the correct value. – Armali Jul 11 '21 at 19:50
1

Thank you. It really helps me out. And further, since I have NA values different places in my data then i will benefite from "pairwise" rather than "complete.obs"? – BHS Jul 12 '21 at 10:56
1

I've just seen that i get same results either way. Thanka again. – BHS Jul 12 '21 at 11:55
I added a paragraph regarding "pairwise". – Armali Jul 12 '21 at 14:48
1

That is great. Thank you. – BHS Jul 12 '21 at 22:10

Getting different cor() output on same data?

1 Answers1