
I have three dataframes in R; let's call them A, B, and C. Dataframe C contains two columns: the first holds row names from dataframe A and the second holds row names from dataframe B:

C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))

For each row of table C I want to calculate the correlation coefficient and p-value between the corresponding rows of tables A and B (i.e. correlate the values of row a12 of table A with row b6 of table B, row a9 of table A with row b54 of table B, etc.) and store the results in additional columns of table C. This is my current naive and highly inefficient code:

for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  C[i, 3] <- correlation$estimate
  C[i, 4] <- correlation$p.value
}

The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the resulting large vectors:

library(Hmisc)
A <- t(A)
B <- t(B)
ind.A <- match(C[, 1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[, 2], colnames(B))
B <- B[, ind.B]
rc <- rcorr(as.matrix(A), as.matrix(B), type = "spearman")
C[, 3] <- diag(rc$r[1:ncol(A), (ncol(A) + 1):(ncol(A) + ncol(B))])
C[, 4] <- diag(rc$P[1:ncol(A), (ncol(A) + 1):(ncol(A) + ncol(B))])
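
For reference, a pair-by-pair version that avoids building Hmisc's full correlation matrix is sketched below (it uses the original, untransposed A and B as in my first loop, and the names res and ct are just placeholders); it keeps memory low, but it still makes one cor.test call per pair, so it does not solve the speed problem:

# Pair-by-pair sketch: compute only the needed correlations, no full rcorr matrix.
# Uses the original (untransposed) A and B; res/ct are illustrative names.
res <- mapply(function(a, b) {
  ct <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(rho = unname(ct$estimate), p = ct$p.value)
}, C[, 1], C[, 2])
C[, 3] <- res["rho", ]
C[, 4] <- res["p", ]
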
  • One way to speed it up is to use parallelization. See if the following post helps https://stackoverflow.com/questions/46532657/convert-r-apply-statement-to-lapply-for-parallel-processing – hyena Jan 02 '22 at 04:12
  • Thank you very much for your guidance. This approach accelerated my analysis approximately 4 times – BioNovice247 Jan 02 '22 at 08:56

2 Answers


Based on the comment by @HYENA, I tried parallel processing. This approach accelerated the process approximately 4 times (with 8 cores). The code:

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# one cor.test per matched pair, distributed across the workers
cor.res <- foreach(i = 1:nrow(C)) %dopar% {
  a <- C[i, 1]
  b <- C[i, 2]
  correlation <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
stopCluster(cl)

cor.res <- data.frame(Reduce("rbind", cor.res))
C[, c(3, 4)] <- cor.res
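
A further tweak that may help (an untested sketch; the 1000-row chunk size and the names n, chunks, idx and ct are arbitrary placeholders): handing each worker a block of rows instead of a single row reduces foreach's per-task overhead. It would replace the per-row loop above, with the cluster still registered:

# Sketch: chunk the rows of C so each parallel task runs many cor.test calls
n <- nrow(C)
chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))  # ~1000 pairs per task (arbitrary)
cor.res <- foreach(idx = chunks, .combine = rbind) %dopar% {
  t(sapply(idx, function(i) {
    ct <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
    c(unname(ct$estimate), ct$p.value)
  }))
}
C[, c(3, 4)] <- data.frame(cor.res)
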
  • which cpu do you have on your machine? – hyena Jan 02 '22 at 18:05
  • @HYENA I tested this approach on my own laptop with 12 Gb RAM and the following processor: Intel Core i7-10510U CPU @ 1.80GHz 2.30 GHz, but the server I'm working on has 30 Gb of RAM and the following processor with 20 cores: Intel Xeon CPU E5-2699 v4 @ 2.20GHz 2.20 GHz – BioNovice247 Jan 04 '22 at 07:41
  • there are other things you can do to speed this up: 1) try to use Intel oneAPI MKL instead of the default library; 2) set the number of threads a bit higher than the number of physical cores just to saturate the capacity; 3) strip unnecessary argument-checking code, as in @G. Grothendieck's answer here. You may also write C++ code directly using `Rcpp` to go further. – hyena Jan 04 '22 at 16:57

Extract just the part you need from cor.test, giving cor_test1, and use that instead; or, in addition, create a lookup table for the p values, giving cor_test2, which is slightly faster than cor_test1.

Based on the median column of the benchmark below, with vectors of length 10, these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, it is included because the speedup may depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.

# given correlation and degrees of freedom, output the two-sided p value
# (2 * smaller tail, matching cor.test's default alternative = "two.sided")
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}

# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  pval <- 2 * min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}

# even faster version of cor.test.
# Given x, y and the pvals lookup table, return a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  c(r, pvals[100 * round(r, 2) + 101])
}

# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)

library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))

giving:

Unit: microseconds
                   expr   min    lq    mean median     uq     max neval cld
         cor.test(x, y) 253.7 256.7 346.278 266.05 501.45   650.6   100   a
        cor_test1(x, y)  84.8  87.2 346.777  89.10 107.40 22974.4   100   a
 cor_test2(x, y, pvals)  72.4  75.0 272.030  79.45  91.25 17935.8   100   a
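
To connect this back to the question's A, B and C, one possible way to apply it is sketched below (a sketch only; it assumes the same row-name lookup as in the question's loop and uses the plain Pearson correlation that cor_test1 computes):

# Sketch: apply cor_test1 to every matched pair of rows listed in C
res <- mapply(function(a, b) cor_test1(unlist(A[a, ]), unlist(B[b, ])),
              C[, 1], C[, 2])
C[, 3] <- res[1, ]  # correlation estimate
C[, 4] <- res[2, ]  # p value
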
G. Grothendieck