0

First post, so be gentle ;-)

I have a scenario, where I want to correlate all the rows of mat A (around 50,000) with all columns of mat B (around 100). I have solved this by doing so:

output = c()
for( i in 1:nrow(A) ){
    for(j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output = rbind(output,c(rownames(A)[i],colnames(B)[j],
                                myTest$p.value,myTest$estimate))
    }
}

But it is hopelessly slow, after 30 hours of runtime, it is still not finished.

There must be a smarter way of doing this? :-)

Cheers!

smci
  • 32,567
  • 20
  • 113
  • 146
LeonDK
  • 142
  • 10
  • 1
    Please provide small examples of `A` and `B` and the desired result. – David Arenburg Mar 04 '15 at 12:05
  • `A` and `B` are numeric matrices with NA values. The desired result is a (big) table of 5e6 rows and 4 columns, which I will then sort and identify the strongest correlations (lowest p-value after fdr correction) - Does this make sense? :-) – LeonDK Mar 04 '15 at 12:11
  • Please follow the examples in that [link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – David Arenburg Mar 04 '15 at 12:17
  • Create these matrices: `A = matrix(rnorm(16),nrow=4,ncol=4) B = matrix(rnorm(16),nrow=4,ncol=4) rownames(A)=c("Arow1","Arow2","Arow3","Arow4") colnames(A)=c("Acol1","Acol2","Acol3","Acol4") rownames(B)=c("Brow1","Brow2","Brow3","Brow4") colnames(B)=c("Bcol1","Bcol2","Bcol3","Bcol4")` And then run the code in original question - Better? :-) – LeonDK Mar 04 '15 at 12:27

2 Answers2

2

Ok, so I tried out @Math suggestion and decided to time it using the following code:

# Clear workspace
rm(list = ls())

# Reproducible results
set.seed(42)

# Set dimensions
n1 = 500
n2 = 150
n3 = 100

# Create matrices
A = matrix(rnorm(n1*n2),nrow=n1,ncol=n2)
B = matrix(rnorm(n2*n3),nrow=n2,ncol=n3)

# Assign row/col names
rownames(A)=paste("Arow",seq(1,nrow(A)),sep="")
colnames(A)=paste("Acol",seq(1,ncol(A)),sep="")
rownames(B)=paste("Brow",seq(1,nrow(B)),sep="")
colnames(B)=paste("Bcol",seq(1,ncol(B)),sep="")

# State number of correlations to be performed
cat(paste("Total number of correlations =",nrow(A)*ncol(B),"\n"))

# Test 1 using rbind()
cat("Starting test 1 with rbind()\n")
ptm = proc.time()
output = c()
for( i in 1:nrow(A) ){
    for( j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output = rbind(output,c(rownames(A)[i],colnames(B)[j],
                                myTest$p.value,myTest$estimate))
    }
}
print(proc.time() - ptm)

# Test 2 using pre-built matrix
cat("Starting test 2 with pre-built matrix\n")
ptm = proc.time()
output = matrix(0, nrow=nrow(A)*ncol(B), ncol=4)
count  = 1
for( i in 1:nrow(A) ){
    for( j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output[count,] = c(rownames(A)[i],colnames(B)[j],
                           myTest$p.value,myTest$estimate)
        count = count + 1
    }
}
print(proc.time() - ptm)

Running this code, produces the following results:

Total number of correlations = 50000 
Starting test 1 with rbind()
   user  system elapsed 
275.560   6.963 282.913 
Starting test 2 with pre-built matrix
   user  system elapsed 
 29.869   0.218  30.114 

So evidently there is quite a difference, ac. I was not aware of this 'problem' with using the rbind() function to gradually build matrices. Thank you @Math for pointing this out! :-)

Cheers!

LeonDK
  • 142
  • 10
1

Your code is slow mainly because your do rbind, which creates a new matrix and copies all the data from the previous one. This generates a huge overhead.

A simple solution is to create the matrix before the loop, and then fill it :

output = matrix(0, nrow=nrow(A)*ncol(B), ncol=4)
for(i in 1:nrow(A)){
    for(j in 1:ncol(B) ){
        myTest = cor.test(A[i,],B[,j],method="spearman")
        output[(i-1)*ncol(B)+j,] = c(rownames(A)[i],colnames(B)[j],
                                myTest$p.value,myTest$estimate)
    }
}
Math
  • 2,399
  • 2
  • 20
  • 22
  • I accepted the answer, but the part about indexing the new matrix: `output[i*nrow(A)+j,]` will not work. I introduced a simple count variable (see my post) to fix this. Cheers! :-) – LeonDK Mar 05 '15 at 07:26
  • Yep, it's `i*ncol(B)+j` (or `i+j*nrow(A)`) instead of `i*nrow(A)+j`. Sorry for that, I fixed it (but the counter solution is a good one to). – Math Mar 05 '15 at 09:16
  • I decided on the counter, thinking that that way i'd avoid having to compute the appropriate combination value of `i` and `j` a total of `nrow(A)*ncol(B)` times. Btw. I still don't think your suggestion will work? Given `i*ncol(B)+j`, then the very first row numbering will yield `i*ncol(B)+j = 1*ncol(B)+1 > 1`, for all values of `ncol(B) > 0`? :-) – LeonDK Mar 05 '15 at 09:37
  • Yes, it's `i-1` or `j-1`. I'm used to arrays starting at zero, and to for loop ranging from 0 to len-1, it's a common mistake I make, sorry. – Math Mar 05 '15 at 09:57