
I have an array data = array[1:50,1:50,1:50]; its values are real numbers between -1 and 1.

"data" can be treated as a 50x50x50 cube.

I need to create a correlation matrix (after removing all zeros) based on this equation:

value = (x+y)-|x-y|

The number of possible pairs is (50x50x50)*((50x50x50)-1)/2 = 7,812,437,500, and the full correlation matrix holds twice that many off-diagonal entries.

Here is what I did, using a 3x3x3 example:

arr = array(rnorm(10), dim=c(3,3,3))

data = data.frame(array(arr))


data$voxel <- rownames(data) 

#remove zeros
data<-data[!(data[,1]==0),]

rownames(data) = data$voxel

data$voxel = NULL


#######################################################################################
#Create cluster

library(parallel)  # provides detectCores(), makeCluster(), clusterExport(), parRapply()

no_cores <- detectCores() #- 1

clus <- makeCluster(no_cores)

clusterExport(clus, list("data") , envir=environment())

clusterEvalQ(clus,
             compare_strings <- function(j,i) {
               value <- (data[i,]+data[j,])-abs(data[i,]- data[j,])   
               pair <- rbind(rownames(data)[j],rownames(data)[i],value)
               return(pair)
             })

i = 0 # start 0
kk = 1
table <- data.frame()

ptm <- proc.time()

while(kk<nrow(data)) {

  out <-NULL  
  i = i+1 # fix row
  j = c((kk+1):nrow(data)) # rows to be compared

  #Apply the declared function  
  out = matrix(unlist(parRapply(clus,expand.grid(i,j), function(x,y) compare_strings(x[1],x[2]))),ncol=3, byrow = T)

  table <- rbind(table,out)

  kk = kk +1

}

proc.time() - ptm

stopCluster(clus)  # release the worker processes

The result is a data.frame:

v1  v2  v3
1   2   2.70430114250358
1   3   0.199941717684129
... up to 351 rows

but for the full 50x50x50 array this will take days...

I would also like to create a matrix for this correlation:

   1                         2              3...
1  1                  2.70430114250358 
2  2.70430114250358          1
3...

Is there a faster way to do it?

Thanks

DemetriusRPaula
  • Please give us a small [reproducible example](http://stackoverflow.com/a/5963610/1412059) (e.g., with a 3x3x3 array) to work with and show the expected output. If a vectorized solution can't be found (doubtful), you should do this with Rcpp (i.e., do the loop in compiled code). – Roland Sep 25 '15 at 13:41
  • Your current code to generate `data` cannot be run, as `S` is nowhere to be found. – Heroka Sep 25 '15 at 13:42
  • Hi guys, I have edited the post with some more explanation. Thanks – DemetriusRPaula Sep 25 '15 at 16:58

1 Answer


There are a number of performance mistakes in your code:

  1. You loop when you should rely on vectorization.
  2. You grow an object in a loop.
  3. You parallelize each single iteration of the loop instead of parallelizing the outer loop.

You can avoid all these problems if you avoid the first problem.

Apparently, you want to compare each combination of rows. For this you should first get all combinations of row indices:

combs <- t(combn(1:27, 2))

Then you can apply the comparison function to these:

compare <- function(j,i, data) {
  as.vector((data[i,]+data[j,])-abs(data[i,]- data[j,]))
}

res <- data.frame(V1 = combs[,1], V2 = combs[,2], 
                  V3 = compare(combs[,1], combs[,2], data))
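
As a side note, (x+y)-|x-y| is algebraically the same as 2*min(x, y), so the comparison can also be written with pmin on plain vectors. The following is only a sketch to illustrate the identity; the vector `v` and its random values are hypothetical stand-ins for the one-column data frame used above.

```r
# Hypothetical toy values, one per voxel (stand-in for the `data` column)
set.seed(1)
v <- rnorm(27)

# All index pairs with i < j, as in the answer
combs <- t(combn(length(v), 2))
x <- v[combs[, 1]]
y <- v[combs[, 2]]

# The original formula and its closed form, 2 * min(x, y)
via_abs  <- (x + y) - abs(x - y)
via_pmin <- 2 * pmin(x, y)

# Compare the two up to floating-point tolerance
all.equal(via_abs, via_pmin)
```

Working on a numeric vector instead of indexing a data frame row by row may also be somewhat cheaper, although I have not benchmarked it here.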

Now, if we want to check if this gives the same result as your code, we first need to fix your output. By combining characters (the rownames) with numerics in a matrix, you get a character matrix and the columns of your final data.frame are all characters. We can use type.convert to fix that afterwards (although it should be avoided from the beginning):

table[] <- lapply(table, function(x) type.convert(as.character(x), as.is = TRUE))

Now we can see that results are the same:

all.equal(res, table)
#[1] TRUE
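
Before scaling this up, it may be worth checking how many pairs there are, because the count grows quadratically with the number of voxels. This is a rough back-of-the-envelope sketch, assuming one 8-byte double per pair and ignoring the two index columns; the variable names are made up for illustration, and the full-cube figure is an upper bound since zeros are removed first.

```r
n_small <- 27      # 3 x 3 x 3 example
n_full  <- 50^3    # 50 x 50 x 50 cube from the question (before removing zeros)

# Number of unordered pairs, n * (n - 1) / 2
pairs_small <- choose(n_small, 2)
pairs_full  <- choose(n_full, 2)

# Approximate size of the value column alone, in gigabytes
gb_full <- pairs_full * 8 / 1e9

pairs_small
pairs_full
gb_full
```

For the full cube this comes to roughly 62 GB for the values alone, so the approach shown here is only practical for much smaller inputs.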

If you like, you can turn the result into a sparse matrix:

library(Matrix)
m <- sparseMatrix(i = res$V1, j = res$V2, x = res$V3, 
                  dims = c(27, 27), symmetric = TRUE)
diag(m) <- 1
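
For a small number of voxels, an ordinary dense matrix can also be built directly with outer, without generating the pairs first. This is a minimal sketch, assuming the values fit in a plain numeric vector; `v` and `dense` are hypothetical names and the data are random.

```r
# Hypothetical toy values, one per voxel
set.seed(1)
v <- rnorm(27)

# Evaluate the formula for every (i, j) pair at once
dense <- outer(v, v, function(x, y) (x + y) - abs(x - y))

# Set the diagonal to 1, matching the layout asked for in the question
diag(dense) <- 1

dim(dense)
isSymmetric(dense)
```

A dense n x n matrix needs n^2 doubles, so this only makes sense when n is small.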
Roland
  • combs <- t(combn(1:83346, 2)) does not work for the size :( – DemetriusRPaula Sep 27 '15 at 23:08
  • Well, that would be `3,473,236,185` combinations. I believe you should reconsider what you are trying to do, but if you insist on doing this, you can use Rcpp. Of course, you'll need a large RAM or combine Rcpp with one of the packages for out-of-memory data. – Roland Sep 28 '15 at 07:04
  • cppFunction('Rcpp::DataFrame combi2inds(const Rcpp::CharacterVector inputVector){ const int len = inputVector.size(); const int retLen = len * (len-1) / 2; Rcpp::IntegerVector outputVector1(retLen); Rcpp::IntegerVector outputVector2(retLen); int indexSkip; for (int i = 0; i < len; ++i){ indexSkip = len * i - ((i+1) * i)/2; for (int j = 0; j < len-1-i; ++j){ outputVector1(indexSkip+j) = i+1; outputVector2(indexSkip+j) = i+j+1+1;}} return(Rcpp::DataFrame::create(Rcpp::Named("xid") = outputVector1, Rcpp::Named("yid") = outputVector2)); };') – DemetriusRPaula Sep 28 '15 at 14:34
  • d <- data.table(id=as.character(1:85000)) – DemetriusRPaula Sep 28 '15 at 14:34
  • indices <- combi2inds(d$id) – DemetriusRPaula Sep 28 '15 at 14:34
  • Error in .Primitive(".Call")(, inputVector) : negative length vectors are not allowed – DemetriusRPaula Sep 28 '15 at 14:34
  • I am trying to do this: compare <- function(i,j, data) { if((data[i,]*data[j,])>=0){ as.vector((sum(data[i,],data[j,]))-abs(data[i,]-data[j,])) } else{ as.vector((sum(data[i,],data[j,]))-abs(data[i,]+data[j,])) } } Not working :( – DemetriusRPaula Oct 06 '15 at 01:22
  • Check out `help("ifelse")`. – Roland Oct 06 '15 at 05:40
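
To illustrate the ifelse hint: `if` expects a single TRUE/FALSE, and `sum()` collapses its arguments to one number, so neither works element-wise on vectors of pairs. Below is one possible vectorized version. It is only a sketch, assuming the intent was an element-wise x + y rather than a grand total; the function name `compare_signed` and the four sample values are hypothetical.

```r
# Vectorized comparison: the branch is chosen per pair, not once for the whole vector
compare_signed <- function(i, j, data) {
  x <- data[i, 1]
  y <- data[j, 1]
  s <- x + y   # assumption: element-wise sum was intended instead of sum()
  ifelse(x * y >= 0,
         s - abs(x - y),   # same sign (or a zero)
         s - abs(x + y))   # opposite signs
}

# Hypothetical one-column data frame
data <- data.frame(value = c(0.5, -0.2, 0.8, -0.6))
combs <- t(combn(nrow(data), 2))

res <- data.frame(V1 = combs[, 1], V2 = combs[, 2],
                  V3 = compare_signed(combs[, 1], combs[, 2], data))
res
```

Note that ifelse evaluates both branches for all pairs and then picks one per element, which is fine here because both expressions are cheap and have no side effects.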