I am trying to figure out why an apply() call (with %in% and == comparisons inside) becomes terribly slow for very large numbers of rows.
There is a relevant discussion here: apply() is slow - how to make it faster or what are my alternatives? However, for my particular case I suspect that vectorization alone may not solve my problem (am I correct to assume this?).
Below is code to generate representative data for my problem, together with the associated benchmarking results.
set.seed(123)
# Representative data
data <- matrix(rnorm(25000 * 1000), byrow = TRUE, ncol = 1000, nrow = 25000)
tmp_data <- data
# Discretize the data into -2 / 0 / 2
data[tmp_data <= -1] <- -2
data[tmp_data >= 1] <- 2
data[tmp_data > -1 & tmp_data < 1] <- 0
rm(tmp_data)
rownames(data) <- paste("Gene", 1:nrow(data), sep = "_")
colnames(data) <- paste("Sample", 1:ncol(data), sep = "_")
# All pairwise combinations of 2000 sampled rownames
gene_sample <- rownames(data)[sample(1:nrow(data), 2000, replace = FALSE)]
gene_pairs <- t(combn(gene_sample, 2))
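For reference, gene_pairs is a character matrix with one row per gene pair:
dim(gene_pairs)      # 1999000 x 2, i.e. choose(2000, 2) pairs
head(gene_pairs, 2)  # each row holds two "Gene_*" rownames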
# Different row sizes to benchmark
test_size <- c(500, 1000, 5000, 10000, 20000, 50000, 100000)
time_measure <- list()
for (i in seq_along(test_size)) {
  sample_rows <- sample(nrow(gene_pairs), test_size[i], replace = FALSE)
  gene_pairs_sel <- gene_pairs[sample_rows, ]
  proc_time <- system.time(
    # The actual analysis performed within the apply() call:
    # for each row (a pair of genes), calculate a co-occurrence score and a
    # mutual-information-like score for the co-occurrence of 2's, i.e. over
    # the samples in which both genes of the pair are scored 2 in the main data.
    result <- t(apply(gene_pairs_sel, 1, function(y) {
      pat1 <- data[rownames(data) %in% y[1], , drop = FALSE]
      pat1 <- colnames(pat1)[which(pat1 == 2)]
      pat2 <- data[rownames(data) %in% y[2], , drop = FALSE]
      pat2 <- colnames(pat2)[which(pat2 == 2)]
      common_pat <- intersect(pat1, pat2)
      if (length(common_pat) > 0) {
        mis <- round((length(common_pat) / ncol(data)) *
                     log2((length(common_pat) / ncol(data)) /
                          ((length(pat1) / ncol(data)) * (length(pat2) / ncol(data)))), 3)
        co_occur <- round(length(common_pat) / ncol(data), 3)
      } else {
        mis <- 0
        co_occur <- 0
      }
      return(c(sort(as.character(y[c(1, 2)])), co_occur, mis))
    }))
  )
  time_measure[[i]] <- proc_time
}
names(time_measure) <- paste("For_row_size", test_size, sep = "_")
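To make the per-pair computation easier to follow outside of apply(), here is the same logic pulled out into a standalone helper (the name pair_score is mine, purely for illustration). In words: with N = ncol(data), p1 and p2 the fractions of samples in which each gene is scored 2, and p12 the fraction in which both are, then co_occur = p12 and mis = p12 * log2(p12 / (p1 * p2)).
# Illustrative helper: same per-pair scores as inside the apply() call above
pair_score <- function(g1, g2, mat) {
  pat1 <- colnames(mat)[mat[g1, ] == 2]   # samples where gene g1 is scored 2
  pat2 <- colnames(mat)[mat[g2, ] == 2]   # samples where gene g2 is scored 2
  common_pat <- intersect(pat1, pat2)     # samples where both are scored 2
  n <- ncol(mat)
  if (length(common_pat) > 0) {
    co_occur <- round(length(common_pat) / n, 3)
    mis <- round((length(common_pat) / n) *
                 log2((length(common_pat) / n) /
                      ((length(pat1) / n) * (length(pat2) / n))), 3)
  } else {
    co_occur <- 0
    mis <- 0
  }
  c(sort(c(g1, g2)), co_occur, mis)
}
# e.g. pair_score(gene_pairs[1, 1], gene_pairs[1, 2], data)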
## time_measure ##
$For_row_size_500
user system elapsed
2.569 0.000 2.571
$For_row_size_1000
user system elapsed
5.000 0.000 5.001
$For_row_size_5000
user system elapsed
25.498 0.212 25.715
$For_row_size_10000
user system elapsed
50.271 0.124 50.389
$For_row_size_20000
user system elapsed
100.942 0.012 100.956
$For_row_size_50000
user system elapsed
250.760 0.356 251.134
$'For_row_size_1e+05'
user system elapsed
496.655 0.712 497.410
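Dividing the elapsed times by the corresponding row sizes shows that the cost per pair is essentially constant:
sapply(time_measure, function(x) x["elapsed"]) / test_size
# roughly 0.005 s per pair at every size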
As you can see, the computing time grows linearly with the number of rows, at roughly 5 ms per pair.
The row sizes I actually have to deal with are at least 3 times larger than the maximum used here for benchmarking (nrow = 100000), and the whole analysis covers ~500 matrices with large, varying row sizes (> 100000 rows each). At roughly 5 ms per pair, a single matrix with 300,000 rows already takes about 25 minutes, so ~500 such matrices amount to days of computation, which is far too long. Is there any way to speed this up substantially by some manipulation of, or within, apply()?
I was hoping for a solution that does not have to resort to parallelization, as I am not very familiar with R packages like snow or multicore, but I am not averse to using them if need be :-)
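(For what it's worth, my understanding is that the parallel route would look roughly like the untested sketch below, using the parallel package that ships with R; mc.cores = 4 is an arbitrary choice and pair_score is the illustrative helper shown above.)
library(parallel)
# Untested sketch: score the pairs across 4 cores instead of a serial apply()
# (mclapply forks, so this applies to Linux/macOS)
result_list <- mclapply(seq_len(nrow(gene_pairs_sel)), function(i) {
  pair_score(gene_pairs_sel[i, 1], gene_pairs_sel[i, 2], data)
}, mc.cores = 4)
result <- do.call(rbind, result_list)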
Help is much appreciated!!
Regards
Ashwin