
I am trying to figure out why apply() (with multiple %in% and == operations inside the applied function) gets terribly slow for very large numbers of rows.

A relevant discussion can be found in apply() is slow - how to make it faster or what are my alternatives? However, for my particular case, I believe speeding things up by vectorization might not solve my problem (am I correct to assume this?).

I am adding code to generate representative data for my problem, along with the associated benchmarking results.

set.seed(123)

# Representative data
data     <- matrix(rnorm(25000 * 1000), byrow = TRUE, ncol = 1000, nrow = 25000)
tmp_data <- data

# Discretizing the data
data[tmp_data <= -1] <- -2
data[tmp_data >=  1] <-  2
data[tmp_data > -1 & tmp_data < 1] <- 0

rm(tmp_data)

rownames(data) <- paste("Gene",c(1:nrow(data)),sep="_")
colnames(data) <- paste("Sample",c(1:ncol(data)),sep="_")

# Pair combination of any 2000 sampled rownames
gene_sample <- rownames(data)[sample(nrow(data), 2000, replace = FALSE)]
gene_pairs  <- t(combn(gene_sample,2))

# Different size of rows to be generated for speed testing
test_size = c(500, 1000, 5000, 10000, 20000, 50000, 100000)
time_measure <- list()

for (i in seq_along(test_size))
{
  sample_rows    <- sample(nrow(gene_pairs), test_size[i], replace = FALSE)
  gene_pairs_sel <- gene_pairs[sample_rows,]

  proc_time <- system.time(

      # The actual analysis I perform within the apply function.
      # The aim is to calculate a co-occurrence score and something like a mutual
      # information score for the co-occurrence of 2's in the given gene pairs
      # (i.e. per row of gene_pairs_sel) from the main data. Simply put: find all
      # the samples in which both genes of a pair have the value 2.
  result <- t(apply(gene_pairs_sel,1,function(y){
                        pat1 <- data[rownames(data) %in% y[1],,drop=F]
                        pat1 <- colnames(pat1[,which(pat1 == 2)])

                        pat2 <- data[rownames(data) %in% y[2],,drop=F]
                        pat2 <- colnames(pat2[,which(pat2 == 2)])

                        common_pat <- intersect(pat1,pat2)
                        if(length(common_pat)>0)
                        {
                          mis      <- round((length(common_pat)/ncol(data)) * log2((length(common_pat)/ncol(data))/((length(pat1)/ncol(data)) * (length(pat2)/ncol(data)))),3)
                          co_occur <- round(length(common_pat)/ncol(data),3)
                        }else{mis=0;co_occur=0}

                          return(c(sort(as.character(y[c(1,2)])),co_occur,mis))

                        }))
              )

time_measure[[i]] <- proc_time 
}
names(time_measure) <- paste("For_row_size", test_size, sep="_")

## time_measure ##

$For_row_size_500
  user  system elapsed 
  2.569   0.000   2.571 

$For_row_size_1000
  user  system elapsed 
  5.000   0.000   5.001 

$For_row_size_5000
  user  system elapsed 
25.498   0.212  25.715 

$For_row_size_10000
  user  system elapsed 
50.271   0.124  50.389 

$For_row_size_20000
  user  system elapsed 
100.942   0.012 100.956 

$For_row_size_50000
  user  system elapsed 
250.760   0.356 251.134 

$'For_row_size_1e+05'
  user  system elapsed 
496.655   0.712 497.410 

As you can all see, the computing time increases quite exponentially with increasing row size!

The row sizes I am dealing with are at least 3 times bigger than the maximum size used here for benchmarking (i.e. nrow = 100000). The whole analysis, with ~500 matrices of large, varying row sizes (> 100000), is taking me far too long to compute. Is there any way to speed this up substantially by some manipulation of (or within) apply()?

I was hoping for a solution that does not resort to parallelization, as I am not too familiar with R packages like snow or multicore, but I am not averse to using them if need be :-)

Help is much appreciated !!

Regards
Ashwin

  • How do you get "exponentially" increasing time? It looks perfectly linear: 500 rows -> 2.5 seconds; 100k rows -> 500 seconds. – Hong Ooi Mar 03 '14 at 13:34
  • I wonder how that `t` influences the timing, too. – Roman Luštrik Mar 03 '14 at 13:35
  • @Hong sorry maybe the choice of the word exponential was quite misleading or rather outright wrong ... However would still like to know if I could do something within my apply() to improve the speed !! – Ashwin Mar 03 '14 at 13:42
  • It would be nice if you could describe what the code is doing. The best speed-up you can usually achieve by using a different algorithm or specialized R functions that are implemented in C. However, apart from that, you should use `Rprof` to profile your code and find out where most of the time is spent. – Roland Mar 03 '14 at 13:51
  • @RomanLuštrik Tried repeating the same thing without transpose 't' , found no significant changes in time .. 500 rows = 2.759 sec; 10K rows = 51.323 sec; 100k = 506.326sec – Ashwin Mar 03 '14 at 14:12
  • @Roland Have added more comment before my apply function hope this clarifies the code more..Also will definitely try Rprof – Ashwin Mar 03 '14 at 14:13
  • I don't have time to dig into this, but I believe you should use the `FUN` argument of `combn` instead of using `apply`. However, I believe this would still be O(n), but might be a bit faster. You should also turn your `data` matrix into a logical matrix (`data==2L`) instead of repeatedly searching for 2s. PS: I don't think you should return character vectors from the function. – Roland Mar 03 '14 at 14:34
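
A minimal sketch of that last suggestion (precompute the logical matrix once and let combn() call a scoring function on each pair), reusing data and gene_sample from the question; score_pair is just an illustrative name:

# Logical matrix computed once: TRUE where a gene has the value 2
data2 <- data == 2L

# Illustrative per-pair scoring function (same statistics as in the question)
score_pair <- function(y) {
  p1  <- data2[y[1], ]
  p2  <- data2[y[2], ]
  n12 <- sum(p1 & p2)                 # samples where both genes are 2
  co_occur <- n12 / ncol(data2)
  mis <- 0
  if (n12 > 0)
    mis <- co_occur * log2(co_occur / ((sum(p1) / ncol(data2)) * (sum(p2) / ncol(data2))))
  c(co_occur = co_occur, mis = mis)
}

# combn() applies FUN to each pair of gene names; t() gives one row per pair
result <- t(combn(gene_sample, 2, FUN = score_pair))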

2 Answers


For some data frame of gene pairs

sample_rows    <- sample(nrow(gene_pairs),test_size[i],replace=FALSE)
df <- data.frame(gene1=gene_pairs[sample_rows, 1],
                 gene2=gene_pairs[sample_rows, 2],
                 stringsAsFactors=FALSE)

The focus is on data values equal to 2, so let's get that out of the way

data2 = data == 2

We need the number of samples in which gene 1 and gene 2 each have the value 2

df$n1 <- rowSums(data2[df$gene1,])
df$n2 <- rowSums(data2[df$gene2,])

and the number of times genes 1 and 2 co-occur

df$n12 <- rowSums(data2[df$gene1,] & data2[df$gene2,])

The statistics are then

df$co_occur <- df$n12 / ncol(data)
tmp <- df$n1 * df$n2 / (ncol(data) * ncol(data))
df$mis <- df$co_occur * log2(df$co_occur / tmp)

There is no need for an explicit loop. As a slightly modified function we might have

cooccur <- function(data, gene1, gene2) {
    data <- data == 2
    x1 <- rowSums(data)[gene1] / ncol(data)
    x2 <- rowSums(data)[gene2] / ncol(data)
    x12 <- rowSums(data[gene1,] & data[gene2,]) / ncol(data)   # co-occurrence frequency
    data.frame(gene1=gene1, gene2=gene2,
               co_occur=x12, mis=x12 * log2(x12 / (x1 * x2)))
}

If there are very many rows in df, then it would make sense to process them in groups of, say, 500,000. This still scales linearly, but is about 25x faster than the original implementation (e.g., about 3 s for 10,000 rows). There are probably significant further space/time speed-ups to be had, particularly by treating the data matrix as sparse. No guarantees that I've accurately parsed the original code.
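
A rough sketch of that chunked processing, assuming the df and the cooccur() function defined above:

# Process the gene pairs in chunks of, say, 500,000 rows and bind the results
chunk_size <- 500000
chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))
res <- do.call(rbind, lapply(chunks, function(i) cooccur(data, df$gene1[i], df$gene2[i])))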

This can be optimized a little by looking up the character-based row index once and using the integer index instead, i1 <- match(gene1, rownames(data)), etc., but the main memory and speed limitation is the calculation of x12. It's relatively easy to implement this in C, using the inline package. We might as well go for broke and use multiple cores, if available:

library(inline)
xprod <- cfunction(c(data="logical", i1="integer", i2="integer"), "
    const int n = Rf_length(i1),
        nrow = INTEGER(Rf_getAttrib(data, R_DimSymbol))[0],
        ncol = INTEGER(Rf_getAttrib(data, R_DimSymbol))[1];
    const int *d = LOGICAL(data),
        *row1 = INTEGER(i1),
        *row2 = INTEGER(i2);
    SEXP result = PROTECT(Rf_allocVector(INTSXP, n));
    memset(INTEGER(result), 0, sizeof(int) * n);
    int *sum = INTEGER(result);
    for (int j = 0; j < ncol; ++j) {
        /* row1/row2 hold 1-based R row indices, hence the -1 offset */
        const int j0 = j * nrow - 1;
#pragma omp parallel for
        for (int i = 0; i < n; ++i)
            sum[i] += d[j0 + row1[i]] * d[j0 + row2[i]];
    }
    UNPROTECT(1);
    return result;
", cxxargs="-fopenmp -O3", libargs="-lgomp") 

A more optimized version is then

cooccur <- function(data, gene1, gene2) {
    data <- (data == 2)[rownames(data) %in% c(gene1, gene2), , drop=FALSE]
    n2 <- ncol(data)^2
    i1 <- match(gene1, rownames(data))
    i2 <- match(gene2, rownames(data))
    x <- rowSums(data)
    x_12 <- x[i1] * x[i2] / n2
    x12 <- xprod(data, i1, i2) / ncol(data)   # co-occurrence frequency
    data.frame(gene1=gene1, gene2=gene2,
               co_occur=x12, mis=x12 * log2(x12 / x_12))
}

This handles 1,000,000 gene pairs in about 2 s for me. It still scales linearly with the number of gene pairs; the OpenMP parallel evaluation isn't supported under the clang compiler, and this seems like one of those relatively rare situations where my code, on my processor, benefited substantially from rearrangement to localize data access.
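
For example, a hypothetical call using the gene_pairs matrix built in the question would be:

res <- cooccur(data, gene_pairs[, 1], gene_pairs[, 2])
head(res)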

Martin Morgan
  • Thanks this surely helps, it definitely makes the process faster, overall the whole thing still takes a long time but since this is a one time analysis I would not mind much :-) – Ashwin Mar 04 '14 at 11:06

Here is what I see when I run Rprof on your code: half the time is spent in the "%in%" function. The output below shows the hierarchy of the function calls. The run took 23 seconds, and essentially all of that time was within "FUN", which is called from "apply".

  0  23.2 root
  1.   23.2 "source"
  2. .   23.2 "withVisible"
  3. . .   23.2 "eval"
  4. . . .   23.2 "eval"
  5. . . . .   23.2 "system.time"
  6. . . . . .   23.2 "t"
  7. . . . . . .   23.2 "apply"
  8. . . . . . . .   23.1 "FUN"
  9. . . . . . . . .   11.7 "%in%"   ##half the time is here
 10. . . . . . . . . .   10.9 "match"
 11. . . . . . . . . . .    0.0 "rownames"
  9. . . . . . . . .    0.5 "colnames"
 10. . . . . . . . . .    0.4 "is.data.frame"
 11. . . . . . . . . . .    0.3 "which"
 12. . . . . . . . . . . .    0.2 "=="
 10. . . . . . . . . .    0.0 "NCOL"
  9. . . . . . . . .    0.3 "intersect"
 10. . . . . . . . . .    0.3 "unique"
 11. . . . . . . . . . .    0.0 "unique.default"
 11. . . . . . . . . . .    0.0 "match"
 10. . . . . . . . . .    0.0 "as.vector"
  9. . . . . . . . .    0.3 "sort"
 10. . . . . . . . . .    0.1 "sort.default"
 11. . . . . . . . . . .    0.1 "sort.int"
 12. . . . . . . . . . . .    0.0 "any"
 12. . . . . . . . . . . .    0.0 "is.na"
  9. . . . . . . . .    0.1 "c"
  6. . . . . .    0.0 "gc"
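
For reference, a minimal sketch of collecting such a profile with base R's Rprof (the tree-style display above may have been produced with a different summary tool):

Rprof("apply_profile.out")                            # start the profiler
result <- t(apply(gene_pairs_sel, 1, function(y) y))  # placeholder; substitute the real per-pair function
Rprof(NULL)                                           # stop the profiler
summaryRprof("apply_profile.out")$by.self             # time spent in each function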
  • I suspected either the %in% or == to be the culprit, Rprof surely helps !! I think its worth keeping this in mind while using %in% in large row sized matrices or df. – Ashwin Mar 04 '14 at 11:09