1

I have a named list that represents a collection of biological pathways, where names are pathway names, and vectors in the list are the proteins that belong to that pathway. A small example is:

ann <- structure(list(`GO:0000010` = c("Q33DR2", "Q9CZQ1", "D6RHT8", 
"F6ZCX7", "B8JJX0", "Q33DR3", "F6T4Z4", "E0CYM9"), `GO:0000016` = c("Q5XLR9", 
"Q3TZ78", "F8VPT3"), `GO:0000026` = c("Q8BTP0", "Q3TZM9", "A0A077K846", 
"F6R220", "A0A077K9W9"), `GO:0000032` = c("Q924M7", "Q3V100", 
"F6Q3K8", "Q921Z9"), `GO:0000033` = c("Q9DBE8", "F6RBY3", "Q8BMZ4", 
"Q8K2A8", "F6XUH0", "D6RCW8", "Q6P8H8", "Q3URN2")), .Names = c("GO:0000010", 
"GO:0000016", "GO:0000026", "GO:0000032", "GO:0000033"))

I am interested in pairs of pathways:

pairs <- t(combn(names(ann), 2))

For each pair of pathways, I want to get all possible combinations of proteins where protein #1 is in pathway #1, and protein #2 is in pathway #2. The desired output is a list of two-column matrices, where column #1 contains proteins in pathway #1 and column #2 contains proteins in pathway #2. So far, I have this:

protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]])))

However, because the total number of pairs I'm interested in is quite large (typically >1,000), mapping expand.grid over all possible pairs takes a very long time - on the order of hours.

Is there a faster way to get all possible combinations of proteins in each pair of biological pathway from this list?

2 Answers2

1

I reckon rep.int() works much faster as stated in this other question:

Try the following:

expand.grid.jc <- function(seq1,seq2) {
  cbind(Var1 = rep.int(seq1, length(seq2)), 
        Var2 = rep.int(seq2, rep.int(length(seq1),length(seq2))))
}
protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]])))

Cheers!,

Carles
  • 2,731
  • 14
  • 25
1

If you are looking for speed, you can easily role out an Rcpp version:

// [[Rcpp::export]]
CharacterMatrix fast2Expand(CharacterVector x, CharacterVector y) {

    unsigned long int lenX = x.size(), lenY = y.size();
    CharacterMatrix result = no_init_matrix(lenX * lenY, 2);

    for (std::size_t i = 0, count = 0; i < lenY; ++i) {
        for (std::size_t j = 0; j < lenX; ++j, ++count){
            result(count, 0) = x[j];
            result(count, 1) = y[i];
        }
    }

    return result;
}

It is about 10x faster than the original and 20% than the rep.int version (for this example):

microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]]))),
               Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(ann[[.x]], ann[[.y]])),
               repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]]))))
Unit: microseconds
  expr      min        lq      mean    median        uq      max neval
    OP 1104.700 1136.4370 1536.4048 1188.9990 1481.4940 6730.960   100
  Rcpp  105.505  126.9975  149.9009  138.1195  150.2015  663.146   100
repInt  133.044  151.0175  223.9815  165.5435  203.5335 1269.194   100

Here is a contrived example based off of the OP's example meant purely for comparing efficiency:

annBig <- lapply(1:5, function(x) rep(ann[[x]], 100))
names(annBig) <- names(ann)

microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(annBig[[.x]], annBig[[.y]]))),
               Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(annBig[[.x]], annBig[[.y]])),
               repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(annBig[[.x]], annBig[[.y]]))), times = 20)
Unit: milliseconds
  expr       min        lq      mean    median       uq      max neval
    OP 522.56536 533.39393 562.60750 555.45345 588.4514 640.8584    20
  Rcpp  48.12683  56.17155  92.30095  92.23838 125.8065 142.2949    20
repInt  80.28625 107.32329 140.32793 152.13732 160.9656 193.1310    20
Joseph Wood
  • 7,077
  • 2
  • 30
  • 65