I have a named list that represents a collection of biological pathways: each name is a pathway identifier, and each element is a character vector of the proteins that belong to that pathway. A small example is:
ann <- structure(list(`GO:0000010` = c("Q33DR2", "Q9CZQ1", "D6RHT8",
"F6ZCX7", "B8JJX0", "Q33DR3", "F6T4Z4", "E0CYM9"), `GO:0000016` = c("Q5XLR9",
"Q3TZ78", "F8VPT3"), `GO:0000026` = c("Q8BTP0", "Q3TZM9", "A0A077K846",
"F6R220", "A0A077K9W9"), `GO:0000032` = c("Q924M7", "Q3V100",
"F6Q3K8", "Q921Z9"), `GO:0000033` = c("Q9DBE8", "F6RBY3", "Q8BMZ4",
"Q8K2A8", "F6XUH0", "D6RCW8", "Q6P8H8", "Q3URN2")), .Names = c("GO:0000010",
"GO:0000016", "GO:0000026", "GO:0000032", "GO:0000033"))
I am interested in pairs of pathways:
pairs <- t(combn(names(ann), 2))
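To make the shape of this matrix concrete, here is a small self-contained check. It uses the five pathway names from the example above; `pathway_names` and `pairs_demo` are names introduced only for this illustration.

```r
# The five pathway identifiers from the example list, typed out so the
# snippet runs on its own (stand-in for names(ann))
pathway_names <- c("GO:0000010", "GO:0000016", "GO:0000026",
                   "GO:0000032", "GO:0000033")

# combn() returns a 2 x choose(n, 2) matrix; t() puts one pair per row
pairs_demo <- t(combn(pathway_names, 2))

dim(pairs_demo)   # 10 rows (choose(5, 2)) and 2 columns
pairs_demo[1, ]   # first pair: "GO:0000010" "GO:0000016"
```

So each row holds one unordered pair of pathway names, and the number of rows grows quadratically with the number of pathways.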
For each pair of pathways, I want to get all possible combinations of proteins where protein #1 is in pathway #1, and protein #2 is in pathway #2. The desired output is a list of two-column matrices, where column #1 contains proteins in pathway #1 and column #2 contains proteins in pathway #2. So far, I have this:
protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]])))
However, because the total number of pairs I'm interested in is quite large (typically >1,000), mapping expand.grid over all of them takes a very long time, on the order of hours.
Is there a faster way to get all possible combinations of proteins for each pair of biological pathways in this list?
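One direction that might be worth benchmarking, as a minimal sketch only: build the same cross product with `rep()` and `cbind()`, which skips the intermediate data frame that expand.grid creates. `cross_pairs` and `toy` are hypothetical names made up for this illustration, and whether this is meaningfully faster at the real data size is an assumption that has not been measured here.

```r
# Hypothetical helper (not from any package): every combination of an
# element of x with an element of y, as a two-column character matrix.
# Row order matches expand.grid(x, y): x varies fastest.
cross_pairs <- function(x, y) {
  cbind(rep(x, times = length(y)), rep(y, each = length(x)))
}

# Toy data standing in for two pathways
toy <- list(P1 = c("a", "b", "c"), P2 = c("x", "y"))

m <- cross_pairs(toy$P1, toy$P2)

# Check against the expand.grid approach on the same input
ref <- as.matrix(expand.grid(toy$P1, toy$P2, stringsAsFactors = FALSE))
stopifnot(all(m == unname(ref)))
```

With the real data, the helper could be dropped into the existing call in place of the `as.matrix(expand.grid(...))` expression; the only difference in the result is that the matrix has no `Var1`/`Var2` column names.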