
I have a data.table with two columns of genes, where each row is treated as a pair. Some gene pairs are duplicated with the order reversed. I am looking for a faster way to keep only the unique pairs in my table, preferably without a loop like the one I've provided.

library(data.table)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))

revG <- genes[,.(geneA = geneB, geneB = geneA)]
d <- fintersect(genes, revG)

for (x in 1:nrow(d)) {
  entry <- d[,c(geneA[x], geneB[x])]; revEntry <- rev(entry)
  dupEntry <- d[geneA %chin% revEntry[1] & geneB %chin% revEntry[2]]
  if (nrow(dupEntry) > 0) {
    d <- d[!(geneA %chin% dupEntry[,geneA] & geneB %chin% dupEntry[,geneB])]
  }
}

The table d contains the pairs that also appear in reversed order. After the loop, one copy of each pair remains. I then take a subset of the original genes table, excluding the copies left in d, and store the row indices. I also have a list whose names match the first column of genes. The index is used to filter that list according to the duplicate pairs that were removed by the loop.

idx <- genes[!(geneA %chin% d[,geneA] & geneB %chin% d[,geneB]), which = TRUE]

geneList <- vector("list", length = nrow(genes)); names(geneList) <- genes[,geneA]
geneList <- geneList[idx]
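For reference, the index can also be built without a loop. This is a minimal sketch, not the method above: it assumes that it is acceptable to keep the first occurrence of each pair, and the column names `first` and `second` are made up for illustration. It relies on `pmin()`/`pmax()` working element-wise on character vectors, so each pair is put into alphabetical order before duplicates are flagged.

```r
library(data.table)

genes <- data.table(geneA = LETTERS[1:10],
                    geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))

# Canonical form of each pair: alphabetically smaller gene first,
# so that D-E and E-D become the same row.
canon <- genes[, .(first = pmin(geneA, geneB), second = pmax(geneA, geneB))]

# Row numbers of genes that are the first occurrence of their pair
idx <- which(!duplicated(canon))

# Same list construction as above, filtered by the new index
geneList <- vector("list", length = nrow(genes))
names(geneList) <- genes[, geneA]
geneList <- geneList[idx]
```

Note that this keeps the first copy of each reversed pair, which may not be the same copy the loop keeps. Passing `fromLast = TRUE` to `duplicated()` would keep the last occurrence instead.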

The above method isn't necessarily too slow, but I plan on using ~12K genes, so the slowdown might become noticeable then. I found a question about the same problem that doesn't use data.table. It uses an apply function to get the job done, but that might also be slow with larger inputs.

TylerH
abbas786
    You can try `unique(d[,list(geneA=do.call(pmin,d),geneB=do.call(pmax,d))])`, but it works only if you have two columns (it should be ok for you). – nicola Mar 15 '17 at 06:19
    You can apply the method of @nicola also directly to `genes`. It will give the same result and you don't need to create `revG` and `d`. – Jaap Mar 15 '17 at 06:56
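The suggestion in these comments, applied directly to `genes`, might look like the sketch below. It is only an illustration and assumes exactly two character columns; the name `uniquePairs` is made up here.

```r
library(data.table)

genes <- data.table(geneA = LETTERS[1:10],
                    geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))

# Sort each pair alphabetically within its row, then drop repeated rows.
# pmin/pmax compare the two columns element by element.
uniquePairs <- unique(genes[, .(geneA = pmin(geneA, geneB),
                                geneB = pmax(geneA, geneB))])
print(uniquePairs)
```

Because every pair is stored with the alphabetically smaller gene first, a reversed duplicate can no longer survive `unique()`. The trade-off is that the original orientation of a pair is lost.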

1 Answer


I believe what you are asking is similar to this: given a list of ordered pairs (permutations of 2), how can I get the unordered pairs (combinations)? One option is to use igraph.

library(data.table)
library(igraph)
genes <- data.table(geneA = LETTERS[1:10], geneB = c("C", "G", "B", "E", "D", "I", "H", "J", "F", "A"))
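# Treat each pair as an undirected edge, so A-B and B-A are the same edge
g <- graph_from_data_frame(genes, directed = FALSE)
# Collapse repeated edges (remove.loops also drops self-pairs such as A-A)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
get.data.frame(g)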
  from to
1    A  C
2    A  J
3    B  C
4    B  G
5    D  E
6    F  I
7    G  H
8    H  J

# benchmark
library(microbenchmark)
set.seed(1283782)
fn1 <- function(genes) {
  g <- graph_from_data_frame(genes, directed = FALSE)
  g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
  get.data.frame(g)
}
genes <- data.table(geneA = sample(LETTERS, 20000, TRUE), geneB = sample(LETTERS, 20000, TRUE))
microbenchmark(fn1(genes), times = 1)
       expr      min       lq     mean   median       uq      max neval
 fn1(genes) 8.605717 8.605717 8.605717 8.605717 8.605717 8.605717     1
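For comparison, a data.table-only function could be run on the same benchmark input. This is a rough sketch under stated assumptions: `fn2` is a hypothetical helper that is not part of the benchmark above, and no timings are claimed for it. Since `simplify(..., remove.loops = TRUE)` drops self-pairs, the sketch filters out rows where both genes are equal so that the two approaches should describe the same set of unordered pairs, although the from/to orientation of a given pair may differ.

```r
library(data.table)

# Hypothetical helper: drop self-pairs, order each pair alphabetically,
# then keep one row per unordered pair.
fn2 <- function(genes) {
  res <- genes[geneA != geneB,
               .(geneA = pmin(geneA, geneB), geneB = pmax(geneA, geneB))]
  unique(res)
}

set.seed(1283782)
genes <- data.table(geneA = sample(LETTERS, 20000, TRUE),
                    geneB = sample(LETTERS, 20000, TRUE))
res <- fn2(genes)
```

It could be timed the same way, for example with `microbenchmark(fn1(genes), fn2(genes), times = 1)`.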
Mario GS