
data.table is very useful, but I could not find an elegant way to solve the following problem. There are some close answers out there, but none solved it. Let's say the below is the data.table object and I want to filter out duplicate rows based on the gene pairs (Gene1 and Gene2), treating the pair as unordered — i.e. (EGFR, CD4) and (CD4, EGFR) count as the same pair.

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215
3:   CD4      EGFR      ENSG000000129.12      ENSG000000129.11     0.9940735
4:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215 

If there are such duplicates with respect to Gene1 and Gene2, then I want to get this:

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215

Standard looping code is very slow over millions of rows. Is there an elegant and fast way of doing this in data.table?

    Have you seen https://stackoverflow.com/a/25151395/496803 ? It seems to be exactly what you are trying to do. – thelatemail Oct 26 '17 at 00:00
  • Just a suggestion, as I don't know how long it is taking you, but maybe you could create a new column concatenating Gene1 and Gene2 and then remove the duplicates? – RomRom Oct 26 '17 at 00:25

1 Answer


The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:

dt[!duplicated(data.table(pmin(Gene1, Gene2), pmax(Gene1, Gene2)))]

#   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
#1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
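A minimal self-contained sketch of the idea (building a toy data.table from the question's rows): pmin/pmax take the alphabetically first and last gene of each pair, so reversed pairs produce the same key and duplicated() can flag them.

```r
library(data.table)

dt <- data.table(
  Gene1 = c("FOXA1", "EGFR", "CD4", "EGFR"),
  Gene2 = c("MYC", "CD4", "EGFR", "CD4"),
  CORR  = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)

# (EGFR, CD4) and (CD4, EGFR) both map to the key (CD4, EGFR),
# so rows 3 and 4 are flagged as duplicates of row 2
dt[!duplicated(data.table(pmin(Gene1, Gene2), pmax(Gene1, Gene2)))]
```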

If you have more than two key columns to dedupe by, you are probably best off melting to long format, sorting the values within each row, casting back to wide format, and then de-duplicating. Like so:

dupvars <- c("Gene1","Gene2")

# melt the key columns to long format, rank the values within each row,
# cast back to wide so every row's keys end up in a common sorted order,
# then keep only the first occurrence of each sorted key combination
sel <- !duplicated(
  dcast(
      melt(dt[, c(.SD, id = .(.I)), .SDcols = dupvars], id.vars = "id")[
          order(id, value), grp := seq_len(.N), by = id],
      id ~ grp
  )[, -1])
dt[sel, ]
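A worked sketch of the long-to-wide approach with three key columns (the column names A, B, C and the values are invented for illustration): every row holds the same three labels in different orders, so only the first row should survive.

```r
library(data.table)

# toy data: three key columns, each row a permutation of x, y, z
dt <- data.table(
  A   = c("x", "z", "y"),
  B   = c("y", "x", "x"),
  C   = c("z", "y", "z"),
  val = 1:3
)
dupvars <- c("A", "B", "C")

# melt to long, rank values within each row id, cast back to wide:
# each row's keys are now in sorted order, so duplicated() can compare them
sel <- !duplicated(
  dcast(
      melt(dt[, c(.SD, id = .(.I)), .SDcols = dupvars], id.vars = "id")[
          order(id, value), grp := seq_len(.N), by = id],
      id ~ grp
  )[, -1])
dt[sel, ]
```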
    In case someone uses this answer, be careful and consider the following correction. It seemed to work fine at first, but I came across incorrect results in another analysis, even though no error was raised. Somehow the types of Gene1 and Gene2 turned out to be integer, and that caused the incorrect unification. I solved it by forcing them to character format, as in dt$Gene1 <- as.character(dt$Gene1) (and the same for the Gene2 column), which fixed the problem. – entropy Mar 15 '18 at 18:23
  • @entropy thanks for your comment, it was a life-saver. I had the same question as you (though not about genomics), and the accepted answer would yield very strange results if my `IDs` were numeric. I had to convert them to characters for the answer to work properly. Thanks a lot again. Perhaps @thelatemail could add a note in the answer body about this issue? – thiagoveloso Mar 13 '20 at 01:13
    @thiagoveloso - that would be an issue with floating point errors I presume, i.e.: `1` and *very close to* `1` might cause weird issues with deduplication. `duplicated(c(1, 1.000000000000001))` for example. I don't really understand why it would fail with integers however as `duplicated(c(1L, 1L))` should always work. I will think about what I can add to the answer to clarify this issue. – thelatemail Mar 13 '20 at 03:48
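The defensive conversion described in the comments can be sketched as follows (storing the gene columns as factors here is an assumption, to mimic a scenario where the keys are not plain character vectors):

```r
library(data.table)

# same toy pairs as in the question, but the key columns arrive
# as factors rather than character vectors (assumed for illustration)
dt <- data.table(
  Gene1 = factor(c("FOXA1", "EGFR", "CD4", "EGFR")),
  Gene2 = factor(c("MYC", "CD4", "EGFR", "CD4")),
  CORR  = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)

# force the key columns to character before building the pmin/pmax key,
# so the genes are compared lexicographically by name
dt[, c("Gene1", "Gene2") := lapply(.SD, as.character),
   .SDcols = c("Gene1", "Gene2")]
dt[!duplicated(data.table(pmin(Gene1, Gene2), pmax(Gene1, Gene2)))]
```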