
I have a performance problem removing duplicates of the form A-B and B-A from a data.table (118 million rows) that is the result of a covariance calculation.

The result of the covariance calculation is a data.table with 118,000,000 rows (2.8 GB) and three columns (in the example below: key1, key2, val). There are cases with rows A-B and B-A which are duplicates, since val is the same in both cases. I would like to remove those duplicates. I found some solutions (see code below), but both the execution time and the RAM needed are too high, and the execution runs into an error.
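
To make the duplicate pattern concrete, here is a tiny made-up illustration (the values are invented, not from the real data): rows 1 and 2 describe the same pair A-B and carry the same val, so one of them should be removed.

library(data.table)

# illustrative only: rows 1 and 2 are the same pair in different order
illustration <- data.table(
  key1 = c("A", "B", "A"),
  key2 = c("B", "A", "C"),
  val  = c(0.42, 0.42, 0.13)
)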

Example:

library(data.table)

# create test dataset
size <- 118000000
key1 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
key2 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
val <- runif(size, 0.0, 5.0)

# create data table
dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)

# order data row-wise in order to identify duplicates
# (note: apply() coerces the whole table to a character matrix and copies it)
#system.time(dt_sorted <- t(apply(dt, 1, sort)))

# remove duplicates
#system.time(dt_no_duplicates <- dt_sorted[!duplicated(dt_sorted), ])

The solution that orders the data row-wise and then eliminates the duplicates works with smaller data sets (see Select equivalent rows [A-B & B-A]), but how can I handle such a big, or even bigger, data set?
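
One direction that might scale better, sketched below without having benchmarked it on the full 118 million rows, is to stay inside data.table and canonicalise each pair with pmin()/pmax() instead of apply(..., 1, sort), so no intermediate character matrix is created:

library(data.table)

# sketch only, not benchmarked at this scale:
# put the lexicographically smaller key first, then keep one row per pair/value
dt[, c("k1", "k2") := list(pmin(key1, key2), pmax(key1, key2))]
dt_no_duplicates <- unique(dt, by = c("k1", "k2", "val"))
dt_no_duplicates[, c("k1", "k2") := NULL]   # drop the helper columns

Memory-wise this only adds two character columns instead of a full character-matrix copy of the whole table; whether it fits in RAM at this size I cannot say.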

  • You already found the duplicate, of which there are many others as well (https://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe, https://stackoverflow.com/questions/29170099/remove-duplicate-column-pairs-sort-rows-based-on-2-columns, https://stackoverflow.com/questions/54447590/find-unique-pairs-of-words-ignoring-their-order-in-two-columns-in-r, etc). I don't think there's any reason that people are saving a special "fast" answer from those questions. I'm not sure what more you want. Did you try any of the methods that use `igraph`? – MrFlick Apr 16 '19 at 14:54
  • I'd be wary of any of the `igraph` solutions with this many rows. Or `apply`. Keep it in `data.table`. I don't see the answer I'd want in any of those suggestions, so I'll answer here. – Gregor Thomas Apr 16 '19 at 14:58
  • Found a `data.table`-specific one that I think makes it a suitable dupe. And eddi's answer there is better. – Gregor Thomas Apr 16 '19 at 15:04
  • I suspect the example data is not a good approximation to OP's real data, but anyway, a 10-20x speedup over eddi's solution is possible with it https://chat.stackoverflow.com/transcript/message/45952489#45952489 @Gregor Looks like OP should have over 10k distinct values for A & B rather than 26, in which case eddi's would probably be faster, though. – Frank Apr 16 '19 at 18:45
  • 1
    @Frank might as well add it as an answer at the dupe. – Gregor Thomas Apr 16 '19 at 18:50

0 Answers