I have a performance problem removing duplicates of the form A-B and B-A from a data.table (118 million rows) that is the result of a covariance calculation.
The result of the covariance calculation is a data.table of 118,000,000 rows (2.8 GB) with the columns key1, key2 and val. There are cases with rows A-B and B-A which are duplicates, since val is the same for both. I would like to remove those duplicates. I found some solutions (see code below), but both the execution time and the RAM needed are too high, and the execution runs into an error.
Example:
library(data.table)
# create test dataset
size <- 118000000
key1 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
key2 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
val <- runif(size, 0.0, 5.0)
# create data table
dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)
# order data row wise in order to identify duplicates
# (apply() coerces the data.table to a character matrix, which needs a lot of RAM at this size)
#system.time(dt_sorted <- t(apply(dt, 1, sort)))
# remove duplicates
#system.time(dt_no_duplicates <- dt_sorted[!duplicated(dt_sorted), ])
The solution of ordering the data row-wise and eliminating duplicates works with smaller data sets (see Select equivalent rows [A-B & B-A]), but how can I handle such a big or even bigger data set?
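One direction I have considered, but not yet verified on the full 118 million rows, is to build a canonical (sorted) key pair column-wise with pmin/pmax instead of the row-wise apply, so the table never has to be converted to a character matrix. A minimal sketch on the test data above (k_lo and k_hi are just illustrative helper column names):

# helper columns holding the key pair in canonical order (temporary, illustrative names)
dt[, `:=`(k_lo = pmin(key1, key2), k_hi = pmax(key1, key2))]
# keep only the first occurrence of each unordered key pair
dt_no_duplicates <- unique(dt, by = c("k_lo", "k_hi"))
# drop the helper columns again
dt_no_duplicates[, c("k_lo", "k_hi") := NULL]

Would this scale to the full data set, or is there a more memory-efficient way?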