I want to merge two data frames (`data1` and `data2`). Both initially contain around 35 million observations (around 2 GB each). I removed the duplicates from `data2`, but I need to keep the duplicates in `data1`, as I wish to use them for further calculations per observation in `data1`.
I initially get the well-documented error:

> Check for duplicate key values in `i`, each of which join to the same group in `x` over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that `j` runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with `allow.cartesian=TRUE`. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
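For reference, one way to count duplicate key values on each side of the join (a toy sketch with made-up tables `d1` and `d2` standing in for my data; `ID` is the join column from my merge call):

```r
library(data.table)

# Toy stand-ins for the real tables
d1 <- data.table(ID = c(1, 1, 2, 3), v1 = 1:4)
d2 <- data.table(ID = c(2, 2, 3), v2 = 5:7)

# Duplicate join-key values on each side:
# duplicates in d1 are kept on purpose; duplicates in d2 cause the fan-out
sum(duplicated(d1$ID))  # 1
sum(duplicated(d2$ID))  # 1
```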
As a solution (I looked at several topics, such as here, here, and here), I included `allow.cartesian=TRUE`, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I expect: `data1` now has 50 million observations, although I specify `all.x=TRUE`.
My code is:

```r
require(data.table)

# Remove duplicates before merge
data2 <- unique(data2)

# Merge
data1 <- merge(data1, data2, by = "ID", all.x = TRUE, allow.cartesian = TRUE)
```
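A minimal reproducible sketch of what I think is happening (toy tables `d1` and `d2` with made-up columns `v1` and `v2`): `unique()` removes fully duplicated rows, but rows sharing an `ID` that differ in another column survive, so the left join can still produce more rows than `d1` has.

```r
library(data.table)

d1 <- data.table(ID = c(1, 2, 3), v1 = c(10, 20, 30))
d2 <- data.table(ID = c(2, 2, 3), v2 = c("a", "b", "a"))

# The two ID == 2 rows differ in v2, so unique() keeps both
d2 <- unique(d2)

res <- merge(d1, d2, by = "ID", all.x = TRUE, allow.cartesian = TRUE)
nrow(res)  # 4 rows, one more than d1, because ID == 2 matched twice
```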
Any advice on how to merge this is very welcome.