Merge with allow.cartasian=TRUE results in too many observations

Question

I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).

I removed the duplicates from data2. I would need to keep the duplicates in data 1, as I wish to use them for further calculations per observation in data1.

I initially get the well documented error:

Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

As a solution (I looked at several topics, such here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I wish (data1 now has 50 million observations, although is specify all.x=TRUE).

My code is:

#Remove duplicates before merge
data2 <- unique(data2)

#Merge
require(data.table)
data1 <- merge(data1, data2, by="ID", all.x=TRUE, allow.cartesian=TRUE)

Any advice on how to merge this, is very welcome.

Could you please specify the type of join you are trying to achieve with the merge? It's not likely that you actually want a cartesian merge. It's a cross-product. You're running out of memory because the result would be in the order of 35M^2 data points. See http://www.dofactory.com/Images/sql-joins.png — Mekki MacAulay, Mar 14 '16 at 13:52
I would need a left join. If I try without the allow.cartesian, I get an error message ("Check for duplicate...", as mentioned in the post above) — Res1234, Mar 14 '16 at 14:12
examples not reproducible, read how to ask questions on r tag — jangorecki, Mar 14 '16 at 17:33

score 1 · Answer 1 · answered Mar 14 '16 at 14:52

In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns are being joined that have the same name as existing columns.

The solution is to temporarily rename the key/index column in your left (data1) dataset As a general rule, having duplicate column names is "bad" in R because it will confuse a lot of functions. Many functions silently call make.unique() to de-duplicate column names to avoid confusion.

If you have duplicate ID columns in data1 change them with colnames(data1) <- make.unique(colnames(data1)), which will set them to ID.1, ID.2, etc. Then do your merge (make sure to specify by.x="ID.1", by.y="ID" because of the rename. By default, duplicate columns that are merged will be appended with .y although you can specify the suffix with the suffixes= option (See Merge helpfile for details)

Lastly, it's worth noting that the merge() function in the data.table package tends to be a lot faster than the base merge() function with similar syntax. Seepage 47 of the data.table manual.

Merge with allow.cartasian=TRUE results in too many observations

1 Answers1