data.table dplyr full_join equivalent modify in place

Question

I recently discovered data.table and I love its ability to modify values in place, because I am usually working with datasets sized around 10-15GB.

While I was able to do a left/right join with modification in place, I could not find a solution how to do it with a full(outer) join (i.e. take all values from table A and all values from table B and fill missing with NA, like dplyr::full_join)

I found a stackoverflow answer here that shows how to do this for a left/right join:

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(c = 2:3, b = 13:14)

B[A, b.A:= i.b, on = c(c = "a")]

I know that it is possible to use merge(A, B, by.x = "a", by.y = "c", all = T) for a full_join, but then I need to assign the result to a variable again and this would require a lot of memory in my case. I also found another interesting approach here, but I would also need to create a new variable in this case.

Any help greatly appreciated!

Not really a technical explanation but I think of it as: whenever you use `:=` in a data.table to update by reference/in place, you either add, change, or remove one or more columns but you cannot modify the number of rows of the table. In a full join, you'd need to modify the number of rows and hence you cannot do it with `:=` as it's currently implemented. Hence, you have to assign the result to a new object to keep it. — talat, Nov 29 '17 at 14:19
Hmm, that makes sense. But that would make combining two datasets that have say all but one column in common extremely memory intensive if I understand R's copy-on-change style correctly. Any ideas on how to deal with this problem are more than welcome:) — Christoph, Nov 29 '17 at 14:23
Related to my comment https://stackoverflow.com/questions/10790204/ — talat, Nov 29 '17 at 14:34
Re "many columns in common" you might need to rethink your data structure. Hadley's tidy data paper is a good place to start on that https://www.jstatsoft.org/article/view/v059i10 (Hard to be more specific without a reproducible example.) — Frank, Nov 29 '17 at 16:20
Ah, sry ... I wrote exactly the opposite of what I wanted to say... I often have two datasets that have say 1-2 columns in common and I want to do joins by reference based on those columns. Sry for the confusion. — Christoph, Nov 29 '17 at 16:42
Iiuc this question boils down to "can I increase the number of rows in a `data.table` without copying the data to a new place", and the answer is (by construction) no. Each column is stored in contiguous chunks of memory, and there is no guarantee that there is available space after it. Please correct me if I didn't understand the question correctly. — eddi, Nov 29 '17 at 19:26
Yes, that is what I was referring to. Although as an intermediate step it would also be helpful to do a left/right join like I showed in the example above, but were all columns are added automatically without copying the data. (e.g. := for all columns in A, regardless of the total number). I found some workarounds using Set and loops, but I would love to do it in a syntax similar to above. Thx for your comments! — Christoph, Nov 30 '17 at 10:41

data.table dplyr full_join equivalent modify in place

0 Answers0