
I was wondering if there is a memory-efficient way to join n data.tables (or data frames). For example, if I have the following four data.tables:

library(data.table)

df1 = data.table(group = c(1L,2L,3L), value  = rnorm(3), key = "group")
df2 = data.table(group = c(2L,1L,3L), value2 = rnorm(3), key = "group")
df3 = data.table(group = c(3L,2L,1L), value3 = rnorm(3), key = "group")
df4 = data.table(group = c(1L,3L,2L), value4 = rnorm(3), key = "group")

I could merge them like so:

merge(df1,merge(df2,merge(df3,df4)))

but that does not seem like an optimal solution, and I might potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge into memory? Is there an already accepted way, outside of data.table, to do this?

Matthew Lundberg
Boudewijn Aasman
  • One solution is to put all of the data frames in a list, then use `Reduce()` to merge them. See http://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list – ulfelder Jun 22 '16 at 16:45
  • `Reduce(function(x, y) x[y], list(df1, df2, df3, df4))` seems to be a little faster on the data you provided, but you would have to test this out on larger inputs to see how it scales. – nrussell Jun 22 '16 at 16:49
  • @ulfelder That's just syntactic sugar. The same merge operations will take place. – Matthew Lundberg Jun 22 '16 at 16:49
  • It depends on your actual application. If your merging variables match up exactly in all tables, as here, then just copy the cols over. – Frank Jun 22 '16 at 16:53
  • @Frank, you're right, though in my actual application there are very large discrepancies. I probably should have demonstrated that in the example. – Boudewijn Aasman Jun 22 '16 at 16:55
  • @BoudewijnAasman Perhaps my edit helps. – Matthew Lundberg Jun 22 '16 at 16:56
  • I guess there are many questions you'd have to consider. The re-arranging Matthew did isn't a real problem on its own -- just `setorder` before copying over. If you know that each `group` value appears in at most one row, then you can just make the final table (with one row per group) and then do a bunch of merge-assigns. If you know that each table just has a single extra column (patterned `value*` as here) then you can `rbindlist` them and then reshape... I guess there are other cases as well. – Frank Jun 22 '16 at 16:59

2 Answers


Here are some other options you may have, depending on your data, apart from the obvious path of doing a ton of merges: in a loop, with `Reduce`, or with Hadley's `join_all`/`merge_all`/wrap_em_all_up.

These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmark. First, some setup:

DFlist = list(df1, df2, df3, df4)  # the tables to combine
bycols = key(DFlist[[1]])          # the join columns, here "group"

I'll assume the tables are all keyed by the bycols.
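
If they aren't, you could set the keys by reference first; a minimal sketch, assuming every table actually contains the bycols:

# setkeyv sorts each table and sets its key by reference, without copying
for (DT in DFlist) setkeyv(DT, bycols)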

Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:

DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)  # .id identifies the source table

If for some reason you really want the data in wide format, you can dcast:

dcast(DFlong,
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")),  # e.g. "group ~ .id"
  value.var = setdiff(names(DFlong), c(bycols, ".id"))           # the stacked value column(s)
)

Data.table and R work best with long-format data, though.
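
For instance, per-table summaries are one-liners on the long data. A sketch (with use.names = FALSE, the stacked value column keeps the first table's name, value):

DFlong[, .(avg = mean(value)), by = .id]  # one summary row per source table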

Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:

DF = DFlist[[1]][, bycols, with=FALSE]  # start from the key columns
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)          # this table's value columns
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]  # copy them over by position
}
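
Since this relies on the rows lining up, it may be worth checking that assumption first; a minimal sanity check:

# assumption: the key columns are identical, row for row, in every table
stopifnot(all(vapply(
  DFlist,
  function(DT) identical(DT[, bycols, with = FALSE],
                         DFlist[[1]][, bycols, with = FALSE]),
  logical(1)
)))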

Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:

DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))  # all observed key combos
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[DFlist[[k]], (newcols) := mget(newcols)]  # keyed join, then assign by reference
}
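
If the tables weren't keyed, the same merge-assign should work with an explicit on= (available since data.table 1.9.6); a sketch:

for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[DFlist[[k]], (newcols) := mget(newcols), on = bycols]  # join columns given explicitly
}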
Frank

In dplyr:

As your tables all share the same grouping column (and you have scrubbed out the NAs), you can just bind the rows together and summarise.

library(dplyr)

DF <- bind_rows(df1, df2, df3, df4) %>%  # stack; unmatched value columns fill with NA
    group_by(group) %>%
    summarise_each(funs(na.omit))        # collapse each group, dropping the filler NAs
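
As the comments below point out, this assumes the data contain no legitimate NAs. For example, this input (borrowed from Frank's comment below) makes the pipeline fail with `Error: expecting a single value`:

df3 = data.table(group = c(3L,2L,1L), value3 = rnorm(3), key = "group")[2, value3 := NA]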

Otherwise there is the simple, local-minimum solution: it still performs successive pairwise joins, but at least coding in this dialect saves shaving a few layers off your own onion.

DF <-
    df1 %>%
    full_join(df2) %>%  # joins on the shared column, group, by default
    full_join(df3) %>%
    full_join(df4)
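
To generalize to n tables, the chain can be folded with `Reduce`, echoing the suggestion in the comments on the question:

DF <- Reduce(full_join, list(df1, df2, df3, df4))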

As dplyr's backend runs in C++ rather than in R itself, it should be fast. I unfortunately am unable to speak to the efficiency of its memory usage.

(For similar situations see: R: Updating a data frame with another data frame, dplyr solution.)

leerssej
    If some cols have legit NAs, the first approach might get weird. – Frank Jun 22 '16 at 18:02
  • I am testing how it responds to NA's and (un)fortunately am not finding any problems. `df3 = data.table(group = c(3L,2L),value3 = rnorm(2),key = "group")` – leerssej Jun 22 '16 at 18:40
  • By NAs, I mean NAs :) Try `df3 = data.table(group = c(NA,2L,1L),value3 = rnorm(3),key = "group")` or with `df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")[2, value3 := NA]`. Both of these lead to an error with your first approach on my computer. – Frank Jun 22 '16 at 18:43
Hmm. True: `Error: expecting a single value`. At least it doesn't return incorrect values. Will add caveat above. – leerssej Jun 22 '16 at 18:55