
So I have seen some answers regarding joining with different packages. I need to join several data frames, which is at the moment a really expensive operation for my computer to handle with the base "merge" algorithm.

My data:

library(data.table)

list.of.data.frames = list( data.table("P1" = c(1:3,1:3), "P2" = c(rep(2.5,3),rep(1.5,3)), "D1" = c(3.5,4.5,5.5,2.5,3.5,4.5)),
                        data.table("P1" = c(1:3,1:3), "P3" = c(rep(2,3),rep(3,3)), "D3" = c(3:5,4:6)),
                        data.table("P1" = c(2:4), "P4" = c(2:4))
                        )

I have tried these two approaches:
Using reshape

library(reshape)
merge_recurse(list.of.data.frames)

Using base R

Reduce(function(...) merge(..., all=T), list.of.data.frames)

Output:

    P1  P2  D1 P3 D3 P4
 1:  1 2.5 3.5  2  3 NA
 2:  1 2.5 3.5  3  4 NA
 3:  2 2.5 4.5  2  4  2
 4:  2 2.5 4.5  3  5  2
 5:  3 2.5 5.5  2  5  3
 6:  3 2.5 5.5  3  6  3
 7:  1 1.5 2.5  2  3 NA
 8:  1 1.5 2.5  3  4 NA
 9:  2 1.5 3.5  2  4  2
 10: 2 1.5 3.5  3  5  2
 11: 3 1.5 4.5  2  5  3
 12: 3 1.5 4.5  3  6  3
 13: 4  NA  NA NA NA  4

I am trying to find the fastest method to execute this, one that can easily be extended to work on a list. With data.table I got stuck on the keys, because each data.frame (or data.table) can have different columns, and some of them may or may not intersect with the columns of the other tables.
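
For illustration, a keyed data.table attempt might look roughly like this (just a sketch; it assumes "P1" is the only column shared by every table, which is not guaranteed in my real data):

# Key every table on the shared column by reference, then let merge() use the key
invisible(lapply(list.of.data.frames, setkeyv, "P1"))
Reduce(function(x, y) merge(x, y, all = TRUE), list.of.data.frames)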

Also, I saw the data.table::merge.data.table() function, but I don't know if it is from an older version, because I cannot find it in my console.

Any idea how to proceed?

Thank you in advance

Saul Garcia
  • What you did there was a `data.table` merge rather than a base one. `merge` is generic. – David Arenburg Mar 06 '17 at 12:38
  • See `methods(merge)` as well as `data.table:::merge.data.table`. – Axeman Mar 06 '17 at 13:05
  • Fyi, you can make it infinitely faster by using long-form data: `L[[3]][, "Dn" := .(NA)]; rbindlist(L, use.names=FALSE, idcol = TRUE)` where `L` is your list. That is, if speed is really important, you should reconsider managing your data in this (rather messy) way. – Frank Mar 06 '17 at 15:22

1 Answer


Of note: neither of your included approaches works for me, and I cannot manipulate the reshape approach to get it to run.

As @David mentioned in a comment, you are already using merge.data.table in your base approach, since merge is a generic that hands off to more specific methods (in this case, the one for a data.table).
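
If you want to confirm the dispatch yourself, a quick check (not part of the benchmarks below) is:

class(list.of.data.frames[[1]])  # "data.table" "data.frame" -- so merge() dispatches to merge.data.table
methods(merge)                   # lists merge.data.table alongside merge.data.frame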

This answer has a version for doing a multiple merge using dplyr's left_join, which can be modified here:

library(dplyr)

Reduce(function(dtf1, dtf2) full_join(dtf1, dtf2), list.of.data.frames)

We can explicitly test the various approaches using the microbenchmark package. I am adding a version where I tell full_join which column to join by instead of having it figure that out (though that would not work if each join needed to use a different set of columns to match). I am also including @Axeman's suggestion to use reduce from purrr instead of Reduce.

library(microbenchmark)
library(purrr)

microbenchmark(
  base = Reduce(function(...) merge(..., all = TRUE, by = "P1"), list.of.data.frames)
  , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list.of.data.frames)
  , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list.of.data.frames)
  , dplyrPurrr = reduce(list.of.data.frames, full_join, by = "P1")
)

gives:

Unit: microseconds
       expr      min        lq      mean    median       uq      max neval cld
       base 2911.495 3025.2325 3227.3762 3077.8530 3211.995 5513.166   100   c
      dplyr  946.367 1022.0960 1087.8771 1066.3615 1131.675 1429.581   100  b 
   dplyrSet  443.828  485.3235  543.7130  511.1545  553.040 1918.009   100 a  
 dplyrPurrr  465.329  494.6615  548.7349  515.6695  551.943 1804.394   100 a  

So, full_join is roughly 3 times faster than merge, and explicitly setting the column to join on cuts the remaining time roughly in half. reduce does not cut the time further, though it does make for much cleaner code.

We can (and should, as @Frank points out), confirm that the returned values are the same. There is some debate about what "same" may mean for the purposes of this type of result, so I am using compare from the compare package to check for differences (each of the full_join approaches is completely identical, so I am only showing the interesting one):

library(compare)

compare(
  Reduce(function(...) merge(..., all = TRUE, by = "P1"), list.of.data.frames)
  , reduce(list.of.data.frames, full_join, by = "P1")
  , allowAll = TRUE
  )

returns:

TRUE
  sorted
  renamed rows
  dropped row names
  dropped attributes

So, the values are the same, but they are in a different order (needed to be sorted), have different row names (needed to be renamed/dropped), and have different attributes (needed to be dropped). If any of those are concerning for the use case, then a user would need to determine which approach gave the sort/rownames/attributes that they wanted.
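
If you would rather not pull in the compare package, a rough order- and attribute-insensitive check can be done with base tools. This is just a sketch (base_res and dplyr_res are placeholder names for the two merged results above):

# Hypothetical helper: coerce to a plain data.frame, sort by every column, drop row names
normalize <- function(d) {
  d <- as.data.frame(d)
  d <- d[do.call(order, d), , drop = FALSE]
  rownames(d) <- NULL
  d
}

base_res  <- Reduce(function(...) merge(..., all = TRUE, by = "P1"), list.of.data.frames)
dplyr_res <- reduce(list.of.data.frames, full_join, by = "P1")
isTRUE(all.equal(normalize(base_res), normalize(dplyr_res), check.attributes = FALSE))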

As @DavidArenburg points out, different sizes may lead to different results. So, here is some code that checks these for different sizes.

medianTimes_dataTable <- lapply(10^(1:5), function(n){
  list_of_longer_ones = list( data.table("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
                              data.table("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
                              data.table("P1" = sample(1:n), "P4" = rnorm(n))
  )


  microbenchmark(
    base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
    , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
    , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
    , dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
  ) %>%
    group_by(expr) %>%
    summarise(median = median(time)) %>%
    mutate(nRows = n)
}) %>%
  bind_rows

library(tidyr)  # for spread()

medianTimes_dataTable %>%
  mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
  spread(nRows, median)

gives (median times in nanoseconds):

        expr     `     10`     `    100`     `  1,000`     ` 10,000`     `100,000`
*     <fctr>         <chr>         <chr>         <chr>         <chr>         <chr>
1       base   2,032,614.5   2,059,519.0   2,716,534.0   4,475,653.5  29,655,330.0
2      dplyr   1,147,676.5   1,205,818.0   2,369,464.5  11,170,513.5 154,767,265.5
3   dplyrSet     537,434.0     613,785.5   1,602,681.0  10,215,099.5 145,574,663.0
4 dplyrPurrr     540,455.5     626,076.5   1,549,114.0  10,040,808.5 145,086,376.0

So, the dplyr advantage slips away somewhere between 1,000 and 10,000.

@David also asked about the impact of data.table vs data.frame, so I ran the same code on data.frames:

medianTimes_dataFrame <- lapply(10^(1:5), function(n){
  list_of_longer_ones = list( data.frame("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
                              data.frame("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
                              data.frame("P1" = sample(1:n), "P4" = rnorm(n))
  )


  microbenchmark(
    base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
    , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
    , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
    , dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
  ) %>%
    group_by(expr) %>%
    summarise(median = median(time)) %>%
    mutate(nRows = n)
}) %>%
  bind_rows

medianTimes_dataFrame %>%
  mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
  spread(nRows, median)

gives (median times in nanoseconds):

        expr     `     10`     `    100`     `  1,000`     ` 10,000`     `100,000`
*     <fctr>         <chr>         <chr>         <chr>         <chr>         <chr>
1       base     806,009.5     973,636.0   2,046,009.5  19,088,482.5 519,159,607.0
2      dplyr   1,092,747.0   1,242,550.5   2,010,648.5  10,618,735.5 156,958,793.0
3   dplyrSet     526,030.0     616,996.0   1,343,766.5   9,767,689.5 147,919,013.5
4 dplyrPurrr     541,182.0     624,208.0   1,351,910.0   9,711,435.0 146,379,176.5

Here, full_join continues to beat merge at every size -- which suggests that the merge.data.table method is much faster than the merge.data.frame method.
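
To isolate that dispatch effect directly, one option (a quick sketch I have not tuned; the sizes and column names are arbitrary) is to benchmark a single two-table merge with data.table inputs versus plain data.frame inputs:

# Same data, same merge() call -- only the class of the inputs differs
n <- 1e5
dt1 <- data.table(P1 = sample(1:n), X = rnorm(n))
dt2 <- data.table(P1 = sample(1:n), Y = rnorm(n))
df1 <- as.data.frame(dt1)
df2 <- as.data.frame(dt2)

microbenchmark(
  merge.data.table = merge(dt1, dt2, all = TRUE, by = "P1"),
  merge.data.frame = merge(df1, df2, all = TRUE, by = "P1"),
  times = 20
)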

Mark Peterson
  • left join is not the same as `all = TRUE` in `merge`. Also, you called your first version "base", are you joining on data.frames or data.tables there? – David Arenburg Mar 06 '17 at 12:51
  • Thank you @DavidArenburg -- this is what I get for answering anything before I finish my coffee. I switched it to `full_join` instead, which should do the trick. Nice catch. – Mark Peterson Mar 06 '17 at 12:54
  • Going full `hadleyverse`: `reduce(list.of.data.frames, full_join, by = "P1")`. (Just syntax differences.) – Axeman Mar 06 '17 at 13:02
  • Thanks for reminding me of `reduce` @Axeman. Added to the microbenchmark. – Mark Peterson Mar 06 '17 at 13:09
  • I still don't understand though, are you testing this on data.frames or data.tables? Also, perhaps it is better to test on a bigger list. – David Arenburg Mar 06 '17 at 13:12
  • @DavidArenburg -- I was using this on the data the OP provided. You are right though that both scale and data type matter. I updated to address both of those now. – Mark Peterson Mar 06 '17 at 13:48
  • A standard benchmarking step is to ensure that each thing being tested produces the same result, but that seems to be missing here. Using your names for it, `all.equal(setkey(base,NULL), data.table(dplyr))` is not TRUE. – Frank Mar 06 '17 at 13:49
  • Good point @Frank -- this is one of my first attempts to answer an explicitly benchmarking question. Looking at the results, the difference appears to be the sort order (merge appears to put all of the 1's together, `full_join` does not). The returned data, however, appear to be identical otherwise. I've added results from a comparison that makes that testing explicit. – Mark Peterson Mar 06 '17 at 14:09
  • This was an outstanding response. I appreciate it very much @Mark Peterson. It cleared my doubts. Thank you for making the approach for different sizes of data. – Saul Garcia Mar 07 '17 at 08:41