Of note: neither of your included approaches works for me, and I cannot get the reshape approach to run, even with modification.
As @David mentioned in a comment, you are already using merge.data.table in your base approach, since merge is a generic that hands off to more specific methods (in this case, the method for a data.table).
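You can see that dispatch for yourself with a quick check (a sketch, assuming list.of.data.frames holds data.tables as in the question):

library(data.table)

dt <- data.table("P1" = 1:3, "P2" = rnorm(3))

class(dt)       # "data.table" "data.frame" -- a data.table inherits from data.frame
methods(merge)  # includes merge.data.table once data.table is loaded

# Because the first argument is a data.table, merge(dt, ...) dispatches to
# merge.data.table rather than merge.data.frame.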
This answer has a version of a multiple merge using dplyr's left_join, which can be adapted here:
Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list.of.data.frames)
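For reference, the benchmarks below assume roughly the following setup. The real list.of.data.frames comes from the question; this stand-in (values hypothetical) just mirrors its structure, and the library calls cover the packages used throughout:

# Packages assumed loaded for everything below
library(data.table)
library(dplyr)
library(purrr)
library(tidyr)
library(microbenchmark)
library(compare)

# Stand-in for the question's list.of.data.frames: three data.tables sharing the key P1
list.of.data.frames <- list(
  data.table("P1" = 1:10,         "P2" = rnorm(10), "D1" = rnorm(10)),
  data.table("P1" = sample(1:10), "P3" = rnorm(10), "D3" = rnorm(10)),
  data.table("P1" = sample(1:10), "P4" = rnorm(10))
)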
We can explicitly test the various approaches using the microbenchmark package. I am adding a version where I tell full_join which column to join by instead of having it figure that out (though that would not work if each join needed to match on a different set of columns). I am also including @Axeman's suggestion to use reduce from purrr instead of Reduce.
microbenchmark(
base = Reduce(function(...) merge(..., all=T, by = "P1"), list.of.data.frames)
, dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list.of.data.frames)
, dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list.of.data.frames)
, dplyrPurrr = reduce(list.of.data.frames, full_join, by = "P1")
)
gives:
Unit: microseconds
expr min lq mean median uq max neval cld
base 2911.495 3025.2325 3227.3762 3077.8530 3211.995 5513.166 100 c
dplyr 946.367 1022.0960 1087.8771 1066.3615 1131.675 1429.581 100 b
dplyrSet 443.828 485.3235 543.7130 511.1545 553.040 1918.009 100 a
dplyrPurrr 465.329 494.6615 548.7349 515.6695 551.943 1804.394 100 a
So, full_join is roughly 3 times faster than merge, and setting the column to join by cuts roughly another half off the time. reduce does not cut the time, though it does make for much cleaner code.
We can (and should, as @Frank points out) confirm that the returned values are the same. There is some debate about what "same" may mean for this type of result, so I am using compare from the compare package to check for differences (the full_join approaches are all completely identical to each other, so I am only showing the interesting comparison):
compare(
Reduce(function(...) merge(..., all=T, by = "P1"), list.of.data.frames)
, reduce(list.of.data.frames, full_join, by = "P1")
, allowAll = TRUE
)
returns:
TRUE
sorted
renamed rows
dropped row names
dropped attributes
So, the values are the same, but they are in a different order (they needed to be sorted), have different row names (which needed to be renamed/dropped), and have different attributes (which needed to be dropped). If any of those matter for your use case, you would need to determine which approach gives the sort order, row names, and attributes you want.
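If you want to strip those differences out yourself rather than rely on compare, one sketch (not part of the original comparison, and assuming P1 values are unique) is to normalize both results and then use base all.equal:

normalize <- function(d) {
  d <- as.data.frame(d)   # drop the data.table class and its attributes
  d <- d[order(d$P1), ]   # impose a common row order using the join key
  rownames(d) <- NULL     # discard the differing row names
  d
}

all.equal(
  normalize(Reduce(function(...) merge(..., all = TRUE, by = "P1"), list.of.data.frames)),
  normalize(reduce(list.of.data.frames, full_join, by = "P1"))
)
# should return TRUE once order, row names, and attributes are aligned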
As @DavidArenburg points out, different data sizes may lead to different results. So, here is some code that checks the timings at several sizes.
medianTimes_dataTable <- lapply(10^(1:5), function(n){
list_of_longer_ones = list( data.table("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
data.table("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
data.table("P1" = sample(1:n), "P4" = rnorm(n))
)
microbenchmark(
base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
, dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
, dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
, dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
) %>%
group_by(expr) %>%
summarise(median = median(time)) %>%
mutate(nRows = n)
}) %>%
bind_rows
medianTimes_dataTable %>%
mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
spread(nRows, median)
gives
expr ` 10` ` 100` ` 1,000` ` 10,000` `100,000`
* <fctr> <chr> <chr> <chr> <chr> <chr>
1 base 2,032,614.5 2,059,519.0 2,716,534.0 4,475,653.5 29,655,330.0
2 dplyr 1,147,676.5 1,205,818.0 2,369,464.5 11,170,513.5 154,767,265.5
3 dplyrSet 537,434.0 613,785.5 1,602,681.0 10,215,099.5 145,574,663.0
4 dplyrPurrr 540,455.5 626,076.5 1,549,114.0 10,040,808.5 145,086,376.0
So, the dplyr advantage slips away somewhere between 1,000 and 10,000 rows.
@David also asked about the impact of data.table vs data.frame, so I ran the same code on data.frames:
medianTimes_dataFrame <- lapply(10^(1:5), function(n){
list_of_longer_ones = list( data.frame("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
data.frame("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
data.frame("P1" = sample(1:n), "P4" = rnorm(n))
)
microbenchmark(
base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
, dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
, dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
, dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
) %>%
group_by(expr) %>%
summarise(median = median(time)) %>%
mutate(nRows = n)
}) %>%
bind_rows
medianTimes_dataFrame %>%
mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
spread(nRows, median)
gives
expr ` 10` ` 100` ` 1,000` ` 10,000` `100,000`
* <fctr> <chr> <chr> <chr> <chr> <chr>
1 base 806,009.5 973,636.0 2,046,009.5 19,088,482.5 519,159,607.0
2 dplyr 1,092,747.0 1,242,550.5 2,010,648.5 10,618,735.5 156,958,793.0
3 dplyrSet 526,030.0 616,996.0 1,343,766.5 9,767,689.5 147,919,013.5
4 dplyrPurrr 541,182.0 624,208.0 1,351,910.0 9,711,435.0 146,379,176.5
Here, full_join continues to beat merge even at the larger sizes, which suggests that the merge.data.table method is much better than merge.data.frame (and by a lot).
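That claim can be sanity-checked directly (this check is mine, not part of the benchmarks above): time a single pairwise merge where the only thing that changes is the class of the inputs, so dispatch goes to merge.data.table in one case and merge.data.frame in the other.

n <- 1e5
dt1 <- data.table("P1" = sample(1:n), "X" = rnorm(n))
dt2 <- data.table("P1" = sample(1:n), "Y" = rnorm(n))
df1 <- as.data.frame(dt1)  # same data, data.frame class only
df2 <- as.data.frame(dt2)

microbenchmark(
  data.table = merge(dt1, dt2, all = TRUE, by = "P1"),
  data.frame = merge(df1, df2, all = TRUE, by = "P1"),
  times = 20
)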