I have two extremely large datasets and I am looking for a way to merge them efficiently (an inner join is fine). The solution I came up with is to create named split lists on the merging keys and then merge those lists element by element.
Unfortunately, this solution is still very inefficient.
Is there a way to avoid using dplyr altogether? I believe it is the root of the issue, together with the slow lapply. Would map be a good alternative?
Below you can find a reproducible example. Thanks in advance!
library(tidyverse)
library(data.table)
named_group_split <- function(.tbl, ...) {
  # build one name per group by pasting the group keys together with "-"
  grouped <- group_by(.tbl, ...)
  names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = "-")))
  # split the data into a named list, dropping the grouping columns
  grouped %>%
    group_split(.keep = FALSE) %>%
    rlang::set_names(names)
}
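For clarity, here is a toy illustration of what this helper returns (a small made-up tibble, not the benchmark data): the list elements are named by the pasted group keys, so each piece can be pulled out by name.
# toy example only: split on one key and look an element up by name
toy <- tibble::tibble(g = c("a", "a", "b"), x = 1:3, gname = c("a", "a", "b"))
toy_split <- named_group_split(toy, gname)
names(toy_split)     # "a" "b"
toy_split[["a"]]     # rows with g == "a"; the gname column is dropped by .keep = FALSE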
# FIRST SPLIT LIST
set.seed(1)
db_1 <- data.frame(id_1   = sample(1:10, 10, replace = TRUE),
                   id_2   = sample(LETTERS, 10, replace = TRUE),
                   value1 = runif(10, 1.0, 10.0)) %>%
  data.table() %>%
  dplyr::mutate(id_1name = id_1,   # duplicate the keys so they survive group_split(.keep = FALSE)
                id_2name = id_2) %>%
  named_group_split(id_1name, id_2name)
# SECOND SPLIT LIST
set.seed(2)
db_2 <- data.frame(id_1   = sample(1:10, 1000, replace = TRUE),
                   id_2   = sample(LETTERS, 1000, replace = TRUE),
                   value2 = runif(1000, 1.0, 10.0)) %>%
  data.table() %>%
  dplyr::mutate(id_1name = id_1,
                id_2name = id_2) %>%
  named_group_split(id_1name, id_2name)
# merge the two lists element by element on their shared names
keys <- intersect(names(db_1), names(db_2))
result <- setNames(lapply(keys, function(key) {
  dplyr::left_join(db_1[[key]], db_2[[key]], by = c("id_1", "id_2"))
}), keys) %>%
  do.call(plyr::rbind.fill, .)
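For reference, the plain keyed data.table join I would like this to match looks roughly as follows; raw_1 and raw_2 are placeholder names for the two unsplit tables (the data.frames above before the named_group_split step), and I have not benchmarked this on the real data.
# plain keyed inner join on the unsplit tables (raw_1 / raw_2 are placeholders)
setDT(raw_1)
setDT(raw_2)
setkey(raw_1, id_1, id_2)
setkey(raw_2, id_1, id_2)
result_direct <- merge(raw_1, raw_2, by = c("id_1", "id_2"))  # all = FALSE by default, i.e. an inner join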
What I managed to do so far is the following:
library(purrr)
my_lists <- function(list1, list2) {
  keys <- intersect(names(list1), names(list2))
  # pair up same-named elements; note that c() only concatenates them, it does not join
  map2(list1[keys], list2[keys], c)
}
result <- my_lists(db_1, db_2)
But from there I cannot recreate the single data frame I need...
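Presumably the missing piece is to join each pair of same-named elements and then row-bind the results, something like the sketch below, but I am not sure this is the right (or a faster) approach, and it still relies on dplyr for the join:
# sketch only: join each pair of same-named elements, then stack the results
keys <- intersect(names(db_1), names(db_2))
result2 <- purrr::map2(db_1[keys], db_2[keys],
                       ~ dplyr::inner_join(.x, .y, by = c("id_1", "id_2"))) %>%
  data.table::rbindlist(fill = TRUE)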