Inner join on large dataset best practices

Question

I am trying to merge two large datasets (around 3.5m lines,each) using dplyr::inner_join. I am working on a powerful machine with 40+ cores. I am not sure I am taking advantage of the machine itself as I am not parallelizing the task anyhow. How should I tackle the problem, which is taking a lot to run?

Best

I was having the same issue [recently](https://stackoverflow.com/questions/57904212/r-fill-new-column-based-on-interval-from-another-dataset-lookup/) and you really have to use `data.table`. Theres some generally tips [here](https://community.rstudio.com/t/error-memory-exhausted-limit-reached/8928) and [here](https://community.rstudio.com/t/for-loop-or-if-someone-has-a-better-idea-on-large-dataset/10415/8) on merging large datasets — user63230, Oct 16 '19 at 12:12

Sinh Nguyen · Answer 1 · 2021-05-18T06:39:55.047

I don't think 3.5M inner join going to have performance issues unless your two final datasets will be 3.5M * 3.5M after the join due to duplication of key columns in your datasets (duplicated values of joined columns)

Normally in R, there is no functions that will utilized multiple cores. To do it you will have to divide data in batch that could be process separately then combined final results together and calculated further. Here are pseudo code using library dplyr & doParallel

library(dplyr)
library(doParallel)

# Parallel configuration #####
cpuCount <- 10
# Note that doParallel will replicated your environment to and process on multiple core
# so if your environment is 10GB memory & you use 10 core
# it would required 10GBx10=100GB RAM to process data parallel
registerDoParallel(cpuCount)

data_1 # 3.5M rows records with key column is id_1 & value column value_1
data_2 # 3.5M rows records with key columns are id_1 & id_2

# Goal is to calculate some stats/summary of value_1 for each combination of id_1 + id_2
id_1_unique <- unique(data_1$id_1)
batchStep <- 1000
batch_id_1 <- seq(1, length(id_1_unique )+batchStep , by=batchStep )

# Do the join for each batch id_1 & summary/calculation then return the final_data
# foreach will result a list, for this psuedo code it is a list of datasets
# which can be combined use bind_rows
summaryData <- bind_rows(foreach(index=1:(length(batch_id_1)-1)) %dopar% {
    batch_id_1_current <- id_1_unique[index:index+batchStep-1]
    batch_data_1 <- data_1 %>% filter(id_1 %in% batch_id_1_current)
    joined_data <- inner_join(batch_data_1, data_2, by="id_1")
    final_data <- joined_data %>%
        group_by(id_1, id_2) %>%
        #calculation code here
        summary(calculated_value_1=sum(value_1)) %>%
        ungroup()
    return(final_data)
})

hi @Sinh, how did you arrive at `batch_data_2`? Can you edit your code to show that? Thanks. — William, May 18 '21 at 05:22
Sorry my bad - there is no `batch_data_2` just `inner_join` with `data_2` which automatically limitted to only rows that matched `batch_data_1`. Depend on your memory constraint if the two dataset are too big to have both of them loaded together - you may want to separate them, saving on disk and process each batch file separately. — Sinh Nguyen, May 18 '21 at 06:42

GenesRus · Answer 2 · 2019-10-12T00:59:21.960

You should try the data.table package, which is much more efficient than dplyr for large datasets. I've copied over the inner join code from here.

library(data.table)
DT <- data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X  <- data.table(x=c("c","b"), v=8:7, foo=c(4,2))
DT[X, on="x", nomatch=0] # inner join
                         # SELECT DT INNER JOIN X ON DT$x = X$x

Althought data.table doesn't use parallelization, it will be faster than inner_join and the best option to the best of my knowledge.

Inner join on large dataset best practices

2 Answers2