
My understanding of the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.

Is there some way to determine a threshold for when to use join() over merge(), without resorting to a heuristic approach?

NM_
  • if speed is the main concern you'll probably be better off working either with `data.table` or databases. – moodymudskipper Mar 15 '19 at 23:32
  • You may want to directly compare execution speed, see for instance this thread: https://stackoverflow.com/questions/6262203/measuring-function-execution-time-in-r – user2363777 Mar 16 '19 at 00:02
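To make the `data.table` suggestion in the comments concrete, here is a minimal sketch of a keyed join; the table sizes and column names are illustrative, not from the question:

library(data.table)

# two tables sharing an 'id' key (sizes and column names are arbitrary, for illustration only)
x <- data.table(id = 1:1e6, value_x = rnorm(1e6))
y <- data.table(id = 1:1e6, value_y = rnorm(1e6))

# setting the key lets data.table use a fast binary-search join
setkey(x, id)
setkey(y, id)

# keyed inner-style join; element [3] of system.time() is the elapsed time
system.time(res <- x[y, nomatch = 0])[3]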

1 Answer


I am sure you will be hard pressed to find a "hard and fast" rule for when to switch from one function to the other. As others have mentioned, there is a set of tools in R to help you measure performance. object.size and system.time are two such functions, which look at memory usage and execution time, respectively. One general approach is to measure both directly over an increasingly large data set. Below is one attempt at this: we create two data frames with an 'id' column and a random set of numeric values, let them grow, and measure how the join changes. I'll use inner_join here since you mentioned dplyr, and measure time as "elapsed" (wall-clock) time, the third element returned by system.time().

library(tidyverse)

set.seed(424)

#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)

#counter and empty result holders
n <- 1
l1 <- c()
l2 <- c()

#test for inner join in dplyr
for(i in growth){
  x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))

  test <- inner_join(x,y, by = c('id' = 'id'))
  l1[[n]] <- object.size(test)

  print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
  l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]

  n <- n+1

}

#reset the counter and result holders for merge
n <- 1
l3 <- c()
l4 <- c()

#test for merge
for(i in growth){
  x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))

  test <- merge(x,y, by = c('id'))
  l3[[n]] <- object.size(test)
  # print(object.size(test))

  print(system.time(test <- merge(x,y, by = c('id')))[3])
  l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]

  n <- n+1

}

#plotting output (some coercion may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"), 
                  data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))

plot$size_MB <- plot$size_bytes/1000000

ggplot(plot, aes(x = size_MB, y = time_sec, color = id)) + geom_line()

[Plot: elapsed time (seconds) vs. result size (MB) for merge and inner_join]

merge seems to perform worse out of the gate, but its run time really takes off around ~20 MB. Is this the final word on the matter? No. But testing like this can give you an idea of how to choose between the two functions.
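If you want more stable timings than a single system.time() call, one option is the microbenchmark package, which repeats each expression and summarizes the distribution of run times. A minimal sketch (not part of the benchmark above; the size and repetition count are arbitrary):

library(microbenchmark)
library(dplyr)

# one fixed size, repeated several times to smooth out run-to-run noise
x <- data.frame(id = 1:100000, value = rnorm(100000))
y <- data.frame(id = 1:100000, value = rnorm(100000))

microbenchmark(
  merge      = merge(x, y, by = "id"),
  inner_join = inner_join(x, y, by = "id"),
  times = 10
)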

Peter_Evan