
I have two data frames that I am trying to merge.

df1 has 20,015 rows and 7 variables. df2 has 8,534,664 rows and 29 variables.

When I do full_join(df1, df2, by = "KEY") I get Error: cannot allocate vector of size 891.2 Mb, so I set memory.limit(1000000) and still get the same error. I run full_join() while watching the memory graph in the Windows Task Manager, and usage increases exponentially. I have also used gc() throughout my code.

My question is: is there a function out there which can join the first 1,000,000 rows, take a break, then join the next 1,000,000 rows, and so on until all rows have been joined?

Is there a function to run the full_join() in batches?
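For reference, there is no built-in batched full_join(), but the idea can be approximated by hand: join df2 against df1 in chunks, then append the rows of each table whose keys never matched. This is only a sketch (join_in_batches is a made-up helper, not an existing dplyr function), it assumes the non-key columns of df1 and df2 have distinct names, and note that it only avoids one giant allocation; the combined result still has to fit in memory unless each piece is written to disk as it is produced:

```r
library(dplyr)

# Hypothetical helper: full join by processing df2 in chunks of chunk_size rows.
# A full join is the union of the inner join plus the unmatched rows from each
# side, and the inner join can be computed chunk by chunk.
join_in_batches <- function(df1, df2, by, chunk_size = 1e6) {
  n <- nrow(df2)
  starts <- seq(1, n, by = chunk_size)
  matched <- lapply(starts, function(s) {
    chunk <- df2[s:min(s + chunk_size - 1, n), , drop = FALSE]
    inner_join(df1, chunk, by = by)   # rows whose keys appear in both tables
  })
  bind_rows(
    bind_rows(matched),
    anti_join(df1, df2, by = by),     # df1 rows whose key never appears in df2
    anti_join(df2, df1, by = by)      # df2 rows whose key never appears in df1
  )
}
```

The result contains the same rows as full_join(df1, df2, by = by), though possibly in a different order.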

user113156
    try joining using data.table, which is very neat on memory usage... see here how to perform: https://stackoverflow.com/questions/15170741/how-does-one-do-a-full-join-using-data-table#15170956 – Wimpel Sep 22 '18 at 20:36

1 Answer


This is just to report the time it takes to run full_join and data.table's merge on a 64-bit Windows system (Intel ~3.5 GHz, 120 GB RAM). Hope it helps, at least as a reference for your case.

library(data.table)
df1 <- data.table(KEY = sample(1:800, 20015, replace = TRUE),
                  matrix(rnorm(20015 * 7), 20015, 7))       # ~1.1 MB
df2 <- data.table(KEY = sample(1:800, 8534664, replace = TRUE),
                  matrix(rnorm(8534664 * 29), 8534664, 29)) # ~1.9 GB

library(dplyr)
tick <- Sys.time()
df_join <- full_join(df1, df2, by = "KEY")  # ~58.1 GB in memory
tock <- Sys.time() - tick                   # ~1.85 min

# With data.table's merge: all = TRUE makes it a full (outer) join, and
# allow.cartesian = TRUE permits the many-to-many KEY matches.
tick <- Sys.time()
df_join <- merge(df1, df2, by = "KEY", all = TRUE,
                 allow.cartesian = TRUE)    # ~58.1 GB in memory
tock <- Sys.time() - tick                   # ~5.75 min
Chriss Paul