
I am running a data pipeline in bash (on Debian GNU/Linux 10 [buster]) that executes 19 R cleaning scripts on 19 separate raw data sets and then merges these into a large dataset of 89 variables and roughly 300 million observations. In the final (merge) script I use data.table where I can:

d_full <- d_merging[d_full, on = c("id", "date")]

but occasionally I need a full join, and the base R solution is a bit more direct and lets me avoid creating unique keys, as suggested here:

d_full <- merge(d_full, d_merging, by = c("id", "date"), all = TRUE)

which should dispatch to the data.table merge method anyway. I run garbage collection with gc() after each merge.
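For context, the merge step follows roughly the pattern sketched below. This is a simplified sketch only: in reality some of the joins use the native [ syntax instead of merge(), and the file names are placeholders, not the real pipeline paths.

library(data.table)

# Simplified sketch of the merge loop; file names are placeholders
d_full <- fread("clean/dataset_01.csv")

for (f in sprintf("clean/dataset_%02d.csv", 2:19)) {
  d_merging <- fread(f)
  d_full <- merge(d_full, d_merging, by = c("id", "date"), all = TRUE)
  rm(d_merging)
  gc()  # release memory held by the dropped intermediate before the next merge
}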

The code runs without issue except for "cannot allocate vector of size x GB" errors, which appear earlier and earlier as the data grows. Increasing the machine's memory makes them go away; however, I am now using a fairly large machine (a Google Cloud VM of type n2-highmem-128, with 128 vCPUs and 864 GB of memory). Even though this works, I would like to avoid depending on such a large amount of memory, especially since the data will only keep growing.

Is there a more memory-efficient method in R that I am not aware of? Would creating a unique key for each dataset, so that I can use data.table's native joining syntax, really make much of a difference? Or would I be better off executing this step in another language, possibly SQL or bash? I am very new to bash, but happy to learn more if it would be worthwhile.
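For reference, my guess at what a full join in native data.table syntax would look like is the sketch below (untested): build the union of the (id, date) pairs, then join both tables onto it.

library(data.table)

# Sketch of a full join using only data.table's [ syntax (untested).
# Assumes d_full and d_merging share no column names other than id and date.
all_keys <- unique(rbindlist(list(
  d_full[, .(id, date)],
  d_merging[, .(id, date)]
)))

d_full <- d_merging[d_full[all_keys, on = c("id", "date")], on = c("id", "date")]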

Note: This particular data flow and resulting dataset were established by my predecessors. I can make a case to change it/pare it down, but before I do, I wanted to see if I could find a better solution with the pipeline as-is.

    I can't speak from direct experience, but based on what I've heard from other people, it might be worth investigating moving the merges to duckdb (there's an R interface pkg). – joran Jul 03 '23 at 19:52
  • Great, thanks for the tip. I'll look into that. – Sarah Hirsch Jul 03 '23 at 19:56
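Following the duckdb suggestion in the comments, here is my rough sketch of what pushing the merge step into DuckDB might look like (not benchmarked; the file and table names are placeholders):

library(DBI)
library(duckdb)

# Rough sketch: do the joins on disk inside DuckDB instead of in RAM.
# File and table names are placeholders.
con <- dbConnect(duckdb::duckdb(), dbdir = "pipeline.duckdb")

# Load two of the cleaned data sets as tables
dbExecute(con, "CREATE TABLE d_full    AS SELECT * FROM read_csv_auto('clean/dataset_01.csv')")
dbExecute(con, "CREATE TABLE d_merging AS SELECT * FROM read_csv_auto('clean/dataset_02.csv')")

# Full outer join on (id, date), materialised inside the database
dbExecute(con, "
  CREATE TABLE d_joined AS
  SELECT *
  FROM d_full
  FULL OUTER JOIN d_merging USING (id, date)
")

# Inspect a few rows; the full result can stay in the database or be exported
print(dbGetQuery(con, "SELECT * FROM d_joined LIMIT 10"))

dbDisconnect(con, shutdown = TRUE)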
