I have a bash data pipeline (on Debian GNU/Linux 10 [buster]) that runs 19 R cleaning scripts on 19 separate raw data sets and then merges them into one large dataset of 89 variables and roughly 300 million observations. I use data.table in the last (merge) script when I can:
d_full <- d_merging[d_full, on = c("id", "date")]
but occasionally I need a full (outer) join, and the merge() syntax is a bit more direct and lets me avoid creating unique keys, as suggested here:
d_full <- merge(d_full, d_merging, by = c("id", "date"), all = TRUE)
which should dispatch to the data.table method anyway. I call gc() after each merge.
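For context, here is a stripped-down sketch of what the merge script does. The file names and the fread() step are placeholders, not the real pipeline; the join line is the one shown above.

d_full <- data.table::fread("clean_01.csv")            # first cleaned data set (placeholder name)
for (f in sprintf("clean_%02d.csv", 2:19)) {            # remaining cleaned data sets (placeholder names)
  d_merging <- data.table::fread(f)
  d_full <- d_merging[d_full, on = c("id", "date")]     # or merge(d_full, d_merging, ..., all = TRUE) when a full join is needed
  rm(d_merging)
  gc()                                                  # garbage collection after each merge
}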
The code runs without issue except for "cannot allocate vector of size x Gb" error messages, which I hit earlier and earlier as the data grows. I can fix these by moving to a machine with more memory; however, I am already on a pretty large one (a Google Cloud n2-highmem-128 virtual machine with 128 vCPUs and 864 GB of memory). Even though it works, I would like to avoid depending on that much memory, especially since the data will only keep growing.
Is there a more memory-efficient method in R that I am not aware of? Would creating a unique key for each dataset, so I can use data.table's native keyed join, really make much of a difference? Or would I be better off doing this step in another language, perhaps SQL or bash? I am very new to bash, but happy to learn more if it would be worthwhile.
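To be concrete about the keyed option, this is roughly what I imagine it would look like (just a sketch, not benchmarked; the two join lines are alternatives for the two cases above):

library(data.table)
setkey(d_full, id, date)          # physically sort and key both tables on the join columns
setkey(d_merging, id, date)

d_full <- d_merging[d_full]                       # keyed right join, same result as the on = c("id", "date") version
d_full <- merge(d_full, d_merging, all = TRUE)    # keyed full join; merge() uses the shared key columns by default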
Note: This particular data flow and resulting dataset were established by my predecessors. I can make a case to change it/pare it down, but before I do, I wanted to see if I could find a better solution with the pipeline as-is.