I find myself working with distributed datasets (parquet) taking up to >100gb on disk space. Together they sum up to approx 2.4B rows and 24 cols.
I manage to work on it with R/Arrow, simple operations are quite good, but when it comes to perform a sort by an ID sparse across different files Arrow requires to pull data first (collect()) and no amount of Ram seems to be enough.
From working experience I know that SAS Proc Sort is mostly performed on disk rather than on Ram, I was wondering if there's an R package with similar approach.
Any idea how to approach the problem in R, rather than buy a server with 256gb of Ram? Thanks, R