
I find myself working with distributed datasets (parquet) taking up more than 100 GB of disk space. Together they sum to approximately 2.4B rows and 24 columns.

I manage to work on them with R/Arrow; simple operations perform quite well, but when it comes to sorting by an ID spread across different files, Arrow requires pulling the data into memory first (collect()), and no amount of RAM seems to be enough.
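Roughly, the pattern is the one sketched below (the path and column name are placeholders, not the actual ones):

```r
library(arrow)
library(dplyr)

# Multi-file parquet dataset, opened lazily (path is a placeholder)
ds <- open_dataset("path/to/parquet_dir")

sorted <- ds %>%
  arrange(id) %>%   # sort by an ID spread across the files
  collect()         # <- the sort only runs here, and pulling the
                    #    result into R is where memory runs out
```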

From work experience I know that SAS PROC SORT is mostly performed on disk rather than in RAM, so I was wondering if there is an R package with a similar approach.

Any idea how to approach the problem in R, rather than buying a server with 256 GB of RAM? Thanks, R

GrilloRob
  • It's not my area of expertise, but what about Unix command line utilities like `awk` that might be able to sort a text file? – thelatemail Aug 18 '22 at 23:02
  • E.g: https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort indicates RAM-buffered, parallelised `sort`ing of files on disk is possible. – thelatemail Aug 19 '22 at 04:32
  • I'm not very familiar with Unix programming; I'd like to stay in the R programming language. – GrilloRob Aug 19 '22 at 08:38
  • Use R's `system` or, better yet, the `processx` package to run one of those parallelized commands. I agree that it is frustrating to not be able to use parquet's on-disk filtering for strings; that seems highly inconvenient (though perhaps not trivial to fix). – r2evans Aug 19 '22 at 14:17
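A rough sketch of the `processx` suggestion from the comments above: calling a parallelized, disk-buffered GNU `sort` from R. This assumes the data has first been exported to a delimited text file and that GNU coreutils `sort` is on the PATH; the file names, delimiter, and resource flags are illustrative, not taken from the question.

```r
library(processx)

# Sort a comma-delimited export of the data on its first field (the ID),
# letting `sort` spill anything beyond the RAM buffer to temporary files.
run(
  "sort",
  args = c(
    "-t", ",",             # field separator
    "-k", "1,1",           # sort key: first column
    "--parallel=8",        # number of sorting threads
    "-S", "16G",           # in-memory buffer; the rest goes to disk
    "-T", "/tmp/sortwork", # directory for temporary chunks
    "-o", "sorted.csv",    # output file
    "big_table.csv"        # input file (exported from parquet beforehand)
  ),
  echo = TRUE
)
```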

0 Answers