
I find myself working with distributed datasets (parquet) taking up more than 100 GB of disk space. Together they sum to approximately 2.4B rows and 24 columns.

I manage to work on them with R/Arrow; simple operations perform quite well, but when it comes to sorting by an ID spread across different files, Arrow requires pulling the data into memory first (collect()), and no amount of RAM seems to be enough.
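Roughly, the pattern is the one sketched below (the path and column name are placeholders, not the actual ones):

```r
library(arrow)
library(dplyr)

# Multi-file parquet dataset, opened lazily (path is a placeholder)
ds <- open_dataset("path/to/parquet_dir")

sorted <- ds %>%
  arrange(id) %>%   # sort by an ID spread across the files
  collect()         # <- the sort only runs here, and pulling the
                    #    result into R is where memory runs out
```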

From work experience I know that SAS PROC SORT is mostly performed on disk rather than in RAM, so I was wondering if there is an R package with a similar approach.

Any idea how to approach the problem in R, rather than buying a server with 256 GB of RAM? Thanks, R

GrilloRob
  • It's not my area of expertise, but what about Unix command line utilities like `awk` that might be able to sort a text file? – thelatemail Aug 18 '22 at 23:02
  • E.g: https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort indicates RAM-buffered, parallelised `sort`ing of files on disk is possible. – thelatemail Aug 19 '22 at 04:32
  • I'm not very familiar with Unix programming; I'd like to stay in the R programming language. – GrilloRob Aug 19 '22 at 08:38
  • Use R's `system` or, better yet, the `processx` package to run one of those parallelized commands. I agree that it is frustrating to not be able to use parquet's on-disk filtering for strings; that seems highly inconvenient (though perhaps not trivial to fix). – r2evans Aug 19 '22 at 14:17
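A rough sketch of the `processx` suggestion from the comments above: calling a parallelized, disk-buffered GNU `sort` from R. This assumes the data has first been exported to a delimited text file and that GNU coreutils `sort` is on the PATH; the file names, delimiter, and resource flags are illustrative, not taken from the question.

```r
library(processx)

# Sort a comma-delimited export of the data on its first field (the ID),
# letting `sort` spill anything beyond the RAM buffer to temporary files.
run(
  "sort",
  args = c(
    "-t", ",",             # field separator
    "-k", "1,1",           # sort key: first column
    "--parallel=8",        # number of sorting threads
    "-S", "16G",           # in-memory buffer; the rest goes to disk
    "-T", "/tmp/sortwork", # directory for temporary chunks
    "-o", "sorted.csv",    # output file
    "big_table.csv"        # input file (exported from parquet beforehand)
  ),
  echo = TRUE
)
```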

0 Answers