Does the R arrow package have anything like the random access capability of the fst package?

Question

Our team is looking to integrate more of our R and Python work. As part of this effort, we are trying to move away from fst files (written with the fst package), which as far as I know cannot be read in Python without going through R (see "Is it possible to import .fst file in python"), and toward feather files (written with the arrow package), which Python can read natively.

The problem I'm running into is that we frequently use fst's random access functionality (http://www.fstpackage.org/#random-access). For example, we may have a table in an fst file with 100 million rows and 40 columns, 4 GB in size. The table is sorted by a column MktDate (which contains Dates). With fst, I can read in just the MktDate column (which is quick and doesn't take much memory), identify the rows I need for some date range, and then read in just that range of rows from the fst file. Is there any way to do that with feather? I've thought about splitting the data across the file system, so that one big file covering, say, 5000 dates is instead stored as 5000 dated files, but I'd prefer to stick with a single file if possible.
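To make the current workflow concrete, here is a minimal sketch of the fst pattern described above. The small generated table, the `Value` column, the temporary file path, and the date range are all made up for illustration; only `MktDate` and the sorted-by-date layout come from our real data.

```r
library(fst)

# Hypothetical small stand-in for the real 100-million-row table,
# sorted by MktDate as in our actual files
path  <- tempfile(fileext = ".fst")
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 1000)
df    <- data.frame(MktDate = rep(dates, each = 10), Value = rnorm(10000))
write_fst(df, path)

# Step 1: read only the date column (cheap: one column out of many)
d <- read_fst(path, columns = "MktDate")$MktDate

# Step 2: find the rows for the date range of interest; because the file
# is sorted by MktDate, these rows form one contiguous block
rows <- which(d >= as.Date("2021-03-01") & d <= as.Date("2021-03-31"))

# Step 3: read just that block of rows (all columns) from disk
chunk <- read_fst(path, from = min(rows), to = max(rows))
```

The part I'd like to reproduce with feather is step 3: reading an arbitrary contiguous row range without loading the rest of the file.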

I use parquet files in the arrow package, but I don't see how to do what you want with either arrow or parquet files. You could use data.table::fread to do what you want. The downside is that the CSV file would be very large, but you could save it as a csv.gz file. Not sure if python could handle a csv.gz file easily. — David F, Nov 14 '22 at 18:37
