@Jon Keane is correct. Using col_select
should allow you to achieve this.
(conbench2) pace@pace-desktop:~/dev/arrow/r$ /usr/bin/time -v Rscript -e "print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
Table
500000 rows x 3 columns
$f0 <int32>
$f7000 <int32>
$f32000 <int32>
Command being timed: "Rscript -e print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
User time (seconds): 1.16
System time (seconds): 0.51
Percent of CPU this job got: 150%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.11
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 262660
Average resident set size (kbytes): 0
...
That being said, feather may not be the best format when your entire file does not fit into memory. In that case, even if you open the file memory-mapped, you will still have to perform I/O. If you are repeatedly accessing the same small set of columns again and again, you should be fine: they will quickly be loaded into the page cache and the I/O cost will disappear.
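For example (a minimal sketch, assuming the same file and column names as the timing run above):

library(arrow)

cols <- c("f0", "f7000", "f32000")

# The first read pays the I/O cost for these three columns.
# as_data_frame = FALSE keeps the result as an Arrow Table instead of
# materializing an R data frame.
tbl <- read_feather("data.feather", col_select = cols, as_data_frame = FALSE)

# A second read of the same columns shortly afterwards should be served
# almost entirely from the OS page cache.
tbl_again <- read_feather("data.feather", col_select = cols, as_data_frame = FALSE)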
On the other hand, if you are accessing random columns each time, or if you expect large gaps of time to pass between runs (in which case the pages won't be in the page cache), you may want to consider parquet. Parquet will require more CPU time to compress/decompress but should reduce the amount of data you need to load. Of course, for relatively small amounts of data (e.g. 0.2% of that dataset) the difference in performance will probably be pretty small. Even then it may spare your hard disk: the table you describe takes up ~100GB as feather, and since "Most columns are {NA,1,2}" I would expect the data to be highly compressible.
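If you want to try that, the conversion is straightforward (a sketch, not a drop-in recipe: the file names are placeholders and "zstd" is an assumption, so substitute whichever codec your build of arrow was compiled with):

library(arrow)

# One-time conversion. Reading with as_data_frame = FALSE keeps the data
# as a (memory-mapped) Arrow Table, so in principle this streams through
# the page cache rather than materializing the whole ~100GB in RAM.
tbl <- read_feather("data.feather", as_data_frame = FALSE)
write_parquet(tbl, "data.parquet", compression = "zstd")  # "zstd" is an assumption

# read_parquet takes the same col_select argument, and only the requested
# column chunks are read (and decompressed) from disk.
read_parquet("data.parquet", col_select = c("f0", "f7000", "f32000"), as_data_frame = FALSE)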