
I have a matrix with 50K columns and 500K rows that I would like to subset by column name/index quickly and without using much memory (e.g. memory-mapped). Most columns are {NA, 1, 2}; a minority (1%) of the columns are quantitative or strings. Which file formats/frameworks in R are most suitable for this?

I thought I could use Feather for this, but it seems to load the entire file and uses almost as much memory as the data.table equivalent, even though I set as_data_frame = FALSE.

  f="/path/to/matrix.50Kcolums.500Krows.tsv"
  df <- data.table::fread(f) #
  arrow::write_feather(df,paste0(f,".feather"))  
  df <- read_feather(f.arrow, as_data_frame = FALSE) # uses almost as much memory as fread()
  df <- as.data.frame(df[,grep("columns_with_some_name", names(df))]) # this is what I need it to do fast and without using much memory. 

Any thoughts?

tafelplankje
  • Don’t use files for such data, use a database system. – Konrad Rudolph May 12 '21 at 10:05
  • I agree. I was wondering if there was a file format that would allow this. What database format would work well with this many columns? I read that SQLite might not be suitable for so many columns. – tafelplankje May 12 '21 at 10:11
  • You might have a look at [disk.frame](https://github.com/xiaodaigh/disk.frame), tested [here](https://stackoverflow.com/a/63929173/13513328) – Waldi May 12 '21 at 10:24
  • SQLite is essentially a database wrapper around a flat file. It’s a convenient solution when you want a database *API* (i.e.: SQL) without having the actual database system. You’re right that SQL generally deals better with relational, normalised data, whereas your data is extremely denormalised. You might want to check out NoSQL formats such as Google Bigtable or similar. Traditionally, data warehousing might also be a solution here but I’m not sure whether that’s still used much. Actually I might be wrong: Feather/HDF5/… might be better suited than a database here. – Konrad Rudolph May 12 '21 at 10:28
  • Have you tried using the `col_select` argument to select only the columns you're interested in with [`read_feather()`](https://arrow.apache.org/docs/r/reference/read_feather.html)? – Jon Keane May 12 '21 at 13:54

1 Answer


@Jon Keane is correct. Using `col_select` should allow you to achieve this.
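
For example, here is a minimal sketch of the question's grep()-style subsetting pushed down into the read. The path and the column-name pattern are placeholders taken from the question, and it assumes a recent arrow release where `col_select` accepts tidyselect helpers:

  library(arrow)

  f <- "/path/to/matrix.50Kcolums.500Krows.tsv.feather"  # Feather file written from the TSV

  # col_select takes a character vector of names or a tidyselect specification,
  # so columns are filtered while reading instead of after loading everything
  tbl <- read_feather(f,
                      col_select    = tidyselect::matches("columns_with_some_name"),
                      as_data_frame = FALSE)

  df <- as.data.frame(tbl)  # materialise only the selected columns

Here is a timed run of such a selective read on a test file: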

(conbench2) pace@pace-desktop:~/dev/arrow/r$ /usr/bin/time -v Rscript -e "print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
Table
500000 rows x 3 columns
$f0 <int32>
$f7000 <int32>
$f32000 <int32>
    Command being timed: "Rscript -e print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
    User time (seconds): 1.16
    System time (seconds): 0.51
    Percent of CPU this job got: 150%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.11
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 262660
    Average resident set size (kbytes): 0
    ...

That being said, Feather may not be the best format when your entire file does not fit into memory. In that case, even with memory mapping you still have to perform I/O. If you repeatedly access the same small set of columns you should be fine: they will quickly end up in the page cache and the I/O cost will disappear.

On the other hand, if you access different columns on each run, or if long gaps pass between runs (so the pages are no longer in the page cache), you may want to consider Parquet. Parquet requires more CPU time to compress and decompress, but it should reduce the amount of data you need to read. Of course, for relatively small reads (e.g. 0.2% of the dataset) the difference in performance will probably be small. Even then, Parquet may spare your hard disk: the table you describe takes up ~100 GB as Feather, and since "Most columns are {NA,1,2}" I would expect the data to be highly compressible.
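
If you go the Parquet route, a minimal sketch of the conversion and selective read might look like this (the output path is a placeholder; the column names are the ones from the timed example above):

  library(arrow)

  # df is the table already in memory (e.g. from data.table::fread());
  # the {NA, 1, 2} columns should compress very well
  write_parquet(df, "/path/to/matrix.parquet")

  # later runs read only the requested columns from disk
  subset_tbl <- read_parquet("/path/to/matrix.parquet",
                             col_select    = c("f0", "f7000", "f32000"),
                             as_data_frame = FALSE)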

Pace