
I have a very large R data frame (10 GB+, 100 million+ rows), the result of tokenizing an already large text dataset. I tried to save it using saveRDS:

saveRDS(dfname, file = 'dfname.rds', compress = FALSE)

I set compress = FALSE hoping that would save a bit of time, but after many hours the save is still not complete.

Is there a way of saving a gigantic DF in R without spending this much time?

HermitFF
  • If you prefer writing it in another format, use fwrite(...) from the data.table package. It writes pretty fast. See the function docs here: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/fwrite – Wasim Aftab Aug 30 '22 at 13:35
  • feather and parquet are worth reading about; I'm still new to them myself but am finding them useful – Quixotic22 Aug 30 '22 at 13:45
  • `data.table` is a beginning, but you really want to look at [arrow](https://arrow.apache.org/docs/r/). See: `write_parquet`. I'm not sure about writing, but it will make reading and interacting with that data many times faster – csgroen Aug 30 '22 at 13:59
  • I'd suggest closing this as a dupe of [Speeding up write.table](https://stackoverflow.com/q/10505605/903061). Many good options are mentioned there, and it would be nice to keep new options consolidated in one place. (`feather` is there, but no `arrow`. The `fst` package is mentioned there but not here, and looks promising. I'd also suggest adding `vroom::vroom_write` to the list, which at least offers multithreading.) – Gregor Thomas Aug 30 '22 at 14:56
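
For reference, a minimal, untested sketch of the write functions suggested in the comments above (assuming the data frame is called dfname; the output file names are placeholders):

# Multithreaded CSV writer from data.table
data.table::fwrite(dfname, "dfname.csv")

# Columnar Parquet file via the arrow package
arrow::write_parquet(dfname, "dfname.parquet")

# fst: a binary format designed for fast serialization of R data frames
fst::write_fst(dfname, "dfname.fst")

# vroom: multithreaded delimited writer (tab-separated by default)
vroom::vroom_write(dfname, "dfname.tsv")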

1 Answer


I made a naive benchmark of the methods mentioned in the comments. data.table::fwrite comes out ahead when the data is written uncompressed (but the resulting CSV file is huge), followed by arrow::write_parquet, then compressed csv.gz via data.table::fwrite, and finally saveRDS.

library(readr)
library(data.table)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(bench)

# 10 million random values in a 1,000-row x 10,000-column data frame
large_table <- as.data.frame(matrix(rnorm(1e7), ncol = 1e4))

test_saveRDS <- function(large_table) {
    saveRDS(large_table, "test.RDS")
    return(TRUE)
}
test_fwrite_uncomp <- function(large_table) {
    data.table::fwrite(large_table, "test_dt.csv")
    return(TRUE)
}
test_fwrite <- function(large_table) {
    data.table::fwrite(large_table, "test_dt.csv.gz")
    return(TRUE)
}
test_write_parquet <- function(large_table) {
    arrow::write_parquet(large_table, "test.parquet")
    return(TRUE)
}

bench::mark(
    test_saveRDS(large_table),
    test_fwrite(large_table),
    test_fwrite_uncomp(large_table),
    test_write_parquet(large_table)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test_saveRDS(large_table)          3.68s    3.68s     0.271    8.63KB     0   
#> 2 test_fwrite(large_table)           2.95s    2.95s     0.339  211.97KB     0   
#> 3 test_fwrite_uncomp(large_table)  656.3ms  656.3ms     1.52       88KB     0   
#> 4 test_write_parquet(large_table)    1.52s    1.52s     0.659    6.47MB     2.64

saveRDS uses the least memory, though.
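
Not benchmarked here, but reading the files back is where arrow/parquet tends to pay off (as noted in the comments). A rough sketch, reusing the file names from the benchmark above:

large_rds <- readRDS("test.RDS")
large_csv <- data.table::fread("test_dt.csv")         # returns a data.table
large_parquet <- arrow::read_parquet("test.parquet")  # returns a tibble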

csgroen