Combining 100 RDS/RData files into one large file - large file too big

Question

I have 100 .rds files, each approximately 2510 KB in size, and would like to bind them all together by row into one large data file.

So far I am using this:

memory.limit(size=1500000000)

files = list.files(path = "mypath", pattern = "*.rds")

dat_list = lapply(files, function (x) data.table(readRDS(x)))

all <- do.call("rbind", dat_list)

This seems to work but when running the final line I get a "cannot allocate vector of size..." error which I understand to be due to the final file I am trying to create being too large.

As you can see I have tried increasing the memory limit in R but this does not help. Is there any way I can get around this ? I have read of methods of combining csv files outside of R so the R memory is not affected - is there a similar method that can be used here?

I intend to convert this to a file - mapped big.matrix object later if that helps? I also have the same files in RData format.

Would appreciate any help anyone can offer!

Since you are using `data.table`, try `all<-rbindlist(dat_list)` rather than using `do.call("rbind")`. What exactly is stored in each of the RDS files? Are they nromal data.frames? Or sparcematrices? — MrFlick, Jul 22 '22 at 13:35
@MrFlick Thank you for your comment. They are data.frames. I just tried this and it has not resolved the problem regarding the memory. However the vector size in the error message has increased ? — fil0607, Jul 22 '22 at 13:45
As a workaround, you could save the Rds files sequentially into tsv or csv file(s) (e.g. by using data.table::fwrite; that would work for simple data structures, at least) - and either generate one big file by appending each file sequentially and then removing it from memory (e.g.using fwrite with append=TRUE), or saving one by one and concatenating them on the command line). If the resulting text file is still too large to fit into memory, you could then load it in chunks, use vroom, etc., to get the data back into R. — user12728748, Jul 22 '22 at 14:06
How large will the file be? Try sum(sapply(files, function(x) nrow(readRDS(x)))) — Hugh, Jul 22 '22 at 15:39
@user12728748 thank you very much for your comment, I am trying to achieve something like this! I tried your suggestion but the resulting CSV did not look like my data at all, instead was full of non-alphanumeric symbols ? I think the problem is in my conversion of the Rdata/RDS files to csv format — fil0607, Jul 22 '22 at 18:00

user12728748 · Answer 1 · 2022-07-23T09:26:30.297

As a follow-up to my comment:

You could save the Rds files sequentially into tsv or csv file(s) (e.g. by using data.table::fwrite; that would work for simple data structures, at least) - and either generate one big file by appending each file sequentially and then removing it from memory (e.g.using fwrite with append=TRUE), or saving one by one and concatenating them on the command line). If the resulting text file is still too large to fit into memory, you could then load it in chunks, use vroom, etc., to get the data back into R.

Below is an example showing the idea:

library(data.table)
# generate 100 Rds files as examples
invisible(lapply(1:100, 
    \(x) saveRDS(data.frame(matrix(rnorm(1e5), ncol=5, nrow = 2e4, 
        dimnames=list(NULL, paste0("col", 1:5)))), sprintf("file%03d.Rds", x))))

# files to concatenate
files <- list.files(pattern="file.*.Rds")

# assuming all files have the same column names, you could retrieve it from first file
cn <- colnames(readRDS(file=files[1]))
fwrite(data.table(t(cn)), file="outfile.csv", col.names = FALSE)
# sequentially load Rds files, save appending to newly created output file
invisible(lapply(files, 
    \(x) {fwrite(readRDS(x), file="outfile.csv", col.names = FALSE, append=TRUE)}))

# open with vroom
library(vroom)
vroom("outfile.csv")
#> Rows: 2000000 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): col1, col2, col3, col4, col5
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2,000,000 × 5
#>      col1     col2   col3   col4   col5
#>     <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
#>  1 -1.08  -0.366    1.26  -0.791  1.37 
#>  2  0.365  0.382   -0.742 -0.648 -0.800
#>  3  1.09  -0.618    0.480  1.64   0.155
#>  4  2.54   0.170   -0.654  0.537  0.140
#>  5 -0.331  0.262    0.156  0.360 -0.250
#>  6 -0.349 -0.00872  0.322  0.698  0.653
#>  7  0.353 -0.0634   1.28  -0.402 -1.54 
#>  8  1.35   1.15    -1.05   0.410 -0.183
#>  9 -0.499 -3.07     1.14  -0.878  1.11 
#> 10  0.479  1.30     0.718  1.17  -1.02 
#> # … with 1,999,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

^{Created on 2022-07-23 by the reprex package (v2.0.1)}

thank you for your answer! I have managed to create my data file as a csv, it is approx 11 GB in size. When I run vroom() on it, I get Error: std::bad_alloc ?? I do not know how to get around this. I am quite stuck so any suggestions would be greatly appreciated — fil0607, Jul 23 '22 at 19:26
It sounds like `vroom` does have issues reading files larger than memory; perhaps `arrow` could be a helpful alternative, see this thread: https://stackoverflow.com/questions/68628271/partially-read-really-large-csv-gz-in-r-using-vroom — user12728748, Jul 24 '22 at 07:04

M.Viking · Accepted Answer · 2022-07-22T15:34:08.337

0

Update: using the newer purrr::map_df() function, which combines map and bind_rows and returns a dataframe

https://purrr.tidyverse.org/reference/map.html

library(tidyverse)
my_files = list.files(pattern = "*.rds")
my_all <- map_df(my_files, read_rds)

...

The dplyr::bind_rows() function is explicitly an efficient implementation of the common pattern of do.call(rbind, dfs) for binding many data frames into one.

https://dplyr.tidyverse.org/reference/bind.html

library(tidyverse)
write_rds(iris, "iris1.rds") #write three sample files
write_rds(iris, "iris2.rds")
write_rds(iris, "iris3.rds")
my_files = list.files(pattern = "*.rds")
dat_list = lapply(my_files, function (x) read_rds(x)) #switched to only read_rds()
my_all <- do.call("bind_rows", dat_list) #switched to bind_rows()

edited Jul 22 '22 at 15:34

answered Jul 22 '22 at 14:13

M.Viking

5,067
4
17
33

Thank you for your response. I have tried this but it keeps making my machine crash? R just turns white and unresponsive ? Is there an adaptation of this that can reduce the memory usage when creating the large file ?? @M.Viking – fil0607 Jul 22 '22 at 17:47
1

Two ideas: what if you load the first 50, then the second 50... eg `my_files[1:50]` `my_files[51:100]` , and see if forcing garbage collection `gc()` between steps will help. – M.Viking Jul 22 '22 at 18:28

Combining 100 RDS/RData files into one large file - large file too big

2 Answers2