So, you want to split a file into pieces and reload them into a single dataframe.
There is a twist: to reduce file size it is wise to compress, but then the size of each piece is not entirely deterministic, so you may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The `split.file` function takes three arguments: a dataframe, the number of rows to write to each file, and the base filename. For instance, if the base name is "myfile", the files will be named "myfile00001.rds", "myfile00002.rds", etc. The function returns the number of files written.
The `join.files` function takes the base name and returns the reassembled dataframe.
Note:
- Play with the `rows` parameter to find the chunk size that fits in 100 MB. It depends on your data; for similar datasets a fixed size should do, but if you are dealing with very different datasets, a single fixed value will likely fail.
- When reading, you need twice as much memory as your dataframe occupies (a list of the smaller dataframes is read first, then rbinded).
- The file number is written with 5 digits, but you can change that. The goal is to keep the names in lexicographic order, so that when the files are concatenated, the rows come back in the same order as in the original dataframe.
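To take some of the guesswork out of tuning `rows`, you can check the sizes of the files actually produced. Here is a small sketch (`check.sizes` is a hypothetical helper, not part of the functions below; 100 MB is assumed to be the limit you are targeting):

```r
# After splitting, report the size of each chunk and warn if any
# exceeds the limit. file.size() returns sizes in bytes.
check.sizes <- function(basename, limit = 100 * 1024^2) {
  files <- list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename))
  sizes <- file.size(files)
  if (any(sizes >= limit))
    warning("Some chunks exceed the limit; decrease the 'rows' parameter.")
  setNames(sizes, files)
}
```

If the warning fires, halve `rows` and split again until every chunk fits.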
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows                     # number of full chunks
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%05d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {                 # write the remaining rows, if any
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%05d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m                                   # number of files written
}
join.files <- function(basename) {
  # Anchor the pattern so that only this base name's chunks are picked up.
  files <- sort(list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename)))
  do.call("rbind", lapply(files, readRDS))
}
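If you end up with many chunks, `do.call("rbind", ...)` can be slow on data frames; a variant using `data.table::rbindlist()` (an external package, shown here as an assumed alternative, not part of the original code) does the same concatenation faster:

```r
# Variant of join.files using data.table::rbindlist, which concatenates
# a list of data frames much faster than repeated rbind calls.
# Requires the data.table package to be installed.
join.files2 <- function(basename) {
  files <- sort(list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename)))
  as.data.frame(data.table::rbindlist(lapply(files, readRDS)))
}
```

The `as.data.frame()` wrapper is only there to hand back a plain dataframe instead of a data.table.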
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
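The last line should print `TRUE`. To tidy up afterwards, the chunk files from the example can be removed with the same anchored pattern:

```r
# Delete the chunk files created by the example ("myfile00001.rds", ...).
file.remove(list.files(pattern = "^myfile[0-9]{5}\\.rds$"))
```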