So, you want to split a file into pieces and reload them into a single dataframe.
There is a twist: to reduce file size it is wise to compress, but then the size of each piece is not entirely deterministic, so you may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The `split.file` function takes three arguments: a dataframe, the number of rows to write to each file, and the base filename. For instance, if the base name is "myfile", the files will be named "myfile00001.rds", "myfile00002.rds", etc. The function returns the number of files written.
The `join.files` function takes the base name and returns the reassembled dataframe.
Note:
- Play with the `rows` parameter to find the chunk size that fits in 100 MB. It depends on your data; for similar datasets a fixed size should do, but if you are dealing with very different datasets, a single fixed value will likely fail.
- When reading, you need twice as much memory as your dataframe occupies (a list of the smaller dataframes is read first, then rbinded).
- The file number is written with 5 digits, but you can change that. The goal is to keep the names in lexicographic order, so that when the files are concatenated, the rows come back in the same order as in the original dataframe.
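To take some of the guesswork out of tuning `rows`, you can check the sizes of the files actually produced. Here is a small sketch (`check.sizes` is a hypothetical helper, not part of the functions below; 100 MB is assumed to be the limit you are targeting):

```r
# After splitting, report the size of each chunk and warn if any
# exceeds the limit. file.size() returns sizes in bytes.
check.sizes <- function(basename, limit = 100 * 1024^2) {
  files <- list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename))
  sizes <- file.size(files)
  if (any(sizes >= limit))
    warning("Some chunks exceed the limit; decrease the 'rows' parameter.")
  setNames(sizes, files)
}
```

If the warning fires, halve `rows` and split again until every chunk fits.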
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows                     # number of full chunks
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%05d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {                 # write the remaining rows, if any
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%05d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m                                   # number of files written
}
join.files <- function(basename) {
  # Anchor the pattern so that only this base name's chunks are picked up.
  files <- sort(list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename)))
  do.call("rbind", lapply(files, readRDS))
}
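If you end up with many chunks, `do.call("rbind", ...)` can be slow on data frames; a variant using `data.table::rbindlist()` (an external package, shown here as an assumed alternative, not part of the original code) does the same concatenation faster:

```r
# Variant of join.files using data.table::rbindlist, which concatenates
# a list of data frames much faster than repeated rbind calls.
# Requires the data.table package to be installed.
join.files2 <- function(basename) {
  files <- sort(list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename)))
  as.data.frame(data.table::rbindlist(lapply(files, readRDS)))
}
```

The `as.data.frame()` wrapper is only there to hand back a plain dataframe instead of a data.table.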
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
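The last line should print `TRUE`. To tidy up afterwards, the chunk files from the example can be removed with the same anchored pattern:

```r
# Delete the chunk files created by the example ("myfile00001.rds", ...).
file.remove(list.files(pattern = "^myfile[0-9]{5}\\.rds$"))
```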