I have several folders, each with ca. 10,000 small csv files, that I'd like to quickly read into memory and stitch together into one data frame per folder. readr's read_csv can do this conveniently since it accepts vectors of file paths directly and does the combining for me. However, it crashes when I want to read more than a couple of files.
What is the best way around this issue?
Reproducible example, inspired by the one in the read_csv documentation:
library(readr)

continents <- c("africa", "americas", "asia", "europe", "oceania")

# paths to the example files shipped with readr
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr_example,
  FUN.VALUE = character(1)
)

# repeat the five paths to mimic a folder with 10,000 files
filepaths_10k <- rep(filepaths, 2000)
# works
read_csv(filepaths, id = "file")

# doesn't
read_csv(filepaths_10k, id = "file")
I get the following error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file '/usr/lib/rstudio/resources/CITATION': Too many open files
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") :
cannot open file '/home/simon/R/x86_64-pc-linux-gnu-library/3.6/readr/extdata/mini-gapminder-asia.csv': Too many open files
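The error suggests that read_csv opens a connection per path and eventually runs into the per-process limit on open file descriptors. As a quick check (a minimal, Linux-specific sketch; it assumes system() invokes a POSIX shell where ulimit is available), the current limit can be read from within R:

# query the soft limit on open file descriptors for this session (Linux/macOS)
system("ulimit -n", intern = TRUE)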
Edit: I have a version of the code using lapply, read_csv and rbindlist (see the sketch below), but it did not even finish when I let it run overnight. So speed is part of the story here, and some microbenchmarks I have run suggest that the approach above is much faster.
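For reference, this is roughly what I mean by the lapply/rbindlist version, applied to one folder (a minimal sketch; folder_path is a placeholder and my additional cleaning steps are omitted):

library(readr)
library(data.table)

# list all csv files in one folder, read them one by one, then bind the rows
folder_path <- "path/to/one/folder"  # placeholder
csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)
one_folder_df <- rbindlist(lapply(csv_files, read_csv))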
Edit 2: As per the suggestions (thanks!), I have run some more benchmarks myself. The main difference seems to be whether I bind the files together "explicitly" or let readr do it under the hood. Explicitly telling readr not to use lazy evaluation makes no difference in speed, and it does not fix the error either, so the suggestion that this could be an OS-specific issue may well be correct (I am on Ubuntu 20.04). Since readr switched the default back to eager evaluation, this is expected (I had to check anyway...). Also, I am not sure I want lazy evaluation in the first place, since I am combining all files and doing some more cleaning steps anyhow.
library(dplyr)          # bind_rows(), %>%
library(purrr)          # map_df()
library(data.table)     # rbindlist()
library(microbenchmark)

microbenchmark(
  l_apply_rbindlist = lapply(filepaths, read_csv) %>% rbindlist(),
  l_apply_bindrows  = lapply(filepaths, read_csv) %>% bind_rows(),
  read_csv_map      = map_df(filepaths, ~ read_csv(.)),
  readr_default     = read_csv(filepaths),
  readr_eager_expl  = read_csv(filepaths, lazy = FALSE),
  times = 10,
  check = "equivalent"
)
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval cld
 l_apply_rbindlist 214.08594 219.90338 223.36077 222.36070 227.47078 232.48656    10   b
  l_apply_bindrows 225.47465 232.00539 235.62815 234.78071 239.32159 249.53793    10   b
      read_csv_map 215.86775 225.37601 229.41726 231.70719 232.17263 240.49416    10   b
     readr_default  57.66125  59.77418  77.79516  60.41160  69.10050 214.88023    10  a
  readr_eager_expl  56.21319  57.05472  61.06905  62.67377  63.66434  64.61471    10  a