I have several folders, each with ca. 10,000 small csv files, that I'd like to quickly read into memory and stitch together into one data frame per folder. readr's read_csv can do this conveniently since it accepts vectors of file paths directly and does the combining for me. However, it crashes when I want to read more than a couple of files.
What is the best way around this issue?
Reproducible example, inspired by the one in the read_csv documentation:
library(readr)

continents <- c("africa", "americas", "asia", "europe", "oceania")

# paths to the example files shipped with readr
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr_example,
  FUN.VALUE = character(1)
)

# repeat the five paths to mimic a folder with 10,000 files
filepaths_10k <- rep(filepaths, 2000)
# works
read_csv(filepaths, id = "file")

# doesn't
read_csv(filepaths_10k, id = "file")
I get the following error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file '/usr/lib/rstudio/resources/CITATION': Too many open files
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") :
cannot open file '/home/simon/R/x86_64-pc-linux-gnu-library/3.6/readr/extdata/mini-gapminder-asia.csv': Too many open files
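The error suggests that read_csv opens a connection per path and eventually runs into the per-process limit on open file descriptors. As a quick check (a minimal, Linux-specific sketch; it assumes system() invokes a POSIX shell where ulimit is available), the current limit can be read from within R:

# query the soft limit on open file descriptors for this session (Linux/macOS)
system("ulimit -n", intern = TRUE)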
Edit: I have a version of the code using lapply, read_csv and rbindlist (see the sketch below), but it did not even finish when I let it run overnight. So speed is part of the story here, and some microbenchmarks I have run suggest that the approach above is much faster.
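For reference, this is roughly what I mean by the lapply/rbindlist version, applied to one folder (a minimal sketch; folder_path is a placeholder and my additional cleaning steps are omitted):

library(readr)
library(data.table)

# list all csv files in one folder, read them one by one, then bind the rows
folder_path <- "path/to/one/folder"  # placeholder
csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)
one_folder_df <- rbindlist(lapply(csv_files, read_csv))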
Edit 2: As per the suggestions (thanks!), I have run some more benchmarks myself. The main difference seems to be whether I bind the files together "explicitly" or let readr do it under the hood. Explicitly telling readr not to use lazy evaluation makes no difference in speed, and it does not fix the error either, so the suggestion that this could be an OS-specific issue may well be correct (I am on Ubuntu 20.04). Since readr switched the default back to eager evaluation, this is expected (I had to check anyway...). Also, I am not sure I want lazy evaluation in the first place, since I am combining all files and doing some more cleaning steps anyhow.
library(dplyr)          # bind_rows(), %>%
library(purrr)          # map_df()
library(data.table)     # rbindlist()
library(microbenchmark)

microbenchmark(
  l_apply_rbindlist = lapply(filepaths, read_csv) %>% rbindlist(),
  l_apply_bindrows  = lapply(filepaths, read_csv) %>% bind_rows(),
  read_csv_map      = map_df(filepaths, ~ read_csv(.)),
  readr_default     = read_csv(filepaths),
  readr_eager_expl  = read_csv(filepaths, lazy = FALSE),
  times = 10,
  check = "equivalent"
)
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval cld
 l_apply_rbindlist 214.08594 219.90338 223.36077 222.36070 227.47078 232.48656    10   b
  l_apply_bindrows 225.47465 232.00539 235.62815 234.78071 239.32159 249.53793    10   b
      read_csv_map 215.86775 225.37601 229.41726 231.70719 232.17263 240.49416    10   b
     readr_default  57.66125  59.77418  77.79516  60.41160  69.10050 214.88023    10  a
  readr_eager_expl  56.21319  57.05472  61.06905  62.67377  63.66434  64.61471    10  a