
I have a large data set that I have to read into R fairly often. There are ~700,000 rows and 129 columns, and the data is spread across 258 individual files ranging in size from 10 KB to 24,000 KB, each with the same format.
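For reference, the benchmarks below assume that the file paths and column specification are already set up along these lines (the directory names here are placeholders, not my real paths):

library(readr)

# The 258 individual files all sit in one folder (illustrative path)
multiple_filepaths = list.files("C:/data/individual", pattern = "\\.csv$", full.names = TRUE)

# The same data re-saved as one combined 700,000 row file (illustrative path)
single_filepath = "C:/data/combined/all_data.csv"

# Explicit column specification so readr never has to guess types
# (simplified here; my real spec lists all 129 columns)
coltypes = cols(.default = col_character())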

Whilst trying to speed up import of this data, I noticed that it is significantly faster to import the data if it is all merged into a single 700,000 row .csv file:

library(readr)
library(microbenchmark)

multiple = function() { # Read data as it is supplied, across 258 files
  lapply(multiple_filepaths,
         function(x) read_csv(x, show_col_types = FALSE, progress = FALSE,
                              col_types = coltypes))
}

single = function() { # Read data after it has been re-saved as a single file
  read_csv(single_filepath,
           show_col_types = FALSE, progress = FALSE, col_types = coltypes)
}

microbenchmark(multiple(), single(), times = 10)

Unit: seconds
       expr       min        lq      mean    median        uq       max neval
 multiple() 40.811196 40.988007 41.330675 41.279320 41.665536 41.948959    10
   single()  5.258331  5.401542  6.016591  5.749491  6.668737  6.947849    10

As you can see, it is roughly seven times slower to import from the separate files. This seems like quite a drastic difference for what amounts to reading the same data. Why is it so much slower across multiple files?

Furthermore (since I have no choice about being supplied the files separately), is there a way to get the multiple-file import closer to the speed of the single-file import? I tried concatenating the files before import, but this was also slow using CMD's copy command (perhaps UNIX cat would be better, but I am on a Windows machine).
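For what it's worth, the most direct alternative I can see (not included in the benchmark above) is to pass all of the paths in a single call: readr 2.0+ is built on vroom and accepts a vector of file paths, binding the files itself rather than requiring 258 separate calls. A rough sketch:

# Not benchmarked above: one call over all 258 paths, letting readr/vroom
# handle the binding instead of lapply() plus 258 read_csv() calls
all_data = read_csv(multiple_filepaths,
                    show_col_types = FALSE, progress = FALSE, col_types = coltypes)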

Edit: fread

fread, as suggested by @PaulS, is still slower than the single-file method, and it exhibits the same slower behaviour on multiple files as opposed to one.
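The fread variants look roughly like this (a simplified sketch; column-type handling omitted):

library(data.table)

multiplefread = function() { # 258 individual files, bound into one table
  rbindlist(lapply(multiple_filepaths, fread))
}

singlefread = function() { # the single combined file
  fread(single_filepath)
}

With times = 3, the results were: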

Unit: seconds
            expr       min        lq      mean    median        uq       max neval
      multiple() 40.972344 41.136988 41.319579 41.301631 41.493197 41.684763     3
        single()  6.224543  6.487988  6.993778  6.751433  7.378395  8.005358     3
 multiplefread()  19.16834  19.23708  19.33322  19.30583  19.41566  19.52543     3
   singlefread()  2.195246  3.631488  5.053056  5.067730  6.481960  7.896188     3

Could the issue be something to do with the disk hardware performing better on a single file?

Richard Berry
  • Maybe you should try `data.table::fread`, which is thought to be much faster than `read_csv`. – PaulS Aug 16 '22 at 11:19
  • I illustrated the issue with csv files because it was a bit simpler to combine the files etc., but my actual use case is fixed-width files, and in general the effort of manipulating the data into csv format for fread seemed to outweigh its benefits. So a helpful suggestion, but not in my case, I'm afraid. – Richard Berry Aug 16 '22 at 11:24
  • 1
    I guess that `fread` can read fixed-width files: see: [Faster way to read fixed-width files](https://stackoverflow.com/questions/24715894/faster-way-to-read-fixed-width-files). – PaulS Aug 16 '22 at 11:42
  • Thanks for that, I've now tried fread but it exhibits the same slower behavior for multiple files (see edit) – Richard Berry Aug 16 '22 at 14:35
  • I don't know that there's a single answer. Each file should be stored as a contiguous block of disk space, and yes it's faster for your hardware to read a giant contiguous block than 250 blocks. And there's overhead in the function calls--it's good you're passing in the column types, but there's still additional overhead in making 250 function calls instead of 1. And each one needs to allocate memory and create an object (a list item)... – Gregor Thomas Aug 16 '22 at 15:09
  • However, one other function to try: `vroom` (which is used under-the-hood by `read_csv`) offers support for [reading multiple files](https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files). Benchmarking vroom can be tricky since it does some "lazy reading", [and it still may not be as fast as data.table](https://vroom.r-lib.org/articles/benchmarks.html#reading-multiple-delimited-files). – Gregor Thomas Aug 16 '22 at 15:15
