I have a large data set that I have to read in to R fairly often. There are ~700,000 rows and 129 columns, and the data is spread across 258 individual files ranging in size from 10 - 24000KB, each with the same format.
Whilst trying to speed up import of this data, I noticed that it is significantly faster to import the data if it is all merged into a single 700,000 row .csv
file:
library(readr)
library(microbenchmark)
multiple = function() { #Read data as it is supplied, across 258 files
lapply(multiple_filepaths,
function(x)
read_csv(x, show_col_types = FALSE, progress = FALSE, col_types = coltypes))
}
single = function() { #Read data after it has been re-saved as a single file
read_csv(single_filepath,
show_col_types = FALSE, progress = FALSE, col_types = coltypes)
}
microbenchmark(multiple(), single(), times = 10)
Unit: seconds
expr min lq mean median uq max neval
multiple() 40.811196 40.988007 41.330675 41.279320 41.665536 41.948959 10
single() 5.258331 5.401542 6.016591 5.749491 6.668737 6.947849 10
As you can see, it is over 5 times slower to import from separate files. This seems like quite a drastic difference for what amounts to the same operation. Why is it so much slower across multiple files?
Furthermore (since I have no choice in being supplied files separately) is there a way to get the multiple file import closer to that of the single file? I tried concatenating files before import, but this was also slow using CMD copy
(perhaps UNIX cat
would be better, but I am on a windows machine)
Edit: fread
Trying fread as suggested by @PaulS is still slower than the single file method. fread also exhibits the same slower behaviour on multiple files as opposed to one:
Unit: seconds
expr min lq mean median uq max neval
multiple() 40.972344 41.136988 41.319579 41.301631 41.493197 41.684763 3
single() 6.224543 6.487988 6.993778 6.751433 7.378395 8.005358 3
multiplefread() 19.16834 19.23708 19.33322 19.30583 19.41566 19.52543 3
singlefread() 2.195246 3.631488 5.053056 5.067730 6.481960 7.896188 3
Could the issue something to do with the disk hardware performing better on single files?