
I have 720 csv files which are 7.4 GB in total (or a 1.5 GB zip file, compressed). What's the most efficient way to read the data into R?

From my research, `fread` from data.table seems like the fastest way to do so. However, it's still taking a long time (as I write this question, `fread` has been running for over 40 minutes). Is there a faster way to do it?


(Edit to delete my solution).

Katie
  • It might be possible that you're using `fread` less efficiently than you could be, but it's impossible to say without specific examples. More generally, the only thing more "efficient" for data this size will likely be to put it all in a db of some form, and just query the parts you want/need from R as you need them. – joran Jan 09 '19 at 19:17
  • I guess that `data.table::fread` right now is the fastest way (as it reads the file in parallel). Slow performance might be due to wrong code. Can you post yours? How are you using the loop? Do you have enough memory? – pogibas Jan 09 '19 at 19:17
  • Do you need them all in one data frame? That can be hard. An efficient way would be to read them one by one, do the computations, and store some aggregated data which can later be combined into a data frame. – lebatsnok Jan 09 '19 at 19:18
  • Sorry for the lack of the details; I just added more info to the question! – Katie Jan 09 '19 at 19:28
  • Do you need all columns from `bikes`? Can you try loading all files, but only the first 10 rows (to make sure that it works)? – pogibas Jan 09 '19 at 19:30
  • `data.table::rblindlist` will likely be quite a bit faster than `rbind` (and you won't need the `do.call()` bit that way). But I suspect it will still take at least on the order of minutes. – joran Jan 09 '19 at 19:33
  • @joran do you mean `data.table::rbindlist`? It would be tempting to try to do something in parallel instead of `lapply` (e.g. `parallel::parLapply()`), but if the problem is I/O-bound (all data are on the same disk drive), I'm not sure that's going to help ... e.g. see https://stackoverflow.com/questions/48460365/parallel-i-o-read-files-from-disk-in-r – Ben Bolker Jan 09 '19 at 19:40
  • `fread('cat *.csv')` might be faster than messing with `do.call(rbind(...))` or `rbindlist(...)`. – Clayton Stanley Jan 09 '19 at 19:40
  • @ClaytonStanley: what about multiple header rows in the middle of the concatenated files? @Katie: if you have the RAM for extra copies, doing just the `lapply()` first and then `rbind()`ing would let you be more precise about where the bottleneck is. Substituting `plyr::llply()` for `lapply()` would let you print a progress bar for the reading process ... – Ben Bolker Jan 09 '19 at 19:42
  • @BenBolker e.g., https://stackoverflow.com/a/35786076/822162 – Clayton Stanley Jan 09 '19 at 19:45
  • @BenBolker Yep, that's a typo. Not sure what a blind list would be! ;) – joran Jan 09 '19 at 19:46
  • A possibility would be to do the reading in parallel with foreach and then bind_rows after, which would speed up the reading in of the data: `results = foreach(csv = bikes, .combine = bind_rows) %dopar% { read_csv(csv) }`. Also, bind_rows is fast and robust in case every csv isn't the same. – zcoleman Jan 09 '19 at 20:51
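
For concreteness, here are a few hedged sketches of the approaches suggested in the comments. All file paths, object names, and column names below are hypothetical, and each sketch assumes the 720 files share the same schema. First, the plain `lapply()` + `fread()` + `rbindlist()` route that joran and pogibas describe, including pogibas's quick 10-row sanity check:

```r
library(data.table)

# Hypothetical directory; adjust the path and pattern to your files.
files <- list.files("data/bikes", pattern = "\\.csv$", full.names = TRUE)

# Quick sanity check: read only the first 10 rows of every file.
head_list <- lapply(files, fread, nrows = 10)

# Full read: fread() each file, then stack once with rbindlist(),
# which avoids the repeated copying of do.call(rbind, ...).
bikes <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
```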
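
Clayton Stanley's `fread('cat *.csv')` idea, streaming the shell's concatenation into a single `fread()` call; the `awk` variant is one way to address Ben Bolker's point about the repeated header rows (file names are again made up):

```r
library(data.table)

# Concatenate all files in the shell and parse the stream in one go.
bikes <- fread(cmd = "cat data/bikes/*.csv")

# The line above keeps every file's header row as a data row. One workaround:
# drop the first line of each file with awk and supply the column names
# yourself, taken here from any single file.
cols  <- names(fread("data/bikes/file001.csv", nrows = 0))
bikes <- fread(cmd = "awk 'FNR > 1' data/bikes/*.csv",
               header = FALSE, col.names = cols)
```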
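
Ben Bolker's `parallel::parLapply()` idea, sketched with four local workers; as he notes, if the job is I/O-bound on a single disk this may not help much:

```r
library(data.table)
library(parallel)

files <- list.files("data/bikes", pattern = "\\.csv$", full.names = TRUE)

# Four worker processes; each uses single-threaded fread so the
# workers don't compete for the same threads.
cl <- makeCluster(4)
clusterEvalQ(cl, { library(data.table); setDTthreads(1) })
dt_list <- parLapply(cl, files, fread)
stopCluster(cl)

bikes <- rbindlist(dt_list, use.names = TRUE, fill = TRUE)
```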
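
zcoleman's `foreach`/`%dopar%` version, fleshed out with a registered backend (the core count is arbitrary). `bind_rows()` fills missing columns with NA, which is what makes it tolerant of files that don't match exactly:

```r
library(foreach)
library(doParallel)
library(readr)   # read_csv()
library(dplyr)   # bind_rows()

files <- list.files("data/bikes", pattern = "\\.csv$", full.names = TRUE)

# Register a backend so %dopar% actually runs in parallel.
registerDoParallel(cores = 4)

# Each worker reads one file; results are combined row-wise as they return.
bikes <- foreach(f = files, .combine = dplyr::bind_rows,
                 .packages = c("readr", "dplyr")) %dopar% {
  read_csv(f)
}

stopImplicitCluster()
```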
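
And joran's database suggestion, sketched with RSQLite: load the files into one table once, then later sessions query only the slice they need instead of re-reading 7.4 GB (table, file, and column names are invented for illustration):

```r
library(DBI)
library(RSQLite)
library(data.table)

files <- list.files("data/bikes", pattern = "\\.csv$", full.names = TRUE)

# One-time load into a single SQLite table.
con <- dbConnect(RSQLite::SQLite(), "bikes.sqlite")
for (f in files) {
  dbWriteTable(con, "bikes", fread(f), append = TRUE)
}

# Later: pull only what you need (start_date is a hypothetical column).
recent <- dbGetQuery(con, "SELECT * FROM bikes WHERE start_date >= '2018-01-01'")
dbDisconnect(con)
```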

0 Answers