
I have a list of 1,000 CSV files to read in R and I discovered the "readbulk" package. The package works great because it reads each file and merges it with the others to form a single data frame. But with about 1,000 CSV files to read, I'm afraid that reading and merging them one by one this way will crash my computer. Any suggestions? Thanks!

  • Suggestion: try it. No, I'm not being sarcastic or inconsiderate. There might not be a problem, in which case any time we spend speculating (on files we know nothing about) will be wasted. If you've already experienced a problem, then please provide something [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and somebody might be able to give you relevant advice. – r2evans Jul 03 '18 at 19:11
  • (1000s of files can mean anything. I have a weekly task that reads in ten times that many files without a hiccup, so the four-digit number of files does not necessarily mean anything.) – r2evans Jul 03 '18 at 19:12
  • Does it take you more than 8 hours to read almost 2,000 files? Because in my case, after 8 hours it is still reading 700 files and it gets slower by the hour. – A. Quinones Jul 08 '18 at 19:05
  • No, well under that. By chance, are you iteratively `rbind`ing it as you read files in a `for` loop? Your *"slower by the hour"* is an indicator. Post your code in your question, I'll take a look. – r2evans Jul 08 '18 at 19:33
  • This is what I've been doing: `Data <- read_bulk(directory = "E:/SafeGraph201709PuertoRico", subdirectories = TRUE)` – A. Quinones Jul 09 '18 at 15:26
  • Yeah, looking at [the code for `read_bulk`](https://github.com/PascalKieslich/readbulk/blob/master/R/read_bulk.R#L146), it's doing it wrong and its performance will only worsen with more files. You need to read the files individually into a list (such as `lst <- lapply(files, read.csv)`) and at the end do a single `do.call(rbind, lst)`; a sketch of this pattern is shown after these comments. See for example https://stackoverflow.com/a/23555961/3358272. – r2evans Jul 09 '18 at 16:23
  • Thank you so much! – A. Quinones Jul 09 '18 at 16:55
  • FYI: I opened [an issue](https://github.com/PascalKieslich/readbulk/issues/1) on behalf of this question. – r2evans Jul 09 '18 at 17:24
  • Thanks a lot! I appreciate all your help! – A. Quinones Jul 09 '18 at 20:22
  • Did you try the solution at https://stackoverflow.com/q/11433432/3358272? I think this is effectively a duplicate of it (lacking code-changes in `readbulk::read_bulk`), and possible solutions are indicated in a couple of the answers. – r2evans Jul 09 '18 at 22:21
  • Thanks for raising the issue and thanks, r2evans, for the suggestions. So far, we haven't really optimized `read_bulk` for speed, as our use cases (reading experiment data files - which are in our case usually relatively small each - from a few hundred participants in the lab) worked fine. I am currently teaching a workshop at a summer school but will look at the issue by the end of the week and will optimize `read_bulk` to solve the problem. Will post it here once it is done. – PascalKieslich Jul 09 '18 at 21:43
  • @r2evans Thank you so much!!! Your suggestion worked to perfection! You're the best! – A. Quinones Jul 10 '18 at 14:02
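
A minimal sketch of the pattern r2evans describes (read everything into a list, then bind once), assuming the files can be read with `read.csv`. The directory path comes from the question; the file pattern and `recursive = TRUE` (to mirror `subdirectories = TRUE`) are assumptions to adapt to your data:

```r
# Sketch only: path taken from the question; pattern, recursive = TRUE and
# the choice of read.csv are assumptions to adjust for your files.
files <- list.files("E:/SafeGraph201709PuertoRico",
                    pattern = "\\.csv$",
                    recursive = TRUE,   # mirrors subdirectories = TRUE
                    full.names = TRUE)

# Read each file into its own list element; nothing is grown inside the loop.
lst <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Combine all pieces in a single call instead of ~1,000 incremental rbinds.
Data <- do.call(rbind, lst)

# If the data.table package is available, rbindlist() is usually faster still:
# Data <- data.table::rbindlist(lst, use.names = TRUE, fill = TRUE)
```

Growing a data frame with repeated `rbind` calls copies all previously read rows on every iteration, which is why the run gets slower as more files accumulate; binding once at the end avoids that cost.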

0 Answers