
I have a list of 1,000 CSV files to read in R and I discovered the "readbulk" package. The package works great because it reads each file and merges it with the others to form a single data frame. But with about 1,000 CSV files to read, I'm afraid that reading and merging them one by one this way will crash my computer. Any suggestions? Thanks!

  • Suggestion: try it. No, I'm not being sarcastic or inconsiderate. There might not be a problem, in which case any time we spend speculating (on files we know nothing about) will be wasted. If you've already experienced a problem, then please provide something [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and somebody might be able to give you relevant advice. – r2evans Jul 03 '18 at 19:11
  • (1000s of files can mean anything. I have a weekly task that reads in ten times that many files without a hiccup, so the four-digit number of files does not necessarily mean anything.) – r2evans Jul 03 '18 at 19:12
  • Does it take you more than 8 hours to read almost 2,000 files? Because in my case, after 8 hours it is still reading 700 files and it gets slower by the hour. – A. Quinones Jul 08 '18 at 19:05
  • No, well under that. By chance, are you iteratively `rbind`ing it as you read files in a `for` loop? Your *"slower by the hour"* is an indicator. Post your code in your question, I'll take a look. – r2evans Jul 08 '18 at 19:33
  • This is what I've been doing: `Data <- read_bulk(directory = "E:/SafeGraph201709PuertoRico", subdirectories = TRUE)` – A. Quinones Jul 09 '18 at 15:26
  • Yeah, looking at [the code for `read_bulk`](https://github.com/PascalKieslich/readbulk/blob/master/R/read_bulk.R#L146), it's doing it wrong and its performance will only worsen with more files. You need to read the files individually into a list (such as `lst <- lapply(files, read.csv)`) and at the end do a single `do.call(rbind, lst)`; a sketch of this pattern is shown after these comments. See for example https://stackoverflow.com/a/23555961/3358272. – r2evans Jul 09 '18 at 16:23
  • Thank you so much! – A. Quinones Jul 09 '18 at 16:55
  • FYI: I opened [an issue](https://github.com/PascalKieslich/readbulk/issues/1) on behalf of this question. – r2evans Jul 09 '18 at 17:24
  • Thanks a lot! I appreciate all your help! – A. Quinones Jul 09 '18 at 20:22
  • Did you try the solution at https://stackoverflow.com/q/11433432/3358272? I think this is effectively a duplicate of it (lacking code-changes in `readbulk::read_bulk`), and possible solutions are indicated in a couple of the answers. – r2evans Jul 09 '18 at 22:21
  • Thanks for raising the issue and thanks, r2evans, for the suggestions. So far, we haven't really optimized `read_bulk` for speed, as our use cases (reading experiment data files - which are in our case usually relatively small each - from a few hundred participants in the lab) worked fine. I am currently teaching a workshop at a summer school but will look at the issue by the end of the week and will optimize `read_bulk` to solve the problem. Will post it here once it is done. – PascalKieslich Jul 09 '18 at 21:43
  • @r2evans Thank you so much!!! Your suggestion worked to perfection! You're the best! – A. Quinones Jul 10 '18 at 14:02
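
A minimal sketch of the pattern r2evans describes (read everything into a list, then bind once), assuming the files can be read with `read.csv`. The directory path comes from the question; the file pattern and `recursive = TRUE` (to mirror `subdirectories = TRUE`) are assumptions to adapt to your data:

```r
# Sketch only: path taken from the question; pattern, recursive = TRUE and
# the choice of read.csv are assumptions to adjust for your files.
files <- list.files("E:/SafeGraph201709PuertoRico",
                    pattern = "\\.csv$",
                    recursive = TRUE,   # mirrors subdirectories = TRUE
                    full.names = TRUE)

# Read each file into its own list element; nothing is grown inside the loop.
lst <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Combine all pieces in a single call instead of ~1,000 incremental rbinds.
Data <- do.call(rbind, lst)

# If the data.table package is available, rbindlist() is usually faster still:
# Data <- data.table::rbindlist(lst, use.names = TRUE, fill = TRUE)
```

Growing a data frame with repeated `rbind` calls copies all previously read rows on every iteration, which is why the run gets slower as more files accumulate; binding once at the end avoids that cost.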

0 Answers