I have a list of 1,000 CSV files to read into R, and I discovered the "readbulk" package. The package works well: it reads each file and merges it with the others into a single data frame. But with about 1,000 CSV files to read, I'm afraid that reading and merging them one by one this way will crash my computer. Any suggestions? Thanks!
- Suggestion: try it. No, I'm not being sarcastic or inconsiderate. There might not be a problem, in which case any time we spend speculating (on files we know nothing about) will be wasted. If you've already experienced a problem, then please provide something [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and somebody might be able to give you relevant advice. – r2evans Jul 03 '18 at 19:11
- (1000s of files can mean anything. I have a weekly task that reads in ten times that many files without a hiccup, so the four-digit number of files does not necessarily mean anything.) – r2evans Jul 03 '18 at 19:12
- Does it take you more than 8 hours to read almost 2,000 files? Because in my case, after 8 hours it is still reading 700 files, and it gets slower by the hour. – A. Quinones Jul 08 '18 at 19:05
- No, well under that. By chance, are you iteratively `rbind`ing it as you read files in a `for` loop? Your *"slower by the hour"* is an indicator. Post your code in your question, I'll take a look. – r2evans Jul 08 '18 at 19:33
- This is what I've been doing: `Data <- read_bulk(directory = "E:/SafeGraph201709PuertoRico", subdirectories = TRUE)` – A. Quinones Jul 09 '18 at 15:26
- Yeah, looking at [the code for `read_bulk`](https://github.com/PascalKieslich/readbulk/blob/master/R/read_bulk.R#L146), it's doing it wrong and its performance will only worsen with more files. You need to read the files individually into a list (such as `lst <- lapply(files, read.csv)`) and at the end do a single `do.call(rbind, lst)`. See for example https://stackoverflow.com/a/23555961/3358272. (A sketch of this pattern follows the comments below.) – r2evans Jul 09 '18 at 16:23
- Thank you so much! – A. Quinones Jul 09 '18 at 16:55
- FYI: I opened [an issue](https://github.com/PascalKieslich/readbulk/issues/1) on behalf of this question. – r2evans Jul 09 '18 at 17:24
- Thanks a lot! I appreciate all your help! – A. Quinones Jul 09 '18 at 20:22
- Did you try the solution at https://stackoverflow.com/q/11433432/3358272? I think this is effectively a duplicate of it (lacking code changes in `readbulk::read_bulk`), and possible solutions are indicated in a couple of the answers. – r2evans Jul 09 '18 at 22:21
- Thanks for raising the issue, and thanks, r2evans, for the suggestions. So far we haven't really optimized `read_bulk` for speed, as our use cases (reading experiment data files, which in our case are usually relatively small, from a few hundred participants in the lab) worked fine. I am currently teaching a workshop at a summer school but will look at the issue by the end of the week and will optimize `read_bulk` to solve the problem. Will post it here once it is done. – PascalKieslich Jul 09 '18 at 21:43
- @r2evans Thank you so much!!! Your suggestion worked to perfection! You're the best! – A. Quinones Jul 10 '18 at 14:02
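
For reference, here is a minimal sketch of the read-into-a-list, bind-once pattern r2evans describes in the comments. The directory path mirrors the one mentioned above; the use of `read.csv()` (and `data.table` in the alternative) is an assumption, so adjust it to match the actual files.

```r
# Collect all CSV paths, including subdirectories (mirrors subdirectories = TRUE).
files <- list.files("E:/SafeGraph201709PuertoRico",
                    pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)

# Read each file into a list element; no binding happens yet, so this stays fast.
lst <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Bind everything in a single pass (assumes all files share the same columns)
# instead of growing the result row by row.
Data <- do.call(rbind, lst)

# Possible alternative (often faster on many files, assuming data.table is installed):
# library(data.table)
# Data <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
```

The key difference from iterative `rbind`ing is that the accumulated data frame is not copied on every iteration, so run time grows roughly linearly with the number of files rather than degrading as more files are read.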