I have a big CSV file and it takes ages to read. Can I read it in parallel in R, using a package like `parallel` or something related? I've tried `mclapply`, but it is not working.
- Hi, have you checked out this post on [SO](http://stackoverflow.com/questions/9060457/r-is-it-possible-to-parallelize-speed-up-the-reading-in-of-a-20-million-plus)? Also, check out `fread` in the `data.table` package. It might do what you need (but isn't in parallel). – Richard Erickson Apr 29 '15 at 15:50
- What is `big`? Number of rows, number of columns, size of the CSV? Also, add your code, even if it is not working. I think you could use `fread` within `mclapply` and specify row-number chunks. – zx8754 Apr 29 '15 at 16:22
- I was thinking that using only one core was the slow part. Using `fread` I can now do it in about 5% of the time: the CSV file is 1.2 GB, and `read.csv` took about 4-5 minutes where `fread` takes just 14 seconds. Thanks, Richard. I'll check whether I can use `fread()` with `mclapply`; thanks, zx. – Ansjovis86 Apr 29 '15 at 20:38
- @Ansjovis86 You can post what works best for you as an answer. – Frank May 01 '15 at 17:08
- @Frank I wrote up my comment as an answer using the OP's comments. – Richard Erickson May 01 '15 at 18:52
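For reference, here is a minimal sketch of the chunked approach zx8754 suggests above: `fread()` called inside `mclapply()` over row-number ranges. The file name, chunk size, and core count are illustrative assumptions, and `mc.cores > 1` only works on Unix-alike systems.

```r
library(data.table)
library(parallel)

file       <- "myFile.csv"                           # assumed file name
chunk_size <- 1e6L                                   # rows per chunk (assumption)
n_rows     <- nrow(fread(file, select = 1L))         # cheap row count: read one column only
starts     <- seq(0L, n_rows - 1L, by = chunk_size)  # first data row of each chunk
col_names  <- names(fread(file, nrows = 0L))         # read the header once

chunks <- mclapply(starts, function(s) {
  fread(file,
        skip      = s + 1L,                          # +1 skips the header line
        nrows     = min(chunk_size, n_rows - s),
        header    = FALSE,
        col.names = col_names)
}, mc.cores = 4L)

dt <- rbindlist(chunks)                              # combine chunks into one data.table
```

In practice the chunks still contend for the same disk, so this may not beat a single `fread()` call, which is what ended up being enough for the OP.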
1 Answer
Based upon the comments from the OP, `fread` from the `data.table` package worked. Here's the code:
library(data.table)
dt <- fread("myFile.csv")
In the OP's case, reading a 1.2 GB file took about 4-5 minutes with `read.csv` and just 14 seconds with `fread`.
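If you want to reproduce that comparison on your own file, `system.time()` is enough; the file name below is just a placeholder.

```r
library(data.table)

system.time(df <- read.csv("myFile.csv"))  # base R reader
system.time(dt <- fread("myFile.csv"))     # data.table's fread
```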
Update 29 January 2021: It appears that `fread()` now reads files in parallel, per a tweet from the package's creator.
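As a hedged illustration (not part of the original answer), data.table exposes that multithreading through `getDTthreads()` / `setDTthreads()` and `fread()`'s `nThread` argument; the thread count and file name here are assumptions.

```r
library(data.table)

getDTthreads()                            # threads data.table will use by default
setDTthreads(4L)                          # cap data.table at 4 threads
dt <- fread("myFile.csv", nThread = 4L)   # nThread defaults to getDTthreads()
```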

Richard Erickson