
I have a big CSV file and it takes ages to read. Can I read it in parallel in R using a package like "parallel" or something related? I've tried using mclapply, but it is not working.

Ansjovis86
  • Hi, have you checked out this post on [SO](http://stackoverflow.com/questions/9060457/r-is-it-possible-to-parallelize-speed-up-the-reading-in-of-a-20-million-plus)? Also, check out `fread` in the `data.table` package. It might do what you need (but isn't parallel). – Richard Erickson Apr 29 '15 at 15:50
  • What is `big`? Number of rows, columns, what is the size of the CSV? Also, add your code, even if it is not working. I think you could use `fread` within `mclapply` and specify row-number chunks. – zx8754 Apr 29 '15 at 16:22
  • I was thinking that using only one core is slow. Now using fread I can do it in 5% of the time. It was a 1.2 GB CSV file; with read.csv it took about 4-5 minutes and now just 14 seconds. Thanks Richard. I'll try to check if I can use fread() with mclapply, zx, thanks. – Ansjovis86 Apr 29 '15 at 20:38
  • @Ansjovis86 You can post what works best for you as an answer. – Frank May 01 '15 at 17:08
  • @Frank I wrote up my comment as an answer using the OP's comments. – Richard Erickson May 01 '15 at 18:52

1 Answer


Based upon the OP's comments, `fread` from the `data.table` package worked. Here's the code:

library(data.table)
dt <- fread("myFile.csv")  # reads the file into a data.table, much faster than read.csv

In the OP's case, reading a 1.2 GB file took about 4-5 minutes with `read.csv` and just 14 seconds with `fread`.
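
The comments also suggested combining `fread` with `mclapply` to read the file in row-number chunks across cores. No working code was posted for that, but a minimal sketch of the idea might look like the following (the file name, chunk count, and the `readLines`-based row count are assumptions for illustration, not part of the original answer):

library(data.table)
library(parallel)

path <- "myFile.csv"   # placeholder file name
n_chunks <- 4L         # placeholder: roughly one chunk per core

# Count the data rows once (minus the header line).
n_rows <- length(readLines(path)) - 1L
chunk_size <- ceiling(n_rows / n_chunks)
starts <- seq(0L, n_rows - 1L, by = chunk_size)

# Grab the column names from the header, since later chunks skip it.
col_names <- names(fread(path, nrows = 0L))

# Read each row range on its own core, then stitch the pieces together.
chunks <- mclapply(starts, function(s) {
  fread(path,
        skip = s + 1L,                        # +1 skips the header line
        nrows = min(chunk_size, n_rows - s),
        header = FALSE,
        col.names = col_names)
}, mc.cores = n_chunks)

dt <- rbindlist(chunks)

Whether this beats a single `fread` call depends on the disk and the file; for many setups the single call is already fast enough, as the OP's timing suggests.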

Update 29 January 2021: It appears that `fread()` now runs in parallel, per a tweet from the package's creator.
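
For reference, a small sketch of how the thread count can be checked or set in current `data.table` versions (the file name and thread counts below are placeholders):

library(data.table)

getDTthreads()               # threads data.table will use by default
setDTthreads(4L)             # set the count globally, if desired
dt <- fread("myFile.csv")    # fread() uses the threads set above

# Or set it for a single call:
dt <- fread("myFile.csv", nThread = 4L)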

Richard Erickson