
I have a big CSV file and it takes ages to read. Can I read it in parallel in R using a package like "parallel" or something related? I've tried using mclapply, but it is not working.

Ansjovis86
  • Hi, have you checked out this post on [SO](http://stackoverflow.com/questions/9060457/r-is-it-possible-to-parallelize-speed-up-the-reading-in-of-a-20-million-plus)? Also, check out `fread` in the `data.table` package. It might do what you need (but isn't parallel). – Richard Erickson Apr 29 '15 at 15:50
  • What is `big`? Number of rows, columns, what is the size of the CSV? Also, add your code, even if it is not working. I think you could use `fread` within `mclapply` and specify row-number chunks. – zx8754 Apr 29 '15 at 16:22
  • I was thinking that using only one core is slow. Now using fread I can do it in 5% of the time. It was a 1.2 GB CSV file; with read.csv it took about 4-5 minutes and now just 14 seconds. Thanks Richard. I'll try to check if I can use fread() with mclapply, zx, thanks. – Ansjovis86 Apr 29 '15 at 20:38
  • @Ansjovis86 You can post what works best for you as an answer. – Frank May 01 '15 at 17:08
  • @Frank I wrote up my comment as an answer using the OP's comments. – Richard Erickson May 01 '15 at 18:52

1 Answer


Based upon the OP's comments, `fread` from the `data.table` package worked. Here's the code:

library(data.table)
dt <- fread("myFile.csv")  # reads the file into a data.table, much faster than read.csv

In the OP's case, reading a 1.2 GB file took about 4-5 minutes with `read.csv` and just 14 seconds with `fread`.
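
The comments also suggested combining `fread` with `mclapply` to read the file in row-number chunks across cores. No working code was posted for that, but a minimal sketch of the idea might look like the following (the file name, chunk count, and the `readLines`-based row count are assumptions for illustration, not part of the original answer):

library(data.table)
library(parallel)

path <- "myFile.csv"   # placeholder file name
n_chunks <- 4L         # placeholder: roughly one chunk per core

# Count the data rows once (minus the header line).
n_rows <- length(readLines(path)) - 1L
chunk_size <- ceiling(n_rows / n_chunks)
starts <- seq(0L, n_rows - 1L, by = chunk_size)

# Grab the column names from the header, since later chunks skip it.
col_names <- names(fread(path, nrows = 0L))

# Read each row range on its own core, then stitch the pieces together.
chunks <- mclapply(starts, function(s) {
  fread(path,
        skip = s + 1L,                        # +1 skips the header line
        nrows = min(chunk_size, n_rows - s),
        header = FALSE,
        col.names = col_names)
}, mc.cores = n_chunks)

dt <- rbindlist(chunks)

Whether this beats a single `fread` call depends on the disk and the file; for many setups the single call is already fast enough, as the OP's timing suggests.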

Update 29 January 2021: It appears that `fread()` now runs in parallel, per a tweet from the package's creator.
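
For reference, a small sketch of how the thread count can be checked or set in current `data.table` versions (the file name and thread counts below are placeholders):

library(data.table)

getDTthreads()               # threads data.table will use by default
setDTthreads(4L)             # set the count globally, if desired
dt <- fread("myFile.csv")    # fread() uses the threads set above

# Or set it for a single call:
dt <- fread("myFile.csv", nThread = 4L)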

Richard Erickson