
I am processing big CSV files (~500-700 MB), so I am reading them chunk by chunk. I tried the read.csv() function, but it gets very slow as the number of rows to skip increases, so I found data.table::fread() to be a much faster way to read a file (per R-bloggers and Stack Overflow). Reading a 60 MB CSV file with fread() works fine (Reading 60MB file), but when I tried it on a bigger file (~450 MB) of the same type, it shows "R Session Aborted" (Reading 450MB file). Both files have the same structure and only differ in size. I am not able to understand why it is not working, since people read even bigger files with it.

Here is my code snippet:

library(data.table)

ffName <- "Bund001.csv"

# Start timer
s <- Sys.time()

ColNamesVector <<- c("RIC", "Date", "Time", "GMT_Offset", "Type", "Price",
                     "Volume", "Bid_Price", "Bid_Size", "Ask_Price",
                     "Ask_Size", "Qualifiers")

# Read one 100,000-row chunk, skipping the first 400,000 rows
rawData <- fread(ffName, sep = ",", nrows = 100000, skip = 400000,
                 col.names = ColNamesVector)

print(Sys.time() - s)
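
For context, the full chunk-by-chunk loop (described in the comments below) looks roughly like this. This is a minimal sketch: process_chunk() is a hypothetical stand-in for the per-chunk processing/saving step, and it assumes the file does not end exactly on a chunk boundary:

library(data.table)

ffName <- "Bund001.csv"
chunkSize <- 100000
offset <- 0

repeat {
  # header = FALSE so the first physical line is treated as data on every pass
  chunk <- fread(ffName, sep = ",", header = FALSE, nrows = chunkSize,
                 skip = offset, col.names = ColNamesVector)
  process_chunk(chunk)                 # hypothetical processing/saving step
  if (nrow(chunk) < chunkSize) break   # short chunk => reached end of file
  offset <- offset + chunkSize
}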
Abhinav Rawat
    Can you track RAM utilization before crash? How much RAM do you have on your computer? – Emmanuel-Lin Sep 11 '17 at 15:02
  • Currently 8 GB RAM is installed in my PC. Before running the script, it was `rsession = 65,880k` and `rstudio = 160,328k`; after the crash, `rstudio = 157,060k` and `rsession = 66,196k` before crashing and disappearing from the Task Manager window. I don't think RAM utilization is an issue, as I am able to read 500MB-700MB **csv** files with `read.csv()` and `read.csv.raw()` – Abhinav Rawat Sep 11 '17 at 15:26
  • Ok then, I guess there is something wrong a few lines into your dataset; `fread` is known to be sensitive. You should check that there are no missing separators or special characters in your large file (see the sketch after these comments). – Emmanuel-Lin Sep 11 '17 at 15:29
  • About missing separators: as a counter-example, this works perfectly fine: `rawData <- read.csv("Bund001.csv",sep = ",",nrows = chSize,skip = nskip,col.names = ColNamesVector)` – Abhinav Rawat Sep 11 '17 at 15:32
  • So is there any way to get this working, or any other faster way to read **csv** files? – Abhinav Rawat Sep 11 '17 at 15:34
  • You are able to read the whole file by chunks? (Ex: line 1 to 10k, 10001 to 20k... until the end?) – Emmanuel-Lin Sep 11 '17 at 15:42
  • Why do you use `<<-` and not just `<-` for creating `ColNamesVector`? – Jaap Sep 11 '17 at 17:01
  • @Emmanuel-Lin yes, and by reading chunk by chunk I mean: first I read 100000 lines, process them and save the data, then repeat the same from line 100001 onwards. – Abhinav Rawat Sep 13 '17 at 06:56
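
One base-R way to test the malformed-row hypothesis from the comments above (a slow but memory-light sketch; the 12 below assumes the column count implied by ColNamesVector in the question):

# Count comma-separated fields on every line of the file
fieldCounts <- count.fields("Bund001.csv", sep = ",")
table(fieldCounts)               # ideally a single entry: all lines have 12 fields
head(which(fieldCounts != 12))   # line numbers of any malformed rows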

2 Answers


Did you check NEWS first? (Other tips are on the data.table Support page.)

The screenshot included in your question shows you are using 1.10.4. As luck would have it, currently NEWS shows that 14 improvements have been made to fread since then and many are relevant to your question. Please try dev. The installation page explains that a pre-compiled binary for Windows is made for you and how to get it. You don't need to have any tools installed. That page explains you can revert easily should it not work out.

Please try v1.10.5 from dev and accept this answer if that fixes it.
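
For reference, a sketch of the install/revert steps. The repository URL below is an assumption based on what the data.table installation page documented at the time; prefer the URL given on that page:

# Pre-compiled dev build (Windows binary, no build tools needed);
# repo URL is an assumption -- check the data.table installation page
install.packages("data.table",
                 repos = "https://Rdatatable.github.io/data.table")

# Revert to the CRAN release if needed
remove.packages("data.table")
install.packages("data.table")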

Matt Dowle
  • I much appreciate your answer, but I'm currently in my internship and these people have GitHub blocked for us, due to reasons that nobody else can understand :P. I installed v1.10.4 (the current version on my R) from CRAN using `install.packages()`. I will try v1.10.5 over the weekend and get back to you with updates. Thanks again. – Abhinav Rawat Sep 13 '17 at 06:51
  • @AbhinavRawat My sympathies. Perhaps you can use your phone to view the homepage. Here's a direct link to the pre-compiled Windows binary (current 1.10.5) : https://ci.appveyor.com/api/buildjobs/g0hw382c9i9iuujj/artifacts/data.table_1.10.5.zip – Matt Dowle Sep 13 '17 at 15:55
  • I have the same problem with `fread`. It works with `nrows = 900000`, but not with `nrows = 1000000`; it also causes the R session to abort. I am using R 3.6.2 and data.table 1.12.8. It works with `read.csv2` (much slower, of course). – Mislav Jan 30 '20 at 09:20

It is not about size; it means your CSV is slightly out of spec.

I would advise trying readr; it is a bit slower but more tolerant of errors:

https://github.com/tidyverse/readr
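
A minimal sketch of the equivalent call (read_csv's skip/n_max arguments mirror fread's skip/nrows; the file name and column names are taken from the question):

library(readr)

# Same chunk as the question: skip 400,000 rows, then read 100,000
rawData <- read_csv("Bund001.csv",
                    col_names = ColNamesVector,
                    skip = 400000,
                    n_max = 100000)

readr also provides read_csv_chunked() for callback-based chunked processing, which avoids re-reading from the top of the file on every chunk.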

Severin Pappadeux