
I have a large data set (several GB) that I have to process before I analyse it. I tried creating a connection, which lets me loop through the large data set and extract one chunk at a time. This allows me to quarantine the data that satisfies some conditions.

My problem is that I am not able to give the connection an indicator that it has been exhausted, so that I can execute close(con) when the end of the data set is reached. Moreover, for the first chunk of extracted data I have to skip 17 lines, since the file starts with a header that R is not able to read.

A manual attempt that works:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)    
data<-read.table(con,nrows=1000,skip=0,header=FALSE)    
.    
.    
.    
till end of dataset

Since I want to avoid manually keying the commands above until I reach the end of the data set, I attempted to write a loop to automate the process, but it was unsuccessful.

My attempt with loops that failed:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)        
if (nrow(rval)==0) {    
  con <<-NULL    
  close(con)    
  }else{    
    if(nrow(rval)!=0){    
    con <<-file(description=filename, open="r")    
    data<-read.table(conn,nrows=1000,skip=0,header=FALSE)      
  }}    
user1922730
    Have you investigated the `ff` package, and `read.table.ffdf`? – mnel Sep 02 '13 at 02:03
  • It's not a good idea to tackle this problem with base R only. Packages `ff`, `bigmemory` and even `data.table` come to mind. – Ferdinand.kraft Sep 02 '13 at 02:22
  • Files in GBs stored as text are not actually very big. Try compressing them before analysing. The main constraint is disk I/O. You can use read.table and save the result in RData format with compression level 9. The compression ratio is about 10%, depending on your contents, so in the end your files are only MBs. – Bangyou Sep 02 '13 at 03:59
  • Maybe package [LaF](http://cran.r-project.org/web/packages/LaF/index.html) is also useful in your case? –  Sep 02 '13 at 07:14
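
As a concrete illustration of the `ff` suggestion in the comments above, a minimal sketch could look like the following; the chunk sizes are arbitrary, and the 17 header lines from the question would still need to be dealt with separately:

library(ff)
## read the file chunk-wise into an on-disk ffdf object instead of holding it all in RAM
big <- read.table.ffdf(file="nameoffile.txt", header=FALSE,
                       first.rows=10000, next.rows=100000)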

1 Answer


Looks like you're on the right track. Just open the connection once (you don't need `<<-`, plain `<-` is fine) and use a larger chunk size, so that R's vectorized operations can be used to process each chunk efficiently, along the lines of

filename <- "nameoffile.txt"
nrows <- 1000000
con <- file(description=filename,open="r")    
## N.B.: skip = 17 from original prob.! Usually not needed (thx @Moody_Mudskipper)
data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## process chunk 'data' here, then...
    ## ...read next chunk
    if (nrow(data) != nrows)   # a short chunk means this was the final chunk
        break
    data <- tryCatch({
        read.table(con, nrows=nrows, skip=0, header=FALSE)
    }, error=function(err) {
       ## matching condition message only works when message is not translated
       if (identical(conditionMessage(err), "no lines available in input"))
          data.frame()
       else stop(err)
    })
}
close(con)    

Iteration seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting reads at the end of the file.
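
As a usage example, here is a minimal sketch of what the "process chunk" step above could look like when the goal is to quarantine rows that satisfy a condition, as in the question. The column `V1` and the threshold `100` are placeholders, not taken from the question:

filename <- "nameoffile.txt"
nrows <- 1000000
keep <- list()
con <- file(description=filename, open="r")
data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## the "process" step: stash the rows that satisfy the (made-up) condition
    keep[[length(keep) + 1L]] <- data[data$V1 > 100, , drop=FALSE]
    if (nrow(data) != nrows)      # short chunk: end of file reached
        break
    ## blunter than the tryCatch above: treat any read error as end of input
    data <- tryCatch(read.table(con, nrows=nrows, header=FALSE),
                     error=function(err) data.frame())
}
close(con)
result <- do.call(rbind, keep)    # all quarantined rows in one data frame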

Martin Morgan
  • Do you get this error message when you read the last iteration? `Error in read.table(infile, header = FALSE, nrows = 10, sep = ",", stringsAsFactors = FALSE) : no lines available in input In addition: Warning message: In read.table(infile, header = FALSE, nrows = 10, sep = ",", stringsAsFactors = FALSE) : incomplete final line found by readTableHeader on 'data/temp.csv'` Any way round it? – mchangun Oct 18 '13 at 03:51
  • @mchangun Tried to elaborate, but it's a bit of a hack. – Martin Morgan Oct 18 '13 at 05:48
  • I actually found another way around this: http://stackoverflow.com/questions/19441236/read-table-in-chunks-error-message . Seems a bit more elegant. Thanks for your reply though! – mchangun Oct 18 '13 at 06:48
  • @mchangun that fails when the file has lines equal to some multiple of nrows -- you read the last full chunk, and then try to read zero lines. – Martin Morgan Oct 18 '13 at 12:10
  • For those that just come here to grab the code fast and run, please note the `skip=17` in there that you may want to remove ;) – moodymudskipper Sep 01 '17 at 09:34
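
To make the multiple-of-nrows edge case in the last comment concrete, here is a tiny reproduction against a throwaway file (the line count and chunk size are arbitrary):

tmp <- tempfile()
writeLines(as.character(1:20), tmp)                  # exactly 2 * nrows lines
con <- file(tmp, open="r")
chunk1 <- read.table(con, nrows=10, header=FALSE)    # rows 1-10
chunk2 <- read.table(con, nrows=10, header=FALSE)    # rows 11-20, still a full chunk
## nothing is left, so the next read fails with "no lines available in input"
try(read.table(con, nrows=10, header=FALSE))
close(con)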