
Possible Duplicate:
Quickly reading very large tables as dataframes in R

I need to repeatedly read different sections of data from a big (more than 50 MB) text file. My code currently works, but each iteration takes about 30 seconds to finish. Can anyone help me improve its efficiency?

My code:

Skip_s = c(88334, 92244, 92635, 96154, 96545, 100455, 100846, 104365, 104756,
           112967, 123524, 134081, 145929, 156877, 170171, 183856, 194804, 206143,
           217482, 230385, 245243, 255800)   # starting row of each block

nrows_s = 380   # number of rows to read per block
for (k in 1:length(Skip_s)) {
    # read.table is used because it splits the loaded lines into columns
    bb = read.table("file", skip = Skip_s[k] - 1, nrows = nrows_s,
                    colClasses = c("character", rep("numeric", 6)))
    TOT_s[k, 1, Ite] = mean(bb$V4[1:105])
    TOT_s[]..... # there are about 20 statements like this one
}

The complete version of the above code takes about 30 seconds per iteration, and most of that time is spent in read.table. Is there a way to optimize it?

Here is an example of the data read by the read.table call:

 CNC HORIZON     COMPARTMENT      TOTAL      ADSORBED   DISSOLVED  GAS CONC.
 CNC                              (MG/KG)    (MG/KG)    (MG/L)     (MG/L)
 CNC --------------------------------------------------------------------------
 CNC
 CNC
 CNC   1            1             0.4062     0.3737      1.210     0.2419E-05
 CNC   1            2             0.4942     0.4547      1.472     0.2943E-05
 CNC   1            3             0.4930     0.4536      1.468     0.2936E-05
 CNC   1            4             0.4812     0.4427      1.433     0.2865E-05
 CNC   1            5             0.4682     0.4307      1.394     0.2788E-05
 CNC   1            6             0.4550     0.4186      1.355     0.2710E-05
 CNC   1            7             0.4418     0.4065      1.315     0.2631E-05
 CNC   1            8             0.4286     0.3944      1.276     0.2552E-05
 CNC   1            9             0.4154     0.3822      1.237     0.2474E-05
 CNC   1           10             0.4022     0.3701      1.198     0.2395E-05
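For illustration, here is a small self-contained sketch that parses a few of the sample lines above with the same read.table() arguments as in my loop (the text is pasted inline here instead of being read from the big file, so skip/nrows are omitted):

# A few of the sample lines above, pasted as a string; in the real script
# these lines come from the big file via skip/nrows.
sample_block <- "
 CNC   1            1             0.4062     0.3737      1.210     0.2419E-05
 CNC   1            2             0.4942     0.4547      1.472     0.2943E-05
 CNC   1            3             0.4930     0.4536      1.468     0.2936E-05
"

bb <- read.table(text = sample_block,
                 colClasses = c("character", rep("numeric", 6)))

# read.table names the columns V1..V7; V4 is the TOTAL (MG/KG) column
mean(bb$V4)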
  • read http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – mnel Oct 25 '12 at 22:42
  • I agree with @mnel, and would point out that at a certain point, if your data processing needs are fairly simple, and you can't fit everything in memory, it might be worth investigating moving some of this work to a SQL db. – joran Oct 25 '12 at 22:47
  • 3
    Its also worth mentioning that with a 50 MB file you can probably read it into R in its entirety and then do your slicing. This will be your fastest choice by far. also big and 50MB don't usually go together these days :) – Justin Oct 25 '12 at 22:52
  • This does seem to be a duplicate. My impression is that, of the 4 items in the accepted answer to the duplicate, the use of colClasses is likely to be the most important, so I don't know how much extra traction you can expect. It looks like this might be fixed-width data; there is a `read.fwf` function that might not need to parse each line. – IRTFM Oct 25 '12 at 22:52
  • And as I told you yesterday, if performance matters, read binary data (provided you can write it that way first). – Dirk Eddelbuettel Oct 25 '12 at 23:17
  • DWin, read.fwf is brutally slow. – John Oct 26 '12 at 00:04
  • @Justin, thanks for your comments. My problem is that I have to read different 50 MB data files in each iteration. The worse part is that I have more than 10,000 iterations, which will take at least 3-4 days. So I am wondering: is there a quick way to read the data and assign columns to it? – TTT Oct 26 '12 at 17:28
  • @DWin, I tried colClasses, which only improved things by about 10%. – TTT Oct 26 '12 at 17:29
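
A minimal sketch of the read-once-and-slice approach suggested in the comments above, assuming the file fits in memory and that Skip_s, nrows_s, TOT_s, and Ite are defined as in the question (the file name is a placeholder):

# Read the whole 50 MB file into memory once...
all_lines <- readLines("file")

nrows_s <- 380
for (k in 1:length(Skip_s)) {
    # ...then parse only the k-th block, without rescanning the file each time
    block <- all_lines[Skip_s[k]:(Skip_s[k] + nrows_s - 1)]
    bb <- read.table(text = block,
                     colClasses = c("character", rep("numeric", 6)))
    TOT_s[k, 1, Ite] <- mean(bb$V4[1:105])
}

Reading with readLines() once and slicing in memory avoids re-scanning the file from the start on every read.table(skip = ...) call; whether it helps enough here would need to be timed on the real file.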

0 Answers