I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files.

The file in question has 313 rows and ~6.6 million columns of numeric data, and is around 12GB on disk. The machine runs CentOS 6.4 with 512GB of RAM.

When I attempt to read in the file:

g=fread('final.results',header=T,sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow

I tried starting R with `R --max-ppsize=500000`, which is the maximum allowed, but got the same error.

I also tried setting the stack size to unlimited via

ulimit -s unlimited

Virtual memory was already set to unlimited.

Am I being unrealistic with a file of this size? Did I miss something fairly obvious?

  • Please try v1.8.9 on R-Forge (link on the data.table homepage). There are 10 bug fixes to `fread` there, see NEWS. Large file support is one of them, but on Windows, as it should already be ok on Linux. 6.6 million columns (!) is new and could well be a new bug. Please confirm with v1.8.9 and we'll go from there... – Matt Dowle Aug 26 '13 at 18:01
  • @MatthewDowle Yes, I'm not happy with 6 million rows either. Installed 1.8.9, same error. I made a much smaller file, 10 rows x 50K cols, same error. With 10 rows x 49,999 cols it works. – mpmorley Aug 26 '13 at 18:39
  • 1
    Did you mean columns in that comment (you wrote 6 million rows)? Very interesting and strange it fails at 50,000 columns exactly. Well done for honing in on that so quickly. I don't recall any column limit like that. Will take a look. – Matt Dowle Aug 26 '13 at 18:47
  • Sorry, yes, columns. A question regarding the first comment: I have 512GB of RAM, why wouldn't the file fit? Thanks, and great work with data.table. – mpmorley Aug 26 '13 at 19:03
  • Apols, I misread. For some reason I interpreted 512 as MB, even though you wrote 512GB; 512GB of RAM is quite big! So, yes, it _should_ of course read in fine. Does `read.table` / `read.csv` work with 50k columns and 6e6 columns? – Matt Dowle Aug 26 '13 at 19:19
  • You are using 64bit R, not 32bit R? Both would run on your 64bit box. Just checking. The R startup banner should confirm. – Matt Dowle Aug 26 '13 at 19:27
  • Yes, 64-bit. The 50K-col file can be read with read.table. I haven't tried reading the 6e6-col table, even with an "optimized" read.table; I just can't imagine it finishing in a reasonable time frame. The very odd thing is I have read in a 313 row x 330K col file before; however, that data was all integers. – mpmorley Aug 26 '13 at 19:44

1 Answer

Now fixed in v1.8.9 on R-Forge.

  • An unintended 50,000 column limit has been removed in fread. Thanks to mpmorley for reporting. Test added.

The reason was that I got this part wrong in the fread.c source:

// *********************************************************************
// Allocate columns for known nrow
// *********************************************************************
ans=PROTECT(allocVector(VECSXP,ncol));
protecti++;
setAttrib(ans,R_NamesSymbol,names);
for (i=0; i<ncol; i++) {
    thistype  = TypeSxp[ type[i] ];
    thiscol = PROTECT(allocVector(thistype,nrow));   // ** HERE **: one PROTECT per column, so very wide files overflow the pointer protection stack
    protecti++;
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
    SET_VECTOR_ELT(ans,i,thiscol);
}

According to R-exts section 5.9.1, that PROTECT inside the loop isn't needed:

In some cases it is necessary to keep better track of whether protection is really needed. Be particularly aware of situations where a large number of objects are generated. The pointer protection stack has a fixed size (default 10,000) and can become full. It is not a good idea then to just PROTECT everything in sight and UNPROTECT several thousand objects at the end. It will almost invariably be possible to either assign the objects as part of another object (which automatically protects them) or unprotect them immediately after use.

So that PROTECT is now removed and all is well; a sketch of the corrected loop is below. (It seems the pointer protection stack limit has been raised from 10,000 to 50,000 since that text was written; Defn.h contains #define R_PPSSIZE 50000L.) I've checked all other PROTECTs in the data.table C source for anything similar, and found and fixed one in assign.c too (when adding more than 50,000 columns by reference); no others.
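
For illustration, here is a minimal sketch of the corrected loop under that advice (illustrative, not necessarily the exact committed code; all names come from the snippet above). Each column is assigned into the already-PROTECTed list as soon as it is allocated, which keeps it reachable by the garbage collector, so the per-column PROTECT goes away and only one protection-stack slot is used regardless of ncol:

ans=PROTECT(allocVector(VECSXP,ncol));   // the only PROTECT needed: one stack slot total
protecti++;
setAttrib(ans,R_NamesSymbol,names);
for (i=0; i<ncol; i++) {
    thistype = TypeSxp[ type[i] ];
    thiscol  = allocVector(thistype,nrow);   // no PROTECT here ...
    SET_VECTOR_ELT(ans,i,thiscol);           // ... assigning into ans protects it
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
}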

Thanks for reporting!

Matt Dowle
  • Tested with a file of 313 rows and 6536299 cols. `system.time(geno<-fread('final.results',sep=' ',header=TRUE))` reports user 881.321, system 16.594, elapsed 923.957 (seconds). – mpmorley Aug 27 '13 at 14:04
  • @mpmorley Great, thanks for testing. 15 mins to read a 12GB file sounds ok, maybe, given it's so wide. Is it ok for you? Does anything else read it faster? Would be interested to see the timing breakdown reported by `verbose=TRUE`. If it's mostly spent mmap'ing for example, there may be further options to tune. – Matt Dowle Aug 27 '13 at 22:06
  • For me it's acceptable. I'm now focused on doing something useful with all this data, so you might see a new post. However, I'm glad to run it and post the verbose output if you and others are interested. – mpmorley Aug 29 '13 at 19:40