
I am writing a simple command-line Rscript that reads some binary data and writes it out as a stream of numeric characters. The data is in a specific format, and R has a very quick library for dealing with the binary files in question. The file (of 7 million characters) is read quickly - in less than a second:

library(affyio)
system.time(CEL <- read.celfile("testCEL.CEL"))

user  system elapsed 
0.462   0.035   0.498

I want to write part of the read data to stdout:

str(CEL$INTENSITY$MEAN)
num [1:6553600] 6955 225 7173 182 148 ...

As you can see, it's numeric data with ~6.5 million integers.

And the writing is terribly slow:

system.time(write(CEL$INTENSITY$MEAN, file="TEST.out"))
user  system elapsed 
8.953  10.739  19.694

(Here the writing is done to a file, but writing to standard output from Rscript takes the same amount of time.)

cat(vector) does not improve the speed at all. One improvement I found is this:

system.time(writeLines(as.character(CEL$INTENSITY$MEAN), "TEST.out"))
user  system elapsed 
6.282   0.016   6.298

It is still a far cry from the read speed (and the read pulled in 5 times more data than this particular vector). Moreover, I have the overhead of transforming the entire vector to character before I can proceed. Plus, when sinking to stdout, I cannot terminate the stream with CTRL+C if I accidentally fail to redirect it to a file.
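
For what it's worth, converting and writing in chunks (sketched below with an arbitrary chunk size) at least makes the stream interruptible, though the total time is no better:

v <- CEL$INTENSITY$MEAN
chunk <- 100000L                     # arbitrary chunk size
for (i in seq(1L, length(v), by = chunk)) {
  writeLines(as.character(v[i:min(i + chunk - 1L, length(v))]))   # defaults to stdout
}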

So my question is: is there a faster way to simply output a numeric vector from R to stdout?

Also, why is reading the data in so much faster than writing it out? And this is not only for binary files, but in general:

system.time(tmp <- scan("TEST.out"))
Read 6553600 items
user  system elapsed 
1.216   0.028   1.245 
– Karolis Koncevičius

1 Answer


Binary reads are fast. Printing to stdout is slow for two reasons:

  • formatting
  • actual printing

You can benchmark / profile either. But if you really want to be fast, stay away from formatting when printing lots of data.
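
For instance, a rough way to separate the two costs (a sketch reusing the vector from the question; the file name is arbitrary):

system.time(s <- as.character(CEL$INTENSITY$MEAN))   # cost of formatting alone
system.time(writeLines(s, "TEST.out"))               # cost of writing pre-formatted text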

Compiled code can help make the conversion faster. But again, the fastest solution, as sketched below, will be to

  • remain with binary
  • not write to stdout or a file (but use e.g. something like Redis).
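
For example, writing the vector back out in binary with base R's writeBin() skips text formatting entirely (a sketch; the file name is arbitrary and timings will vary by disk and OS):

con <- file("TEST.bin", open = "wb")
writeBin(CEL$INTENSITY$MEAN, con)   # raw 8-byte doubles, no text conversion
close(con)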
– Dirk Eddelbuettel
  • Thanks for the quick reply. You seem to be saying that this is not really possible and should be expected to be slow. I am not really well versed in input/output streams, so forgive me in advance if this next question is silly, but: shouldn't it be possible to speed this up at least in principle (using Rcpp or something like that)? I searched around before posting this question and found a [SO answer](http://stackoverflow.com/a/5025822/1953718) for Java where they claim to write a 170 MB CSV file to disk in 0.3 s. – Karolis Koncevičius Jan 11 '15 at 01:17
  • Binary to binary is fast. Binary to something else takes time. Writing ASCII takes time. But do not believe *anyone* and just *profile and measure* on your data. – Dirk Eddelbuettel Jan 11 '15 at 01:18
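
A minimal measurement harness in that spirit (a sketch only; it assumes the CRAN package microbenchmark is installed, reuses the vector from the question, and the output file names are arbitrary):

library(microbenchmark)
x <- CEL$INTENSITY$MEAN
microbenchmark(
  text   = writeLines(as.character(x), "text.out"),   # format, then write ASCII
  binary = writeBin(x, "bin.out"),                    # raw doubles straight to disk
  times  = 5
)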