
My data frames are 15k rows x 200k columns. This is the first time I have tried to write one of them to a TSV file, and I am surprised at how slow my code is: it has been running for three days and has still not finished. This is unacceptable. What techniques can I use to reduce the writing time?

I know it is quick to save the data as R objects, but I have to send it to another person who does not use R. The common format we can both use is therefore a plain text file.


Confirmation

I confirm that write_csv from the readr package writes my files much faster than the base write.table function. However, it does not let me specify the separator I want, so it is not ideal for my case. I ended up using this trick: first I collapse each row of my huge data frame into a single tab-separated string, which gives a character vector:

forwriteout <- apply(mydf, 1, function(x){paste(x, collapse = "\t")})

Then I write forwriteout to disk with the base write function. This is almost as fast as write_csv. See the benchmark below.

                 expr       min        lq      mean    median        uq       max neval
        pasteandwrite  281.8968  283.4586  288.5968  289.2780  292.2049  295.6102    10
     normalwritetable 1973.7250 1981.6122 1999.1016 1997.5792 2014.2397 2028.3227    10
 usewritecsvfromreadr  201.6592  202.6115  215.2030  216.4946  226.1103  229.3069    10
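
For anyone who wants to try the paste-and-write trick end to end, here is a minimal sketch in base R. The small random data frame, the temporary file, and the explicit header line are my assumptions for illustration and are not part of the original code. Note that apply() goes through as.matrix(), so a data frame with mixed column types may be formatted differently from an all-numeric one.

```r
# Small hypothetical stand-in for the real 15k x 200k data frame.
set.seed(1)
mydf <- as.data.frame(matrix(round(runif(200 * 50), 3), nrow = 200))

# Hypothetical output location; replace with a real path.
out_file <- tempfile(fileext = ".tsv")

# Collapse each row into one tab-separated string, as in the trick above.
forwriteout <- apply(mydf, 1, function(x) paste(x, collapse = "\t"))

# The row-wise paste does not produce a header, so build one from the column names.
header <- paste(names(mydf), collapse = "\t")

# Write the header followed by one line per row.
writeLines(c(header, forwriteout), out_file)

# Read the file back to check that the shape and values survive the round trip.
roundtrip <- read.delim(out_file, sep = "\t", header = TRUE)
```

writeLines() is used here instead of write() only because it takes the header and the rows in one call; either should work for a character vector.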
Christopher Bottoms
biocyberman

1 Answer


Many people use write.csv() to write flat files. However, there is a relatively new package called readr that reads and writes much more quickly.

http://cran.r-project.org/web/packages/readr/readr.pdf

readr's write_csv() is about twice as fast as write.csv(), and it never writes row names.

There, I got you down to 1.5 days (and still running).
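
If the output has to be tab-separated, here is a minimal sketch, assuming the installed readr version exports write_tsv() and write_delim() (worth checking against your version, since older releases may differ). The tiny data frame and temporary file are hypothetical, and the base R fallback is only there so the snippet still runs without readr.

```r
# Hypothetical toy data; the real data frame is far larger.
mydf <- data.frame(id = 1:3, value = c(0.5, 1.25, 2), label = c("a", "b", "c"))

# Hypothetical output location; replace with a real path.
out_file <- tempfile(fileext = ".tsv")

if (requireNamespace("readr", quietly = TRUE)) {
  # Assumes this readr version provides write_tsv().
  # An equivalent call would be: readr::write_delim(mydf, out_file, delim = "\t")
  readr::write_tsv(mydf, out_file)
} else {
  # Base R fallback: tab separator, no quoting, no row names.
  write.table(mydf, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
}

# Inspect the result: a header line followed by one line per row.
first_lines <- readLines(out_file)
```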

Other tricks:

  • Only write the data you need. Row names are an obvious thing to drop.
  • Use round() to round numeric fields to the minimum number of decimal places you need.
  • Benchmark. Try writing 1% of your data to disk and time it, then try the tricks or packages mentioned here and time it again to see what works.
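
The tricks above can be sketched in base R as follows. This is only an illustration under assumptions: the random data frame, the choice of 3 decimal places, and the helper time_write() are all hypothetical, and timings on a sample this small will be noisy.

```r
# Hypothetical stand-in for the real data frame.
set.seed(42)
mydf <- as.data.frame(matrix(rnorm(2000 * 100), nrow = 2000))

# Take roughly 1% of the rows for a quick benchmark.
n_sample <- max(1, floor(0.01 * nrow(mydf)))
sample_df <- mydf[sample(nrow(mydf), n_sample), , drop = FALSE]

# Round numeric columns to an assumed 3 decimal places.
is_num <- vapply(sample_df, is.numeric, logical(1))
rounded_df <- sample_df
rounded_df[is_num] <- lapply(sample_df[is_num], round, digits = 3)

# Hypothetical helper: write a data frame as TSV without row names,
# returning the elapsed seconds and the resulting file size in bytes.
time_write <- function(df) {
  f <- tempfile(fileext = ".tsv")
  on.exit(unlink(f))
  elapsed <- system.time(
    write.table(df, f, sep = "\t", quote = FALSE, row.names = FALSE)
  )[["elapsed"]]
  c(seconds = elapsed, bytes = file.size(f))
}

# Compare full-precision output with rounded output on the same sample.
results <- rbind(
  full_precision = time_write(sample_df),
  rounded = time_write(rounded_df)
)
print(results)
```

Rounding mainly shrinks the file, so comparing the bytes column is a more stable signal than the seconds column on a small sample.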
Michael Plazzer