I have a data.frame and I want to write it out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to `write.csv`?

- Either get a faster hard drive or, if your data can be converted to a matrix, use `write`. – Joshua Ulrich May 08 '12 at 20:05
- But when I first import it into R with `read.table`, it automatically gives a data.frame, so do I need to use `as.matrix` after I finish my calculation? – lolibility May 08 '12 at 20:16
- Do you need to write it out as a CSV, or could you simply save it as an RData object or other compressed form? – Chase May 08 '12 at 20:28
- I want the output file to look like a matrix, with separated columns and rows. – lolibility May 08 '12 at 20:47
- @lolibility - I guess my question is more around why you need it to look like a matrix. Are you going to be opening this in another program or feeding it into something else? Or do you just need to save it so you can pull it up in R at a later date? As I show below, native R objects are faster to save and take up less space. For the example below, the CSV file takes ~275MB compared to ~80MB for the RData object. – Chase May 08 '12 at 21:30
- Might be related: http://stackoverflow.com/questions/4756989/how-to-load-data-quickly-into-r – daroczig May 08 '12 at 22:18
- @daroczig - interesting, thanks for pointing me in that direction. I've got an 8M row x 30 column dataset I've been wrangling with, and being able to grab a subset of it at times would be uber helpful. – Chase May 09 '12 at 02:48
- @Chase: the saves pkg would be handy if you do not need all columns from your big data at a time, e.g. when you need only 2-3 vars from the dataset for an analysis. – daroczig May 09 '12 at 09:58
6 Answers
`data.table::fwrite()` was contributed by Otto Seiskari and is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelisation) and wrote an article about it. Please report any issues on the tracker.
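For reference, a minimal call looks like this (a sketch; the data and file name here are made up):

library(data.table)                          # fwrite() requires data.table >= 1.9.8
DT <- data.table(x = 1:5, y = letters[1:5])  # toy table
fwrite(DT, "out.csv")                        # multi-threaded CSV writer, drop-in for write.csv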
First, here's a comparison on the same dimensions used by @Chase above (i.e., a very large number of columns: 65,000 columns (!) x 256 rows), together with `fwrite` and `write_feather`, so that we have some consistency across machines. Note the huge difference `compress=FALSE` makes in base R.
# -----------------------------------------------------------------------------
# function | object type | output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save | matrix | binary | FALSE | 0.3s | 134MB |
# save | data.frame | binary | FALSE | 0.4s | 135MB |
# feather | data.frame | binary | FALSE | 0.4s | 139MB |
# fwrite | data.table | csv | FALSE | 1.0s | 302MB |
# save | matrix | binary | TRUE | 17.9s | 89MB |
# save | data.frame | binary | TRUE | 18.1s | 89MB |
# write.csv | matrix | csv | FALSE | 21.7s | 302MB |
# write.csv | data.frame | csv | FALSE | 121.3s | 302MB |
Note that `fwrite()` runs in parallel. The timing shown here is on a 13" MacBook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), 512GB SSD, 256KB/core L2 cache and 4MB L4 cache. Depending on your system spec, YMMV.
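If you want to check or limit how many threads are used, data.table exposes `getDTthreads()` and `setDTthreads()`; `fwrite`'s `nThread` argument defaults to `getDTthreads()`. A short sketch:

library(data.table)
getDTthreads()                       # threads data.table will use by default on this machine
setDTthreads(2)                      # e.g. restrict to 2 threads on a shared machine
DT <- data.table(x = runif(1e6))
fwrite(DT, "out.csv", nThread = getDTthreads())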
I also reran the benchmarks on relatively more likely (and bigger) data:
library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
  str1 = sample(sprintf("%010d",1:NN)), # ID field 1
  str2 = sample(sprintf("%09d",1:NN)),  # ID field 2
  # varying length string field--think names/addresses, etc.
  str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
  # factor-like string field with 50 "levels"
  str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
  # factor-like string field with 17 levels, varying length
  str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T),
                                    collapse="")),NN,T),
  # lognormally distributed numeric
  num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
  # 3 binary strings
  str6 = sample(c("Y","N"),NN,T),
  str7 = sample(c("M","F"),NN,T),
  str8 = sample(c("B","W"),NN,T),
  # right-skewed (integer type)
  int1 = as.integer(ceiling(rexp(NN))),
  num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
  # lognormal numeric that can be positive or negative
  num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))
# -------------------------------------------------------------------------------
# function | object | out | other args | Runtime | File size |
# -------------------------------------------------------------------------------
# fwrite | data.table | csv | quote = FALSE | 1.7s | 523.2MB |
# fwrite | data.frame | csv | quote = FALSE | 1.7s | 523.2MB |
# feather | data.frame | bin | no compression | 3.3s | 635.3MB |
# save | data.frame | bin | compress = FALSE | 12.0s | 795.3MB |
# write.csv | data.frame | csv | row.names = FALSE | 28.7s | 493.7MB |
# save | data.frame | bin | compress = TRUE | 48.1s | 190.3MB |
# -------------------------------------------------------------------------------
So `fwrite` is ~2x faster than `feather` in this test. This was run on the same machine as noted above, with `fwrite` running in parallel on 2 cores. `feather` seems to be quite a fast binary format as well, but it has no compression yet.
Here's an attempt at showing how `fwrite` compares with respect to scale (see the plot produced by the reproduction code below). NB: the benchmark has been updated by running base R's `save()` with `compress = FALSE` (since feather is also not compressed).
So `fwrite` is the fastest of all of them on this data (running on 2 cores), plus it creates a `.csv`, which can easily be viewed, inspected and passed to `grep`, `sed` etc.
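For example, because the output is plain text, you can pre-filter it with a shell tool and read only the matching rows back into R. A sketch (assumes `grep` is on the PATH, uses a made-up pattern, and relies on `fread(cmd = ...)` being available in newer data.table releases):

library(data.table)
# read back only the lines of the CSV written above that contain the pattern
subset_dt <- fread(cmd = "grep '0000012345' fwrite_turbo.csv", header = FALSE)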
Code for reproduction:
require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))

DTn <- function(nn)
  data.table(
    str1 = sample(sprintf("%010d",1:nn)),
    str2 = sample(sprintf("%09d",1:nn)),
    str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
    str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
    str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
    num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
    str6 = sample(c("Y","N"),nn,T),
    str7 = sample(c("M","F"),nn,T),
    str8 = sample(c("B","W"),nn,T),
    int1 = as.integer(ceiling(rexp(nn))),
    num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
    num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

count <- data.table(n = ns,
                    c = c(rep(1000, 12),
                          rep(100, 6),
                          rep(10, 7)))

mbs <- lapply(ns, function(nn){
  print(nn)
  set.seed(51423)
  DT <- DTn(nn)
  microbenchmark(times = count[n==nn,c],
    write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
    save=save(DT, file = "save.RData", compress=FALSE),
    fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
    feather=write_feather(DT, "feather.feather"))})

png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
          y <- summary(x)[,"median"]
          y/y[3]})),
        main = "Relative Speed of fwrite (turbo) vs. rest",
        xlab = "", ylab = "Time Relative to fwrite (turbo)",
        type = "l", lty = 1, lwd = 2,
        col = c("red", "blue", "black", "magenta"), xaxt = "n",
        ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3,
       legend = c("write.csv", "save", "feather"),
       col = c("red", "blue", "magenta"))
dev.off()

- What about `readr::write_csv`? It would be nice to add it to the benchmarks. – Dmitriy Selivanov Apr 07 '16 at 22:29
- @DmitriySelivanov on a quick test run, `write_csv` was slower than `write.csv`... – MichaelChirico Apr 07 '16 at 23:23
- `write_csv` is super slow here because it uses a poor strategy for turning doubles into strings, see https://github.com/hadley/readr/issues/387. – hadley Apr 08 '16 at 12:18
- I also think that `save()` writes/reads columns of class `Date` correctly whereas `fwrite()` and `feather()` currently don't. So a fair comparison would be against `double`, `char` and `integer` types alone... at this point. – Arun Apr 14 '16 at 14:01
- The performance comparison could be extended with extra procedures (https://rpubs.com/demydd/170957). Tests are done on a theoretical sample and retested on actual data. – Dimon D. May 17 '16 at 15:12
If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speed up. Also, you can look into using `write.matrix()` from package `MASS`, though it did not prove faster for this example. Maybe I didn't set something up properly:
#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#As a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
# user system elapsed
# 319.53 13.65 333.76
#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
# user system elapsed
# 52.43 0.88 53.59
#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
# user system elapsed
# 113.58 59.12 172.75
EDIT
To address the concern raised below that the results above are unfair to the data.frame, here are some more results and timings to show that the overall message is still "convert your data object to a matrix if possible. If not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":
#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
# user system elapsed
# 317.85 13.95 332.44
#This even includes the overhead in converting to as.matrix in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
# user system elapsed
# 53.67 0.92 54.67
So, nothing really changes. To confirm this is reasonable, consider the relative time cost of `as.data.frame()`:
m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
# user system elapsed
# 0.77 0.00 0.77
So, it's not really a big deal, nor does it skew the information as much as the comment below suggests. If you're still not convinced that using `write.csv()` on large data.frames is a bad idea performance-wise, consult the manual under `Note`:
"write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead."
Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:
system.time(save(m2, file = "thisisfast.RData"))
# user system elapsed
# 21.67 0.12 21.81

- That's a bit of an unfair comparison... the `as.data.frame` takes considerable time. Furthermore, the data the OP has are already in a data.frame. – John May 08 '12 at 20:48
- @John - good points, though the relative overhead of `as.data.frame` is negligible compared to the overhead of using `write.csv()` and friends on a `data.frame` vis-a-vis a `matrix`. – Chase May 08 '12 at 21:25
- I know it's less, but it's better to have the answer that will probably be accepted not leave that question open for the naive reader. – John May 08 '12 at 21:26
- @John - yes, I agree completely. Thanks for the nudge in the right direction. I was honestly just being sloppy but wanted to give more than the `RTFM` response. And also, the overhead of `as.data.frame()` will increase with smaller data objects... – Chase May 08 '12 at 21:33
- In the final `system.time(save(...))` it's MUCH faster adding `compress=FALSE`: 14 seconds vs 0.2 seconds on my machine. – Matt Dowle Apr 12 '16 at 06:41
Another option is to use the feather file format.
df <- as.data.frame(matrix(runif(256*65536), nrow = 256))
system.time(feather::write_feather(df, "df.feather"))
#> user system elapsed
#> 0.237 0.355 0.617
Feather is a binary file format designed to be very efficient to read and write. It's designed to work with multiple languages: there are currently R and Python clients, and a Julia client is in the works.
For comparison, here's how long `saveRDS` takes:
system.time(saveRDS(df, "df.rds"))
#> user system elapsed
#> 17.363 0.307 17.856
Now, this is a somewhat unfair comparison because the default for `saveRDS` is to compress the data, and here the data is incompressible because it's completely random. Turning compression off makes `saveRDS` significantly faster:
system.time(saveRDS(df, "df.rds", compress = FALSE))
#> user system elapsed
#> 0.181 0.247 0.473
And indeed it's now slightly faster than feather. So why use feather? Well, it's typically faster than `readRDS()`, and you usually write the data relatively few times compared to the number of times that you read it.
system.time(readRDS("df.rds"))
#> user system elapsed
#> 0.198 0.090 0.287
system.time(feather::read_feather("df.feather"))
#> user system elapsed
#> 0.125 0.060 0.185

- See https://gist.github.com/markdanese/28b9f5412df55efceba754fee2363444 for a gist for anyone who wants to test it out. FWIW, fwrite is fast for a CSV but not in the same league as feather. – Mark Danese Apr 07 '16 at 18:38
- `feather` is great, but not relevant to the original question, because it is a binary format... – Dmitriy Selivanov Apr 07 '16 at 22:31
- @DmitriySelivanov I just re-read the original question and I don't see where it requests a plain text format. – hadley Apr 08 '16 at 12:17
The fst package
A more recent option for very fast reading and writing of data files is the `fst` package. `fst` generates files in a binary format.
Use `write.fst(dat, "file.fst", compress = 0)`, where `compress` can go from 0 (no compression) to 100 (maximum compression). Data can be read back into R with `dat = read.fst("file.fst")`. Based on the timings listed at the package website, it's faster than `feather`, `data.table` and base R `readRDS`/`saveRDS`.
The package development site warns that the `fst` data format is still evolving and that `fst` should therefore not yet be used for long-term data storage.
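A minimal round trip based on the calls above (a sketch; assumes the `fst` package is installed and uses a made-up data.frame):

library(fst)
dat <- data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE))
write.fst(dat, "file.fst", compress = 0)   # 0 = no compression ... 100 = maximum
dat2 <- read.fst("file.fst")               # read back as a data.frame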

You can also try the readr package's `read_rds` (compare to `data.table::fread`) and `write_rds` (compare to `data.table::fwrite`).
Here is a simple example with my dataset (1133 rows and 429499 columns):
Write the dataset:
library(data.table)
library(readr)
fwrite(rankp2, file="rankp2_429499.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t")
write_rds(rankp2, "rankp2_429499.rds")
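Since the question is about write speed, the same two write calls can be timed directly with `system.time()` (a sketch using a small stand-in data.frame, since `rankp2` is the poster's own object and timings will differ):

library(data.table)
library(readr)
dat <- as.data.frame(matrix(runif(1000 * 1000), nrow = 1000))  # stand-in data
system.time(fwrite(dat, "dat.txt", sep = "\t"))
system.time(write_rds(dat, "dat.rds"))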
Read the dataset (1133 rows and 429499 columns):
system.time(fread("rankp2_429499.txt", sep="\t", header=TRUE, fill=TRUE))
#   user  system elapsed
# 42.391   0.526  42.949
system.time(read_rds("rankp2_429499.rds"))
#   user  system elapsed
#  2.157   0.388   2.547
Hope it helps.

I think you should use `fwrite()`. It is much faster, and it helped me a lot. Its full argument list looks like this:
fwrite(x, file = "", append = FALSE, quote = "auto",
sep = ",", sep2 = c("","|",""),
eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
na = "", dec = ".", row.names = FALSE, col.names = TRUE,
qmethod = c("double","escape"),
logical01 = getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
logicalAsInt = logical01, # deprecated
dateTimeAs = c("ISO","squash","epoch","write.csv"),
buffMB = 8L, nThread = getDTthreads(),
showProgress = interactive(),
verbose = getOption("datatable.verbose", FALSE))
https://jangorecki.gitlab.io/data.table/library/data.table/html/fwrite.html
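In practice you usually only need the first couple of arguments; the rest have sensible defaults (a sketch using the built-in mtcars data):

library(data.table)
fwrite(mtcars, "mtcars.csv")              # comma-separated, header row
fwrite(mtcars, "mtcars.tsv", sep = "\t")  # tab-separated variant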
