I have a data.frame and I want to write it out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to `write.csv`?

- Either get a faster hard drive or, if your data can be converted to a matrix, use `write`. – Joshua Ulrich May 08 '12 at 20:05
- But when I first import it into R with `read.table`, it automatically gives a data.frame, so do I need to use `as.matrix` after I finish my calculation? – lolibility May 08 '12 at 20:16
- Do you need to write it out as a CSV, or could you simply save it as an RData object or other compressed form? – Chase May 08 '12 at 20:28
- I want the output file to look like a matrix, with separated columns and rows. – lolibility May 08 '12 at 20:47
- @lolibility - I guess my question is more around why you need it to look like a matrix. Are you going to be opening this in another program or feeding it into something else? Or do you just need to save it so you can pull it up in R at a later date? As I show below, native R objects are faster to save and take up less space. For the example below, the CSV file takes ~275MB compared to ~80MB for the RData object. – Chase May 08 '12 at 21:30
- Might be related: http://stackoverflow.com/questions/4756989/how-to-load-data-quickly-into-r – daroczig May 08 '12 at 22:18
- @daroczig - interesting, thanks for pointing me in that direction. I've got an 8M row x 30 column dataset I've been wrangling with, and being able to grab a subset of it at times would be uber helpful. – Chase May 09 '12 at 02:48
- @Chase: the saves pkg would be handy if you do not need all columns from your big data at a time, e.g. when you need only 2-3 vars from the dataset for an analysis. – daroczig May 09 '12 at 09:58
6 Answers
`data.table::fwrite()` was contributed by Otto Seiskari and is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelisation) and wrote an article about it. Please report any issues on the tracker.
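For reference, a minimal call looks like this (a sketch; the data and file name here are made up):

library(data.table)                          # fwrite() requires data.table >= 1.9.8
DT <- data.table(x = 1:5, y = letters[1:5])  # toy table
fwrite(DT, "out.csv")                        # multi-threaded CSV writer, drop-in for write.csv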
First, here's a comparison on the same dimensions used by @Chase above (i.e., a very large number of columns: 65,000 columns (!) x 256 rows), together with `fwrite` and `write_feather`, so that we have some consistency across machines. Note the huge difference `compress=FALSE` makes in base R.
# -----------------------------------------------------------------------------
# function | object type | output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save | matrix | binary | FALSE | 0.3s | 134MB |
# save | data.frame | binary | FALSE | 0.4s | 135MB |
# feather | data.frame | binary | FALSE | 0.4s | 139MB |
# fwrite | data.table | csv | FALSE | 1.0s | 302MB |
# save | matrix | binary | TRUE | 17.9s | 89MB |
# save | data.frame | binary | TRUE | 18.1s | 89MB |
# write.csv | matrix | csv | FALSE | 21.7s | 302MB |
# write.csv | data.frame | csv | FALSE | 121.3s | 302MB |
Note that `fwrite()` runs in parallel. The timing shown here is on a 13" MacBook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), 512GB SSD, 256KB/core L2 cache and 4MB L4 cache. Depending on your system spec, YMMV.
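If you want to check or limit how many threads are used, data.table exposes `getDTthreads()` and `setDTthreads()`; `fwrite`'s `nThread` argument defaults to `getDTthreads()`. A short sketch:

library(data.table)
getDTthreads()                       # threads data.table will use by default on this machine
setDTthreads(2)                      # e.g. restrict to 2 threads on a shared machine
DT <- data.table(x = runif(1e6))
fwrite(DT, "out.csv", nThread = getDTthreads())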
I also reran the benchmarks on relatively more likely (and bigger) data:
library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
  str1 = sample(sprintf("%010d",1:NN)), # ID field 1
  str2 = sample(sprintf("%09d",1:NN)),  # ID field 2
  # varying length string field--think names/addresses, etc.
  str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
  # factor-like string field with 50 "levels"
  str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
  # factor-like string field with 17 levels, varying length
  str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T),
                                    collapse="")),NN,T),
  # lognormally distributed numeric
  num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
  # 3 binary strings
  str6 = sample(c("Y","N"),NN,T),
  str7 = sample(c("M","F"),NN,T),
  str8 = sample(c("B","W"),NN,T),
  # right-skewed (integer type)
  int1 = as.integer(ceiling(rexp(NN))),
  num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
  # lognormal numeric that can be positive or negative
  num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))
# -------------------------------------------------------------------------------
# function | object | out | other args | Runtime | File size |
# -------------------------------------------------------------------------------
# fwrite | data.table | csv | quote = FALSE | 1.7s | 523.2MB |
# fwrite | data.frame | csv | quote = FALSE | 1.7s | 523.2MB |
# feather | data.frame | bin | no compression | 3.3s | 635.3MB |
# save | data.frame | bin | compress = FALSE | 12.0s | 795.3MB |
# write.csv | data.frame | csv | row.names = FALSE | 28.7s | 493.7MB |
# save | data.frame | bin | compress = TRUE | 48.1s | 190.3MB |
# -------------------------------------------------------------------------------
So `fwrite` is ~2x faster than `feather` in this test. This was run on the same machine as noted above, with `fwrite` running in parallel on 2 cores. `feather` seems to be quite a fast binary format as well, but it has no compression yet.
Here's an attempt at showing how `fwrite` compares with respect to scale (see the plot produced by the reproduction code below). NB: the benchmark has been updated by running base R's `save()` with `compress = FALSE` (since feather is also not compressed).
So `fwrite` is the fastest of all of them on this data (running on 2 cores), plus it creates a `.csv`, which can easily be viewed, inspected and passed to `grep`, `sed` etc.
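For example, because the output is plain text, you can pre-filter it with a shell tool and read only the matching rows back into R. A sketch (assumes `grep` is on the PATH, uses a made-up pattern, and relies on `fread(cmd = ...)` being available in newer data.table releases):

library(data.table)
# read back only the lines of the CSV written above that contain the pattern
subset_dt <- fread(cmd = "grep '0000012345' fwrite_turbo.csv", header = FALSE)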
Code for reproduction:
require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))

DTn <- function(nn)
  data.table(
    str1 = sample(sprintf("%010d",1:nn)),
    str2 = sample(sprintf("%09d",1:nn)),
    str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
    str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
    str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
    num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
    str6 = sample(c("Y","N"),nn,T),
    str7 = sample(c("M","F"),nn,T),
    str8 = sample(c("B","W"),nn,T),
    int1 = as.integer(ceiling(rexp(nn))),
    num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
    num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

count <- data.table(n = ns,
                    c = c(rep(1000, 12),
                          rep(100, 6),
                          rep(10, 7)))

mbs <- lapply(ns, function(nn){
  print(nn)
  set.seed(51423)
  DT <- DTn(nn)
  microbenchmark(times = count[n==nn,c],
    write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
    save=save(DT, file = "save.RData", compress=FALSE),
    fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
    feather=write_feather(DT, "feather.feather"))})

png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
          y <- summary(x)[,"median"]
          y/y[3]})),
        main = "Relative Speed of fwrite (turbo) vs. rest",
        xlab = "", ylab = "Time Relative to fwrite (turbo)",
        type = "l", lty = 1, lwd = 2,
        col = c("red", "blue", "black", "magenta"), xaxt = "n",
        ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3,
       legend = c("write.csv", "save", "feather"),
       col = c("red", "blue", "magenta"))
dev.off()

- What about `readr::write_csv`? It would be nice to add it to the benchmarks. – Dmitriy Selivanov Apr 07 '16 at 22:29
- @DmitriySelivanov on a quick test run, `write_csv` was slower than `write.csv`... – MichaelChirico Apr 07 '16 at 23:23
- `write_csv` is super slow here because it uses a poor strategy for turning doubles into strings, see https://github.com/hadley/readr/issues/387. – hadley Apr 08 '16 at 12:18
- I also think that `save()` writes/reads columns of class `Date` correctly whereas `fwrite()` and `feather()` currently don't. So a fair comparison would be against `double`, `char` and `integer` types alone... at this point. – Arun Apr 14 '16 at 14:01
- The performance comparison could be extended with extra procedures (https://rpubs.com/demydd/170957). Tests are done on a theoretical sample and retested on actual data. – Dimon D. May 17 '16 at 15:12
If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speed up. Also, you can look into using `write.matrix()` from package `MASS`, though it did not prove faster for this example. Maybe I didn't set something up properly:
#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#As a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
# user system elapsed
# 319.53 13.65 333.76
#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
# user system elapsed
# 52.43 0.88 53.59
#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
# user system elapsed
# 113.58 59.12 172.75
EDIT
To address the concern raised below that the results above are unfair to the data.frame, here are some more results and timings to show that the overall message is still "convert your data object to a matrix if possible. If not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":
#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
# user system elapsed
# 317.85 13.95 332.44
#This even includes the overhead in converting to as.matrix in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
# user system elapsed
# 53.67 0.92 54.67
So, nothing really changes. To confirm this is reasonable, consider the relative time cost of `as.data.frame()`:
m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
# user system elapsed
# 0.77 0.00 0.77
So, it's not really a big deal, nor does it skew the information as much as the comment below suggests. If you're still not convinced that using `write.csv()` on large data.frames is a bad idea performance-wise, consult the manual under `Note`:
"write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead."
Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:
system.time(save(m2, file = "thisisfast.RData"))
# user system elapsed
# 21.67 0.12 21.81

- That's a bit of an unfair comparison... the `as.data.frame` takes considerable time. Furthermore, the data the OP has are already in a data.frame. – John May 08 '12 at 20:48
- @John - good points, though the relative overhead of `as.data.frame` is negligible compared to the overhead of using `write.csv()` and friends on a `data.frame` vis-a-vis a `matrix`. – Chase May 08 '12 at 21:25
- I know it's less, but it's better to have the answer that will probably be accepted not leave that question open for the naive reader. – John May 08 '12 at 21:26
- @John - yes, I agree completely. Thanks for the nudge in the right direction. I was honestly just being sloppy but wanted to give more than the `RTFM` response. And also, the overhead of `as.data.frame()` will increase with smaller data objects... – Chase May 08 '12 at 21:33
- In the final `system.time(save(...))` it's MUCH faster adding `compress=FALSE`: 14 seconds vs 0.2 seconds on my machine. – Matt Dowle Apr 12 '16 at 06:41
Another option is to use the feather file format.
df <- as.data.frame(matrix(runif(256*65536), nrow = 256))
system.time(feather::write_feather(df, "df.feather"))
#> user system elapsed
#> 0.237 0.355 0.617
Feather is a binary file format designed to be very efficient to read and write. It's designed to work with multiple languages: there are currently R and Python clients, and a Julia client is in the works.
For comparison, here's how long `saveRDS` takes:
system.time(saveRDS(df, "df.rds"))
#> user system elapsed
#> 17.363 0.307 17.856
Now, this is a somewhat unfair comparison because the default for `saveRDS` is to compress the data, and here the data is incompressible because it's completely random. Turning compression off makes `saveRDS` significantly faster:
system.time(saveRDS(df, "df.rds", compress = FALSE))
#> user system elapsed
#> 0.181 0.247 0.473
And indeed it's now slightly faster than feather. So why use feather? Well, it's typically faster than `readRDS()`, and you usually write the data relatively few times compared to the number of times that you read it.
system.time(readRDS("df.rds"))
#> user system elapsed
#> 0.198 0.090 0.287
system.time(feather::read_feather("df.feather"))
#> user system elapsed
#> 0.125 0.060 0.185

- See https://gist.github.com/markdanese/28b9f5412df55efceba754fee2363444 for a gist for anyone who wants to test it out. FWIW, fwrite is fast for a CSV but not in the same league as feather. – Mark Danese Apr 07 '16 at 18:38
- `feather` is great, but not relevant to the original question, because it is a binary format... – Dmitriy Selivanov Apr 07 '16 at 22:31
- @DmitriySelivanov I just re-read the original question and I don't see where it requests a plain text format. – hadley Apr 08 '16 at 12:17
The fst package
A more recent option for very fast reading and writing of data files is the `fst` package. `fst` generates files in a binary format.
Use `write.fst(dat, "file.fst", compress = 0)`, where `compress` can go from 0 (no compression) to 100 (maximum compression). Data can be read back into R with `dat = read.fst("file.fst")`. Based on the timings listed at the package website, it's faster than `feather`, `data.table` and base R `readRDS`/`saveRDS`.
The package development site warns that the `fst` data format is still evolving and that `fst` should therefore not yet be used for long-term data storage.
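A minimal round trip based on the calls above (a sketch; assumes the `fst` package is installed and uses a made-up data.frame):

library(fst)
dat <- data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE))
write.fst(dat, "file.fst", compress = 0)   # 0 = no compression ... 100 = maximum
dat2 <- read.fst("file.fst")               # read back as a data.frame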

You can also try the readr package's `read_rds` (compare to `data.table::fread`) and `write_rds` (compare to `data.table::fwrite`).
Here is a simple example with my dataset (1133 rows and 429499 columns):
Write the dataset:
library(data.table)
library(readr)
fwrite(rankp2, file="rankp2_429499.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t")
write_rds(rankp2, "rankp2_429499.rds")
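Since the question is about write speed, the same two write calls can be timed directly with `system.time()` (a sketch using a small stand-in data.frame, since `rankp2` is the poster's own object and timings will differ):

library(data.table)
library(readr)
dat <- as.data.frame(matrix(runif(1000 * 1000), nrow = 1000))  # stand-in data
system.time(fwrite(dat, "dat.txt", sep = "\t"))
system.time(write_rds(dat, "dat.rds"))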
Read the dataset (1133 rows and 429499 columns):
system.time(fread("rankp2_429499.txt", sep="\t", header=TRUE, fill=TRUE))
#   user  system elapsed
# 42.391   0.526  42.949
system.time(read_rds("rankp2_429499.rds"))
#   user  system elapsed
#  2.157   0.388   2.547
Hope it helps.

I think you should use `fwrite()`. It is much faster, and it helped me a lot. Its full argument list looks like this:
fwrite(x, file = "", append = FALSE, quote = "auto",
sep = ",", sep2 = c("","|",""),
eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
na = "", dec = ".", row.names = FALSE, col.names = TRUE,
qmethod = c("double","escape"),
logical01 = getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
logicalAsInt = logical01, # deprecated
dateTimeAs = c("ISO","squash","epoch","write.csv"),
buffMB = 8L, nThread = getDTthreads(),
showProgress = interactive(),
verbose = getOption("datatable.verbose", FALSE))
https://jangorecki.gitlab.io/data.table/library/data.table/html/fwrite.html
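In practice you usually only need the first couple of arguments; the rest have sensible defaults (a sketch using the built-in mtcars data):

library(data.table)
fwrite(mtcars, "mtcars.csv")              # comma-separated, header row
fwrite(mtcars, "mtcars.tsv", sep = "\t")  # tab-separated variant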
