Benchmarking data.frame (base), data.frame (package dataframe) and data.table

Question

With the recent introduction of the package dataframe, I thought it was time to properly benchmark the various data structures and to highlight what each is best at. I'm no expert on their respective strengths, so my question is: how should we go about benchmarking them?

Some (rather crude) things I have tried:

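library(microbenchmark)
library(data.table)
mat <- matrix(rnorm(10000), nrow = 100)
# Build the base data.frame before package dataframe is loaded
mat2df.base <- data.frame(mat)
library(dataframe)
# Same call again, now with package dataframe loaded
mat2df.dataframe <- data.frame(mat)
mat2dt <- data.table(mat)
# Time the transpose of each structure, 1000 runs per expression
bm <- microbenchmark(t(mat), t(mat2df.base), t(mat2df.dataframe), t(mat2dt), times = 1000)
bm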

Results:

Unit: microseconds
                 expr      min       lq   median       uq       max
1              t(mat)   20.927   23.210   31.201   36.908   951.591
2      t(mat2df.base)  929.903  974.039  997.439 1040.814 28270.717
3 t(mat2df.dataframe)  924.957  969.093  992.683 1025.404 27255.205
4           t(mat2dt) 1749.465 1817.382 1857.903 1909.649  5347.321

Take a look at the files in the `library\dataframe\doc` and `library\data.table\doc` directories, which give some idea of the benchmarks that the `dataframe` and `data.table` authors are interested in. Also, you might find [this](http://r.789695.n4.nabble.com/Fwd-Comments-on-data-table-td4630943.html) exchange between two of the authors to be of interest. — jthetzel, May 23 '12 at 14:44
@jthetzel Not sure why but that link isn't showing [my reply](http://r.789695.n4.nabble.com/Fwd-Comments-on-data-table-tp4630946p4631062.html). — Matt Dowle, May 23 '12 at 15:50
@RJ beware of `benchmark`'s and `microbenchmark`'s `times` argument. Issues detailed [here](http://www.talkstats.com/showthread.php/25761-Evidence-that-data.table-isn-t-always-fastest?p=84871&viewfull=1#post84871). — Matt Dowle, May 23 '12 at 15:53

Ari B. Friedman · Accepted Answer · 2012-06-14T15:50:00.293

I'm no data.table expert, but from what I understand its primary advantage is in indexing. So try subsetting with the various packages to compare speeds.

library(microbenchmark)
library(data.table)
mat <- matrix(rnorm(1e7), ncol = 10) 
key <- as.character(sample(1:10,1e6,replace=TRUE))
mat2df.base <- data.frame(mat)
mat2df.base$key <- key

bm.before <- microbenchmark( 
  mat2df.base[mat2df.base$key==2,] 
)

library(dataframe)
mat2df.dataframe <- data.frame(mat)
mat2df.dataframe$key <- key
mat2dt <- data.table(mat)
mat2dt$key <- key
setkey(mat2dt,key)


bm.subset <- microbenchmark( 
  mat2df.base[mat2df.base$key==2,], 
  mat2df.dataframe[mat2df.dataframe$key==2,],
  mat2dt["2",]
  )

                                           expr       min        lq    median        uq       max
1           mat2df.base[mat2df.base$key == 2, ] 153.99596 154.98602 155.91621  157.0894 194.24456
2 mat2df.dataframe[mat2df.dataframe$key == 2, ] 153.63907 154.66295 155.68553  156.9827 173.76913
3                                 mat2dt["2", ]  15.51085  15.66742  15.72899   15.8463  22.53044

With a sufficiently large matrix, data.table wipes the floor with the other options: the keyed lookup is roughly ten times faster than either data.frame subset.
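
Before reading too much into the timings, it may be worth confirming that the expressions being compared actually select the same rows. The following is a minimal sketch on a small data set, assuming data.table is installed. The names `grp`, `df` and `dt` are illustrative and are not taken from the code above.

```r
library(data.table)

set.seed(42)
n <- 1000
df <- data.frame(matrix(rnorm(n * 3), ncol = 3))  # columns X1, X2, X3
df$grp <- as.character(sample(1:10, n, replace = TRUE))

# Same data as a data.table, keyed on the grouping column
dt <- as.data.table(df)
setkey(dt, grp)

# Vector scan on the data.frame vs. keyed lookup on the data.table
df_rows <- df[df$grp == "2", ]
dt_rows <- dt["2"]

# Compare row counts and the values of one column
same_rows <- nrow(df_rows) == nrow(dt_rows) &&
  isTRUE(all.equal(df_rows$X1, dt_rows$X1))
same_rows
```

If `same_rows` is not `TRUE`, the benchmark is comparing different operations and the timings say little.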

Also, I suspect that @RJ-'s attempt to compare base data.frame with the package dataframe's data.frames is not working. The timings are just too similar, and I suspect both results come from the loaded library rather than from base.

Edit: Tested. It doesn't seem to make much of a difference. bm.after is the same code as bm.subset above, just run in the same session as bm.before to provide an accurate comparison.

bm.before <- microbenchmark( 
  mat2df.base[mat2df.base$key==2,] 
)

> bm.after
Unit: milliseconds
                                           expr       min        lq    median        uq       max
1           mat2df.base[mat2df.base$key == 2, ] 160.62708 166.25787 167.52325 169.18710 173.47864
2 mat2df.dataframe[mat2df.dataframe$key == 2, ] 163.30259 166.00588 167.80138 169.24647 174.05713
3                                 mat2dt["2", ]  16.16117  16.89627  17.09047  17.37057  62.01954

> bm.before
Unit: milliseconds
                                 expr     min       lq   median       uq      max
1 mat2df.base[mat2df.base$key == 2, ] 159.178 160.9867 162.1149 164.0046 195.9501
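
To see more directly whether something other than base is being picked up, one option is to ask R where the visible `data.frame` constructor and the `[` method for data.frames are defined. This is only a sketch using base and utils functions. How package dataframe installs its replacements is an assumption that is not verified here, so this check may not reveal a difference even when one exists.

```r
# Namespace of the data.frame() constructor found on the search path
constructor_ns <- environmentName(environment(data.frame))

# Namespace of the `[` method dispatched for data.frame objects
subset_ns <- environmentName(environment(getS3method("[", "data.frame")))

# Every attached package or environment that defines a visible data.frame
where_defined <- find("data.frame")

constructor_ns
subset_ns
where_defined

# Calling the base constructor explicitly sidesteps masking on the search path
mat <- matrix(rnorm(20), ncol = 4)
df_base <- base::data.frame(mat)
```

Running the first three lines once before and once after `library(dataframe)` would show whether the reported namespaces change.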

`data.table` is intended to be used with **large** data sets, with many rows, and that's where it really shines. Try changing lines 3 and 4 above to: `mat <- matrix(rnorm(1e7), ncol = 10); key <- as.character(sample(1:10,1e6,replace=TRUE))`, and watch `data.table` really start to pull away. — Josh O'Brien, May 23 '12 at 15:48
@gsk3 Beware of `benchmark(...,times=100)`. See links in comments to question. — Matt Dowle, May 23 '12 at 15:56
