
I've noticed that `cbind` takes considerably longer than `rbind` for `data.table`s. What is the reason for this?

> dt <- as.data.table(mtcars)
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- rbind(dt.new, dt)})
   user  system elapsed
  0.237   0.012   0.253
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- cbind(dt.new, dt)})
   user  system elapsed
 14.795   0.090  14.912

Where

timeit <- function(expr)
{
    ptm <- proc.time()
    expr
    proc.time() - ptm
}
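As an aside, `timeit` only works because R evaluates function arguments lazily: the expression passed as `expr` does not run at call time, but at the point it is first referenced inside the body, which lands it between the two `proc.time()` calls. A quick sanity check:

```r
timeit <- function(expr)
{
    ptm <- proc.time()
    expr                    # lazy evaluation: the passed expression runs here
    proc.time() - ptm
}

# Sys.sleep(0.2) is only executed inside timeit, so the measured
# elapsed time should be at least roughly 0.2 seconds.
t <- timeit(Sys.sleep(0.2))
t[["elapsed"]]
```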
andrew
  • I don't know the internals of data.table, but I guess adding new records (rows) is easier than restructuring the table with new variables (columns). – zx8754 Jun 03 '15 at 15:03
  • @zx8754 Yes, I agree, that's my suspicion too; curious what the specific bottleneck is. Maybe it's memory allocation, maybe greater memory requirements; hopefully someone familiar with the nitty-gritty of the package can shed light. – andrew Jun 03 '15 at 15:07
  • 3
    Unless I'm mistaken, calling `rbind` on a `data.table` will dispatch `rbind.data.table`, which calls the `data.table` function `rbindlist`- implemented in C, and very fast. See @Arun's answer [here](http://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind). On top of this, there are almost certainly fundamental differences between column-wise modifications and row-wise modifications (regarding how objects are stored in memory), so this isn't really an "apples-to-apples" comparison. Most likely this is why `data.table` implements the `:=` for modifying columns efficiently. – nrussell Jun 03 '15 at 15:12
  • Hmm, I don't see the dispatching you mentioned for `rbind`; it looks like in both cases it is just casting the `data.table` to `data.frame` and then performing the operation (further highlighting the niceness of `rbindlist`). – andrew Jun 03 '15 at 15:29
  • It seems that `data.table` has its own `cbind()`, but it is masked from the package: http://www.inside-r.org/packages/cran/data.table/docs/cbind – mdd Jun 03 '15 at 15:30
  • I'm just confused now: that page says `data.table` also provides an `rbind` for data.tables, but when I tab-complete `data.table:::`, I see no `cbind` or `rbind` functions. Even if they are masked, shouldn't they still exist under the `data.table:::` namespace? – andrew Jun 03 '15 at 15:36
  • 1
    @andrew because base `rbind` and `cbind` are not generic, `data.table` modifies those base functions on load. See the code for `rbind.data.frame` and `cbind.data.frame` to understand what's going on. – eddi Jun 03 '15 at 15:44
  • btw it's interesting that for `data.frame`'s `cbind` is faster (and this is what I would expect, so it probably comes down to `data.table::data.table` being really inefficient) – eddi Jun 03 '15 at 15:45
  • 1
    @MatthiasDiener That link seems to be out of date. The current version has no help page for `cbind.data.table`. – Frank Jun 03 '15 at 17:00
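To check eddi's observation above that for plain `data.frame`s `cbind` is the cheaper direction, both operations can be timed with base R only. (This is a rough sketch, not a benchmark; the relative timings will vary by machine, and `system.time` is coarse.)

```r
df <- mtcars        # a base data.frame, no data.table involved
n <- 200L

# time n repetitions of each bind; neither result is kept
t.rbind <- system.time(for (i in seq_len(n)) rbind(df, df))[["elapsed"]]
t.cbind <- system.time(for (i in seq_len(n)) cbind(df, df))[["elapsed"]]

# sanity checks on the shapes of the results
nrow(rbind(df, df)) == 2L * nrow(df)   # TRUE
ncol(cbind(df, df)) == 2L * ncol(df)   # TRUE
```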

1 Answer


Ultimately I think this comes down to `alloc.col` being slow, due to a loop where it removes various attributes from the columns. I'm not entirely sure why that's done; perhaps Arun or Matt can explain.

As you can see below, the basic operations behind `cbind` are much faster than `rbind`:

cbind.dt.simple = function(...) {
  # concatenate the columns of all inputs into a single list
  x = c(...)
  setattr(x, "class", c("data.table", "data.frame"))
  # over-allocate column slots (as alloc.col would), but without
  # alloc.col's slow attribute-stripping loop
  ans = .Call(data.table:::Calloccolwrapper, x, max(100L, ncol(x) + 64L), FALSE)
  .Call(data.table:::Csetnamed, ans, 0L)
}

library(microbenchmark)

microbenchmark(rbind(dt, dt), cbind(dt, dt), cbind.dt.simple(dt, dt))
#Unit: microseconds
#                    expr      min        lq      mean    median        uq       max neval
#           rbind(dt, dt)  785.318  996.5045 1665.1762 1234.4045 1520.3830 21327.426   100
#           cbind(dt, dt) 2350.275 3022.5685 3885.0014 3533.7595 4093.1975 21606.895   100
# cbind.dt.simple(dt, dt)   74.125  116.5290  168.5101  141.9055  180.3035  1903.526   100
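For completeness, and as nrussell's comment on the question suggests, the idiomatic fast path for adding columns in `data.table` is `:=`, which modifies the table by reference instead of building a new object the way `cbind` does. A minimal sketch, assuming the `data.table` package is installed:

```r
library(data.table)

dt <- as.data.table(mtcars)

# := adds the column in place; the existing columns are not copied
dt[, newcol := 5]

"newcol" %in% names(dt)   # TRUE
```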
eddi
  • 1
    I find that `cbind2 <- function(...) (setattr(do.call(c,list(...)),"class",c("data.table","data.frame")))` is even faster. I don't really follow what's happening with the `C*` functions, so maybe I'm missing something. `all.equal(cbind.dt.simple(dt, dt),cbind2(dt,dt)) # TRUE` – Frank Jun 03 '15 at 17:11
  • 1
    @Frank right, but that leaves some things incomplete, try e.g. `cbind2(dt, dt)[, newcol := 5]` – eddi Jun 03 '15 at 17:12
  • 2
    Ah, I see. My next idea `setDT(setattr(do.call(c,list(...)),"class",c("data.frame")))` was way slower. Oh, simply `setDT(do.call(c,list(...)))` works, but is the same (slow) speed. – Frank Jun 03 '15 at 17:14
  • 2
    yeah, I played around with `as.data.table` and `setDT` (as opposed to `data.table::data.table` that's used in `cbind`), and while they were a little faster, they would slow down significantly upon the `alloc.col` call – eddi Jun 03 '15 at 17:18