19

What I really like about data.table is its := idiom for modifying a table by reference, without the need for costly copies. From what I understand, this is one of the aspects that makes data.table so fast compared to other approaches.
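
For illustration, a minimal sketch of the idiom I mean (a toy table, not the benchmark below):

library(data.table)
DT <- data.table(x = 1:5, y = 6:10)
DT[, z := x + y]   # adds column z by reference; no copy of DT is made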

Now, I have started playing around with the dplyr package, which seems to be equally performant. But since results still have to be assigned with the <- operator, I was expecting a performance penalty at this level. However, there seems to be none.

As an example:

library(dplyr)
library(data.table)   # needed for as.data.table() and := below
library(Lahman)
library(microbenchmark)
library(ggplot2)

df <- Batting[ c("yearID", "teamID", "G_batting") ]

mb <- microbenchmark(
  dplyr = {
    tb <- tbl_df( df )
    tb <- tb %.%                          # %.% was dplyr 0.1.1's chaining operator (the predecessor of %>%)
      group_by( yearID, teamID ) %.%
      mutate( G_batting = max(G_batting) )
  },
  data.table = {
    dt <- as.data.table( df )
    dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
  },
  times = 500
)

qplot( data = mb, x = expr, y = time * 1E-6, geom = "boxplot", ylab="time [ms]", xlab = "approach" )

(Boxplot of the benchmark timings: dplyr vs data.table, time in ms)

I am just wondering: how is this possible? Or is there a conceptual mistake in the way I benchmark? Is my understanding of <- wrong?
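
For what it's worth, one way to probe the <- question is base R's tracemem(), which prints a message whenever an object is duplicated; in a quick sketch it only reports a copy when the new name is modified, not at the assignment itself:

x <- data.frame(a = 1:5)
tracemem(x)    # start reporting duplications of x
y <- x         # no message: <- only binds another name to the same object
y$a <- 0L      # message appears here: the copy happens on modification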

Beasterfield
  • Not to say that this is necessarily the case here, but using the `.Call` C API one can potentially alter any object in place. The API does not force the developer to return a new object, and all the data of the object passed in is available to the developer via C pointers to the `SEXP` structure. I've done it myself for fast in-place manipulation of image data (and no, it is not advisable!). – Oleg Sklyar Feb 11 '14 at 12:18
  • For whatever it's worth, for me `dplyr` was 50% slower than `data.table` in all iterations of your benchmark (100 runs of microbenchmark, though on an earlier run I got a worse result for the `max` case). This is on a Windows 64-bit machine. – BrodieG Feb 11 '14 at 13:29
  • @BrodieG interesting. For completeness, these benchmarks were carried out under R 3.0.2 on a Mac (x86_64-apple-darwin10.8.0, 64-bit), with `dplyr` version 0.1.1 (2014-02-09) and `data.table` 1.8.10. – Beasterfield Feb 11 '14 at 14:05
  • I'm also on 3.0.2 / 0.1.1 / 1.8.10. Data Table timings: `10.52806 10.91406 11.51819 11.91552 14.73834`, Dplyr: `15.69537 16.29676 16.71768 17.43426 24.86194` (min, lq, med, uq, max, milliseconds). – BrodieG Feb 11 '14 at 14:07
  • Dataset needs to be much bigger and replications much smaller; otherwise it's comparing overhead, not the task itself. – Matt Dowle Feb 11 '14 at 16:00

2 Answers

22

Nice question. In general, I'd benchmark on a data size that's big enough not to fit (almost) entirely in cache. Have a look here under "initial setup". It really isn't meaningful to compare tools developed for (in-memory) big data on tasks that run in milliseconds. We are planning to benchmark on relatively bigger data in the future.

Additionally, if your intent is to find out whether mutate performs a copy, then all you have to do is check the addresses before and after (this can be done using .Internal(inspect(.)) in base R or the changes() function in dplyr).
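
For example, a minimal sketch of the base R route (output abbreviated; the addresses will of course differ on your machine):

df <- data.frame(x = 1:5, y = 6:10)
.Internal(inspect(df))
# @7f8a1c0a2b48 19 VECSXP ...     <- the data.frame itself
#   @7f8a1c0a2c10 13 INTSXP ...   <- column x
#   @7f8a1c0a2c60 13 INTSXP ...   <- column y
# Run it again after an operation and compare the column addresses:
# identical addresses mean that column was not copied.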


On to whether a copy is being made or not:

There are two different things to check here: A) creating a new column, and B) modifying an existing column.

A) Creating a new column:

require(dplyr)
require(data.table)
df <- tbl_df(data.frame(x=1:5, y=6:10))

df2 <- mutate(df, z=1L)
changes(df, df2)
# Changed variables:
#           old new
# z             0x105ec36d0

It tells you that there are no changes in the addresses of x and y, and points out z, the column we just added. What's happening here?

dplyr shallow-copies the data.frame and then adds the new column. A shallow copy, as opposed to a deep copy, copies only the vector of column pointers, not the data itself, so it should be fast. Basically, df2 is created with 3 columns: the first two columns point to the same address locations as those of df, and the 3rd column is newly created.

On the other hand, data.table doesn't have to shallow copy at all, as it modifies the column by reference (in place). data.table also (cleverly) over-allocates the list of column pointers, which allows new columns to be added quickly by reference.
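
A quick sketch of that over-allocation (truelength() reports the number of column slots actually allocated; the exact number depends on the datatable.alloccol option):

require(data.table)
dt <- data.table(x = 1:5, y = 6:10)
length(dt)        # 2 columns in use
truelength(dt)    # e.g. 1026: spare slots reserved so new columns can be added by reference
dt[, z := x + y]  # the new column goes into a spare slot; x and y are untouched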

There shouldn't be a huge difference in the time to shallow copy unless you have a great many columns. Here's a small benchmark on 5000 columns with 1e4 rows:

require(data.table) # 1.8.11
require(dplyr)      # latest commit from github

dt <- as.data.table(lapply(1:5e3, function(x) sample(1e4)))
ans1 <- sapply(1:1e2, function(x) {
    dd <- copy(dt) # so as to create the new column each time
    system.time(set(dd, i=NULL, j="V1001", value=1L))['elapsed'] 
    # or, equivalently, dd[, V1001 := 1L]
})

df <- tbl_df(as.data.frame(dt))
ans2 <- sapply(1:1e2, function(x) {
    system.time(mutate(df, V1001 = 1L))['elapsed']
})
> summary(ans1) # data.table
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.00000 0.00100 0.00061 0.00100 0.00100
> summary(ans2) # dplyr
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.03800 0.03900 0.03900 0.04178 0.04100 0.07900

You can see the difference in the mean times here (0.00061s vs 0.04178s).

B) Modifying an existing column:

df2 <- mutate(df, y=1L)
changes(df, df2)
# Changed variables:
#           old         new
# y         0x105e5a850 0x105e590e0 

It tells you that y has been changed, i.e., a copy of column y has been made. A new memory location had to be created to change the values of y, because the old one was shared with df's y.

However, since data.table modifies in place, no copy is made in case (B): the data.table is modified directly. So you should see a performance difference when modifying existing columns.

This is one of the fundamental differences in philosophy between the two packages: dplyr avoids modifying in place and therefore trades off by copying when modifying existing columns.
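
To see the data.table side of (B), here's a rough sketch using the same address check (addresses are again illustrative):

require(data.table)
dt <- data.table(x = 1:5, y = 6:10)
.Internal(inspect(dt))   # note the address of column y
dt[, y := 1L]            # the value 1L is recycled into y's existing memory
.Internal(inspect(dt))   # y's address should be unchanged: no copy, dt is modified in place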

And because of this, it isn't possible to change the values of certain rows of a particular column of a data.frame without a deep copy. That is:

DT[x >= 5L, y := 1L] # y is an existing column

To my knowledge, this can't be done in base R or dplyr without copying the entire data.frame.
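
For comparison, a sketch of the nearest base R equivalent, where tracemem() reports that the data.frame gets duplicated:

DF <- data.frame(x = 1:10, y = 0L)
tracemem(DF)
DF$y[DF$x >= 5L] <- 1L   # tracemem prints at least one copy message here
# whereas DT[x >= 5L, y := 1L] above updates the rows in place, with no message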


Also, consider a 2-column dataset of size 20GB (10GB per column) on a machine with 32GB of RAM. The data.table philosophy is to provide a way to change a subset of those 10GB columns by reference, without copying even a single column once. A copy of one column would need an extra 10GB and could fail with an out-of-memory error, let alone be fast. This concept (:=) is analogous to UPDATE in SQL.

Matt Dowle
Arun
  • Great explanation, and I find this flexibility of being able to decide when to copy and when not to copy quite valuable, as I do very expensive operations of type B all the time (most frequently something like `DT1[DT2, newcolinDT1 := somecolfromDT2]`). – eddi Feb 11 '14 at 16:25
  • Thank you so much for the detailed explanation. I also see your point concerning `DT[x >= 5L, y := 1L]`. – Beasterfield Feb 11 '14 at 17:31
  • @matt-dowle the edit (the last paragraph) was not necessary; I had already got that from what was said above. I'm anyway wondering why all the discussions covering `data.table` and `dplyr` that I have read in the last few days were, IMHO, emotionally flavoured. I was not asking which package is better. – Beasterfield Feb 11 '14 at 17:32
  • @Beasterfield, thanks. IMHO, Matt was merely expanding on the point I missed when I mentioned that you'll be seeing performance differences when modifying existing columns - where it's actually making a copy (and therefore memory requirements). There'll always be benchmarks comparing packages and people defending tools they like because it goes with their philosophy / fits best with their work routine. Which posts gave you the impression that they are *emotionally flavoured*? – Arun Feb 11 '14 at 18:44
  • You can't currently modify subsets in place with dplyr, but it's something that we might add in the future. I need to see more examples of where it's really useful (i.e. can't be done another way). – hadley Feb 11 '14 at 19:24
  • @hadley What about imputation of missing values? I assume tbl_dt() doesn't allow changes in place either? – Vincent Feb 11 '14 at 22:03
  • @Vincent I usually think of imputation as a prediction problem that operates on whole columns – hadley Feb 12 '14 at 15:06
  • @hadley I agree it is a prediction problem. It may involve data from the same column and other columns. However, it should only change a few values in that column, at least if the data is not too crappy :) – Vincent Feb 12 '14 at 18:30
11

To understand what's going on, you need to understand exactly what is being copied. A data frame is basically just an array of pointers to its columns, so a shallow copy of a data frame is very cheap: you only have to copy those pointers, not the data they point to.

However, most base R functions do a deep copy. So when you do:

df <- data.frame(x = 1:10, y = 1:10)
transform(df, z = x + y)

not only does R copy the data frame, it actually copies each individual column. dplyr provides the changes() function to make this easier to see. For each column in the data frame, it displays the memory location where that column lives. If it's changed, then the complete column has been copied:

df2 <- transform(df, z = x + y)
changes(df, df2)
#> Changed variables:
#>           old            new           
#> x         0x7fb19adcd378 0x7fb19ab9bcb8
#> y         0x7fb19adcd3d0 0x7fb19ab9bd10
#> z                        0x7fb19ab9bd68
#> 
#> Changed attributes:
#>           old            new           
#> names     0x7fb19adcce98 0x7fb1944e4558
#> row.names 0x7fb19ab2bd10 0x7fb19ab2bf20
#> class     0x7fb19ad5d208 0x7fb19ab51b28

If we do the same thing in dplyr, the original columns aren't copied:

df3 <- dplyr::mutate(df, z = x + y)
changes(df, df3)
#> Changed variables:
#>           old new           
#> z             0x7fb19adcd060
#> 
#> Changed attributes:
#>           old            new           
#> names     0x7fb19adcce98 0x7fb1944e8b18
#> row.names 0x7fb19ab9c0d8 0x7fb19ab9c340
#> class     0x7fb19ad5d208 0x7fb19ad69408

This makes dplyr much faster than base R.

data.table is a little faster again, because it allows you to modify the data table in place: it doesn't even have to copy the pointers to the columns. I think not modifying in place makes dplyr a little easier to understand (because it preserves the usual R semantics), at the cost of being a little slower (but the cost grows with the number of columns, not the number of rows).

hadley
  • Thank you very much. But maybe one more question: why does something like `dt <- tbl_df( data.frame( a = 1:10, b = LETTERS[1:10] ) ); dt2 <- dt; dt2[1,1] <- 0; changes(dt2, dt)` then copy all columns of `dt`, although only one column was manipulated? – Beasterfield Feb 11 '14 at 17:28
  • @Beasterfield that's because you're using the base R `[` method. We might eventually implement that in C++, but it's not currently high priority. – hadley Feb 11 '14 at 19:17