
Check this toy code:

> library(data.table)
> x <- data.table(a = 1:2)
> foo <- function(z) { z[, b:=3:4]  }
> y <- foo(x)
> x[]
   a b
1: 1 3
2: 2 4

It seems data.table is passed by reference. Is this intentional? Is this documented? I did read through the docs and couldn't find a mention of this behaviour.

I'm not asking about data.table's documented reference semantics (in :=, set* and some others). I'm asking whether a complete data.table object is supposed to be passed by reference as a function argument.


Edit: Following @Oliver's answer, here are some more curious examples.

> dt<- data.table(a=1:2)
> attr(dt, ".internal.selfref")
<pointer: 0x564776a93e88>
> address(dt)
[1] "0x5647bc0f6c50"
> 
> ff<-function(x) { x[, b:=3:4]; print(address(x)); print(attr(x, ".internal.selfref")) }
> ff(dt)
[1] "0x5647bc0f6c50"
<pointer: 0x564776a93e88>

So not only is .internal.selfref identical to the caller's dt copy, so is the address. It really is the same object. (I think).

This is not exactly the case for data.frames:

> df<- data.frame(a=1:2)
> address(df)
[1] "0x5647b39d21e8"
> ff<-function(x) { print(address(x)); x$b=3:4; print(address(x)) }
> 
> ff(df)
[1] "0x5647b39d21e8"
[1] "0x5647ae24de78"

Maybe the root issue is that regular data.table operations somehow do not trigger R's copy-on-modify semantics?

Ofek Shilon

2 Answers


I think what surprised you is actually base R behavior, which is why it's not specifically documented in data.table (maybe it should be anyway, as the implications are more important for data.table).

You were surprised that the object passed to a function had the same address, but this is the same for base R as well:

x = 1:10
address(x)
# [1] "0x7fb7d4b6c820"
(function(y) {print(address(y))})(x)
# [1] "0x7fb7d4b6c820"

What's being copied in the function environment is the pointer to x. Moreover, for base R, the parent x is immutable:

foo = function(y) {
  print(address(y))
  y[1L] = 2L
  print(address(y))
}
foo(x)
# [1] "0x7fb7d4b6c820"
# [1] "0x7fb7d4e11d28"

That is, as soon as we try to edit y, a copy is made. This is related to reference counting -- you can see some work by Luke Tierney on this, e.g. this presentation

The difference for data.table is that data.table enables edit permissions for the parent object -- a double-edged sword as I think you know.
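A minimal sketch of that contrast, assuming data.table is installed. The helper names base_edit and dt_edit are made up for illustration and are not part of any package:

```r
library(data.table)

# Hypothetical helpers, named for illustration only.
base_edit <- function(y) {
  y$b <- 3:4       # base R assignment: y is copied inside the function
  invisible(y)
}
dt_edit <- function(y) {
  y[, b := 3:4]    # data.table assignment: the column is added in place
  invisible(y)
}

df <- data.frame(a = 1:2)
base_edit(df)
names(df)   # the caller's data.frame still has only column a

dt <- data.table(a = 1:2)
dt_edit(dt)
names(dt)   # the caller's data.table has gained column b
```

Both functions receive the same kind of pointer to the caller's object; only the assignment operator differs.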

MichaelChirico
  • The `address` snippets are not what caused me to say data.tables are passed by reference, the change in their contents did. So essentially yes, "data.table operations somehow do not trigger R's copy-on-modify semantics". Thanks! – Ofek Shilon Jul 07 '20 at 11:41

It is quite hard to find a clear answer to this question in the documentation, indeed.

What you seem to be experiencing is indeed the pass-by-reference behaviour of data.table, but it is not exactly what one might think. Here it is the behaviour of set*, := and [.data.table that we're experiencing, and this is documented in ?copy (although in a way that might still be a little unclear). Basically (to my understanding) the data.table has a self-referencing pointer, and these functions all use this pointer to overwrite the existing data.table rather than creating a copy. A shallow copy is avoided using non-standard evaluation.

We can use a series of examples to see what is happening:

Example 1: Using set overwrites original object.

library(data.table)
dt <- data.table(a = 1:3)

## Example 1:
### Add by reference. A shallow copy is taken by R internals,
### but the self-referencing pointer still points to the old object (original table is overwritten)
test1 <- function(x){
  # Add column to existing dt by reference (similar to using `set`)
  x[, b := seq(.N)]
}
test1(dt)  
dt
   a b
1: 1 1
2: 2 2
3: 3 3

This is the same result as reported in the question. What seems to happen here is that the set method uses a pointer to the object internally, which in turn points to the original object.
Note that here I use [.data.table with :=. The same result would have been obtained using set(x, j = 'b', value = seq(nrow(x))).
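As a hedged sketch of the set() variant mentioned above (the names dt_set and test1_set are invented here for illustration):

```r
library(data.table)

dt_set <- data.table(a = 1:3)

# Hypothetical counterpart of test1 that uses set() instead of `:=`
test1_set <- function(x) {
  set(x, j = "b", value = seq(nrow(x)))  # adds column b by reference
  invisible(x)
}

test1_set(dt_set)
names(dt_set)   # the caller's table now also has column b
```

As with :=, nothing is assigned back in the calling scope, yet the caller's table changes.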

From this object we can also see the self-referencing pointer residing within the attributes (note the pointer address, which comes up again in example 3):

attributes(dt)
$names
[1] "a" "b"

$row.names
[1] 1 2 3

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x0000017e19811ef0>

Example 2: Using [[<- uses the inherited method and creates a copy

test2 <- function(x){
  x[['c']] <- seq(nrow(x))
  x
}
dt2 <- test2(dt)
dt   
   a b
1: 1 1
2: 2 2
3: 3 3

In this example we can see that, although I created a new column, it does not show up in the original table. The reason this method does not overwrite the existing object seems to be that the data.table package defines no [[<-.data.table method that would use set. As such, the call falls back to the inherited [[<- method (a data.table is also a data.frame, which in turn is a list), which has no awareness of the self-referencing pointer in the table, so no new column is generated in the original table. Instead a copy is created, which has the same attributes as the original table, including the reference pointer.
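One way to check that a separate object really is created is to compare addresses with data.table's address(). This is only a sketch; dt_orig, dt_new and add_col are names chosen for illustration:

```r
library(data.table)

dt_orig <- data.table(a = 1:3)

# Hypothetical helper: adds a column through the inherited `[[<-` method
add_col <- function(x) {
  x[["c"]] <- seq(nrow(x))
  x
}

dt_new <- add_col(dt_orig)

names(dt_orig)   # the original table is unchanged
names(dt_new)    # the returned table carries the new column
address(dt_orig) == address(dt_new)   # compares the two objects' addresses
```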

Example 3: adding new columns to the result of example 2

The behaviour we've seen in example 2 leads to some interesting behaviours. First we can confirm the pointer is identical:

attributes(dt2)
identical(attr(dt, '.internal.selfref'), attr(dt2, '.internal.selfref'))

But if we then try to add new information to dt2 we will get a warning:

dt2[, d := 1:3]

Warning message:
In `[.data.table`(dt2, , `:=`(d, 1:3)) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

From this we can see that the developers of data.table were very much aware of this behaviour, or at least that it was a possible danger. So we can conclude that

  1. No, data.tables are in fact not passed by reference. The self-referencing pointer residing within the attributes is passed by reference, and is then used to overwrite the columns in the original table.
  2. This is likely intended behaviour and something the users of data.table should be aware of.
  3. If one wishes to use [.data.table with := or set within a function, one should either create a copy with dt <- copy(dt) or explicitly state in the function documentation that the argument is overwritten by reference.
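The copy() approach from point 3 could look roughly like this. The function name add_b_safely is hypothetical:

```r
library(data.table)

# Hypothetical function: works on a private copy, so the caller's table is untouched
add_b_safely <- function(x) {
  x <- copy(x)         # deep copy, detached from the caller's object
  x[, b := seq(.N)]    # modifies only the local copy
  x[]
}

dt <- data.table(a = 1:3)
res <- add_b_safely(dt)

names(dt)    # still only column a
names(res)   # columns a and b
```

The trade-off is the memory and time cost of the full copy, which is exactly what := is designed to avoid, so for large tables documenting the by-reference side effect may be the better choice.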

For more information about how pointers are handled by R, I believe the manual Writing R Extensions and the Rcpp vignettes both describe their behaviour, although pointers in general are considered an advanced topic and can lead to unexpected behaviour.

Oliver
  • Thank you! Nice insights. I'm not entirely sure this is the bottom of it though - please see the edits to the question (too long for comment) – Ofek Shilon Jul 05 '20 at 16:01