It is indeed quite hard to find a clear answer to this question in the documentation.
What you seem to be experiencing is indeed the pass-by-reference behaviour of data.table, but it is not exactly what one might think. What we are seeing here is the behaviour of set*, := and [.data.table, and this is documented in ?copy (although in a way that might still be a little unclear). Basically (to my understanding) the data.table has a self-referencing pointer, and these functions all use this pointer to overwrite the existing data.table rather than creating a copy. A shallow copy is avoided using non-standard evaluation.
We can use a series of examples to see what is happening:
Example 1: Using := (or set) overwrites the original object.
library(data.table)
dt <- data.table(a = 1:3)
## Example 1:
### Add by reference. A shallow copy is taken by R internals,
### but the self-referencing pointer still points to the old object (the original table is overwritten)
test1 <- function(x){
  # Add column to existing dt by reference (similar to using `set`)
  x[, b := seq(.N)]
}
test1(dt)
dt
a b
1: 1 1
2: 2 2
3: 3 3
This is the same result as reported in the question. What seems to happen here is that the set method internally uses a pointer to the object, which in turn points to the original object. Note that here I use [.data.table with :=. The same result would have been obtained using set(x, j = 'b', value = seq(nrow(x))).
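As a minimal sketch of that last remark, the same function can be written with set() instead of :=. The names dt_set and test1_set are hypothetical and only used for illustration; set() itself is part of data.table.

```r
library(data.table)

dt_set <- data.table(a = 1:3)

# Hypothetical helper: same idea as test1, but using set() instead of `:=`
test1_set <- function(x) {
  # Adds column "b" to the table that x refers to, without copying it
  set(x, j = "b", value = seq_len(nrow(x)))
  invisible(NULL)
}

test1_set(dt_set)
names(dt_set)  # the caller's table now also has column "b"
```

Even though the function returns nothing useful, the table in the calling environment has gained the new column.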
From this object we can also see the self-referencing pointer residing within the attributes (note the pointer address for example 3):
attributes(dt)
$names
[1] "a" "b"
$row.names
[1] 1 2 3
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x0000017e19811ef0>
Example 2: Using [[<- uses the inherited method from data.frame and creates a copy
test2 <- function(x){
  x[['c']] <- seq(nrow(x))
  x
}
dt2 <- test2(dt)
dt
a b
1: 1 1
2: 2 2
3: 3 3
In this example we can see that, although I created a new column, it does not appear in the original table. The reason this method does not overwrite the existing object seems to be that no [[<-.data.table method (one that would use set) is defined within the data.table package. As such, the call falls back to [[<-.data.frame, which has no awareness of the self-referencing pointer in the table, so no new column is generated in the original table. Instead a copy is created, which has the same attributes as the original table, including the reference pointer.
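One way to check that a separate object really was created is to compare the two tables directly. This is only a sketch: dt_a, dt_b and add_c are hypothetical names, while address() is a helper exported by data.table.

```r
library(data.table)

dt_a <- data.table(a = 1:3)

# Hypothetical helper mirroring test2: base-style assignment with [[<-
add_c <- function(x) {
  x[["c"]] <- seq_len(nrow(x))
  x
}

dt_b <- add_c(dt_a)

names(dt_a)                     # the original still has only column "a"
names(dt_b)                     # the returned table has "a" and "c"
address(dt_a) == address(dt_b)  # FALSE: they are two distinct objects
```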
Example 3: adding new columns to the result of example 2
The behaviour we've seen in example 2 leads to some interesting behaviours. First we can confirm the pointer is identical:
attributes(dt2)
identical(attr(dt, '.internal.selfref'), attr(dt2, '.internal.selfref'))
But if we then try to add new information to dt2 we will get a warning:
dt2[, d := 1:3]
Warning message:
In `[.data.table`(dt2, , `:=`(d, 1:3)) :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
From this we can see that the developers of data.table were very much aware of this behaviour, or at least that it was a possible danger. So from this we can see that
- No, data.tables are in fact not passed by reference. The self-referencing pointer residing within the attributes is passed by reference, and is then used to overwrite the columns in the original table.
- This is likely intended behaviour and something the users of data.table should be aware of.
- If one wishes to use [.data.table with := or set within a function, one should either create a copy with dt <- copy(dt) or explicitly state within the function documentation that the input is overwritten by reference.
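The last point could look roughly like the sketch below. The function name add_b_safely and the table names are hypothetical; copy() is the data.table function mentioned above.

```r
library(data.table)

dt_orig <- data.table(a = 1:3)

# Hypothetical function: takes a deep copy first, so `:=` cannot
# touch the table in the calling environment
add_b_safely <- function(x) {
  x <- copy(x)           # work on a copy, not on the caller's table
  x[, b := seq_len(.N)]  # modifies only the local copy
  x[]                    # return the copy (trailing [] restores auto-printing)
}

dt_new <- add_b_safely(dt_orig)

names(dt_orig)  # still only "a"
names(dt_new)   # "a" and "b"
```

The trade-off is the cost of the copy, which is why documenting by-reference modification can be the better choice for large tables.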
For more information about how pointers are handled by R, I believe the manual Writing R Extensions and the Rcpp vignettes both describe their behaviour, although pointers in general are considered an advanced topic and can lead to unexpected behaviour.