In what situations do I need to use `copy` from R's data.table?

Question

R's data.table package shows its by-reference behavior when I store a vector of column names:

> library(data.table)
> dt <- data.table(x=1,y=2)
> vars <- names(dt)
> vars
[1] "x" "y"
> dt[, z:=3]
> vars
[1] "x" "y" "z"

I did not expect the object `vars` to "update" like this (to contain column names in `dt` that were created later). If I use `vars <- copy(names(dt))` it doesn't update, as if storing the column names operates by reference, much as assigning a whole data.table to a new variable does.
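A minimal sketch of the `copy(names(dt))` variant described above, assuming the data.table package is installed. The names `vars_alias` and `vars_copy` are illustrative and not part of the original session. Only the copied vector is certain to stay fixed; whether the plain assignment picks up the new column may depend on the data.table and R versions in use.

```r
library(data.table)

dt <- data.table(x = 1, y = 2)

vars_alias <- names(dt)        # plain assignment: may share memory with dt's names
vars_copy  <- copy(names(dt))  # explicit copy: an independent character vector

dt[, z := 3]                   # add a column by reference

print(vars_alias)              # in the session above, this printed "x" "y" "z"
print(vars_copy)               # stays "x" "y"
```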

Other functions like `nrow()` do not "update" like this.
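For contrast, a small sketch of the `nrow()` case, again assuming data.table is installed. The use of `ncol()` and the variable names `n` and `k` are additions for illustration only. Both functions return a freshly computed integer, so the stored values should not change when the table does.

```r
library(data.table)

dt <- data.table(x = 1, y = 2)

n <- nrow(dt)   # integer computed at call time
k <- ncol(dt)   # likewise a freshly computed value

dt[, z := 3]                                      # adds a column by reference
dt <- rbind(dt, data.table(x = 3, y = 4, z = 5))  # rbind builds a new table

print(c(n, k))  # still 1 2
print(dim(dt))  # now 2 3
```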

My question is: when do I need to use `copy`, and when do I not? I originally thought it was only for copying whole data.tables, but this makes me wonder where else it is needed.
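For reference, the whole-table case mentioned above can be sketched as follows, assuming data.table is installed. The names `dt_alias` and `dt_copy` are hypothetical. Assigning with `<-` does not copy the table, so a later `:=` through either name changes both, while `copy()` produces an independent table.

```r
library(data.table)

dt <- data.table(x = 1, y = 2)

dt_alias <- dt         # no copy: both names refer to the same table
dt_copy  <- copy(dt)   # deep copy: independent of dt

dt_alias[, z := 3]     # modifies the shared table in place

print(names(dt))       # "x" "y" "z"
print(names(dt_copy))  # "x" "y"
```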

But that update _by reference_ is exactly what a reference is! If you want a _distinct copy_ you use `copy()` to create an independent instance. — Dirk Eddelbuettel, Mar 31 '19 at 22:09
Right, but it's unclear to me under which circumstances I should assume `data.table` is using references and under which circumstances I should expect it to make distinct copies of output from functions. — dmp, Mar 31 '19 at 22:35
I think `data.table` is pretty clear about this: _always a reference_ (for performance reasons) _unless_ you explicitly opt to make `copy()`. — Dirk Eddelbuettel, Mar 31 '19 at 22:36
I guess I'm hung up on the fact that _always_ doesn't appear to be true? Other base functions like `dim`, `nrow`, etc. do not operate by reference. `n = nrow(dt)`, (n is 1), `dt = rbind(dt,data.table(x=3,y=4))`, (n is still 1) — dmp, Mar 31 '19 at 22:41
Maybe checking which are base R and which are from `data.table` (including possible replacement functions)? — Dirk Eddelbuettel, Mar 31 '19 at 22:54
Unless I'm mistaken, the `names` function is from base, not data.table — dmp, Mar 31 '19 at 23:10
Related info - https://stackoverflow.com/questions/15913417/why-does-data-table-update-namesdt-by-reference-even-if-i-assign-to-another-v - seems to suggest that `names` is unique in this regard. Also related to: https://stackoverflow.com/questions/18662715/colnames-being-dropped-in-data-table-in-r — thelatemail, Mar 31 '19 at 23:32
@dmp: Well you can always read up on S3, dispatching and replacement functions. Here we have `data.table:::names<-.data.table`. It's tricky :) — Dirk Eddelbuettel, Mar 31 '19 at 23:46
@DirkEddelbuettel - `data.table:::\`names<-.data.table\`` if trying to peek at the code in an R session. — thelatemail, Mar 31 '19 at 23:48
@dmp, I believe [the answer](https://stackoverflow.com/a/15913648/3358272) in @thelatemail's first link is spot-on to what you're asking about (specifically: *"`names1` is pointing to the same location as dt's column names pointer."*). This was completely new to me, so thank you for asking the question (I never knew!). I'm not going to mark it as duplicate (since it's possible/likely I'm missing something in the underlying need of your question), but please come back with whether this resolves your initial concern/question. — r2evans, Apr 01 '19 at 03:59
Seems also relevant: [Understanding exactly when a data.table is a reference to (vs a copy of) another data.table](https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another) — markus, Apr 01 '19 at 09:10


0 Answers