R's data.table package appears to operate by reference when I store a vector of column names:

> library(data.table)
> dt <- data.table(x=1,y=2)
> vars <- names(dt)
> vars
[1] "x" "y"
> dt[, z:=3]
> vars
[1] "x" "y" "z"

I did not expect the object vars to "update" like this and pick up column names that were added to dt later. If I use vars <- copy(names(dt)) instead, it does not update, so storing column names seems to work by reference, just as assigning a whole data.table does unless copy() is used.
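A minimal sketch of the two assignments side by side, assuming a `data.table` version that behaves as in the session above. The name `snapshot` is introduced here for illustration only:

```r
library(data.table)

dt <- data.table(x = 1, y = 2)
vars <- names(dt)            # may share memory with dt's names attribute
snapshot <- copy(names(dt))  # independent duplicate of the names vector

dt[, z := 3]                 # adds column z by reference

print(vars)      # in the session above this printed "x" "y" "z"
print(snapshot)  # keeps only the two original names
```

Because `copy()` duplicates the vector before `:=` runs, `snapshot` cannot be affected by later changes to `dt`.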

Other functions like nrow() do not "update" like this.

My question is: when do I need to use copy and when do I not? I originally thought it was only for copying whole data.tables, but this makes me wonder where else it will be needed.

dmp
  • But that update _by reference_ is exactly what a reference is! If you want a _distinct copy_ you use `copy()` to create an independent instance. – Dirk Eddelbuettel Mar 31 '19 at 22:09
  • Right, but it's unclear to me under which circumstances I should assume `data.table` is using references and under which circumstances I should expect it to make distinct copies of output from functions. – dmp Mar 31 '19 at 22:35
  • I think `data.table` is pretty clear about this: _always a reference_ (for performance reasons) _unless_ you explicitly opt to make `copy()`. – Dirk Eddelbuettel Mar 31 '19 at 22:36
  • I guess I'm hung up on the fact that _always_ doesn't appear to be true? Other base functions like `dim`, `nrow`, etc. do not operate by reference. `n = nrow(dt)`, (n is 1), `dt = rbind(dt,data.table(x=3,y=4))`, (n is still 1) – dmp Mar 31 '19 at 22:41
  • Maybe checking which are base R and which are from `data.table` (including possible replacement functions)? – Dirk Eddelbuettel Mar 31 '19 at 22:54
  • Unless I'm mistaken, the `names` function is from base, not data.table – dmp Mar 31 '19 at 23:10
  • Related info - https://stackoverflow.com/questions/15913417/why-does-data-table-update-namesdt-by-reference-even-if-i-assign-to-another-v - seems to suggest that `names` is unique in this regard. Also related to: https://stackoverflow.com/questions/18662715/colnames-being-dropped-in-data-table-in-r – thelatemail Mar 31 '19 at 23:32
  • @dmp: Well you can always read up on S3, dispatching and replacement functions. Here we have `data.table:::names<-.data.table`. It's tricky :) – Dirk Eddelbuettel Mar 31 '19 at 23:46
  • @DirkEddelbuettel - `data.table:::\`names<-.data.table\`` if trying to peek at the code in an R session. – thelatemail Mar 31 '19 at 23:48
  • Yes sorry typed while on the move... – Dirk Eddelbuettel Apr 01 '19 at 00:06
  • @dmp, I believe [the answer](https://stackoverflow.com/a/15913648/3358272) in @thelatemail's first link is spot-on to what you're asking about (specifically: *"`names1` is pointing to the same location as dt's column names pointer."*). This was completely new to me, so thank you for asking the question (I never knew!). I'm not going to mark it as duplicate (since it's possible/likely I'm missing something in the underlying need of your question), but please come back with whether this resolves your initial concern/question. – r2evans Apr 01 '19 at 03:59
  • Seems also relevant: [Understanding exactly when a data.table is a reference to (vs a copy of) another data.table](https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another) – markus Apr 01 '19 at 09:10
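
One way to probe the explanation quoted in the comments (that the stored vector points to the same location as the table's column names) is to compare memory addresses with `data.table::address()`. This is only a sketch: whether the first comparison returns TRUE may depend on the installed `data.table` and R versions, so no result is claimed for it. The name `vars_copy` is introduced here for illustration.

```r
library(data.table)

dt <- data.table(x = 1, y = 2)
vars <- names(dt)
vars_copy <- copy(names(dt))
n <- nrow(dt)   # nrow() returns a freshly computed integer

# TRUE here would mean vars and dt's names are the same object in memory
print(address(vars) == address(names(dt)))

# copy() allocates a new vector, so its address differs from that of vars
print(address(vars_copy) == address(vars))

# rbind() builds a new table; n was computed earlier and is not tied to dt
dt <- rbind(dt, data.table(x = 3, y = 4))
print(n)  # still 1
```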