r data.table to copy or not to copy

Question

I create a master data table, extract smaller tables from it, and then combine them into a new table. The process goes like this:

Create a master table from some other data. Call it dt.master

Create a copy of it and do some edits. Example script is

dt.1 <- copy(dt.master)
dt.1 <- dt.1[v1 %in% "cat1",]

Create other versions of dt.1 that are edited a bit. Here's the code where the mistake enters:
```
dt.2 <- dt.3  <- dt.1
```

Edit each of the new versions as follows

dt.2[, v1 := "dt.2"]
unique(dt.2$v1)
dt.3[, v1 := "dt.3"]
unique(dt.2$v1)

I know (and eventually remember) that dt.3 <- dt.1 doesn't create a new version of dt.1. But the first unique(dt.2$v1) in the code above returns "dt.2", while the second one, after dt.3 has been edited, returns "dt.3". I put my solution to this bad coding in the answer, but would also be interested in knowing why unique(dt.2$v1) returns a different answer the second time. Here is some example code that demonstrates this

dt.master <- data.table(v1 = c("cat1", "cat1", "cat2", "cat","cat2" ), v2 = c(1,2,3,4,5))
dt.1 <- copy(dt.master)
dt.1 <- dt.1[v1 %in% "cat1",]
dt.2 <- dt.3  <- dt.1
dt.2[, v1 := "xxx"]
unique(dt.2$v1)
dt.3[, v1 := "yyy"]
unique(dt.3$v1)
print(dt.2)

v1 in dt.2 is supposed to be xxx, but print(dt.2) shows yyy.

This looks unnecessarily convoluted -- filling columns with values that are also names of tables. Also, this looks like a case where you shouldn't have difficulty providing a concrete example. — Frank, Oct 07 '18 at 18:02
The choice of the table name is just a filler; not sure why I chose that. I'll try to provide a concrete example in a bit. — JerryN, Oct 07 '18 at 18:12
I suspect that this may be relevant: [Understanding exactly when a data.table is a reference to (vs a copy of) another data.table](https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another) — Henrik, Oct 07 '18 at 18:50
I had tried to read that before I constructed my example. Much of it is beyond me, but I don't see where it explains the different results for the unique ... code and the print ... code. — JerryN, Oct 07 '18 at 20:16
... and I wrote the Q and A so that someone else who sees the same strange behavior will at least have a solution for it. Is that answer not worth an up vote? — JerryN, Oct 07 '18 at 20:17
Thanks for adding the example. To me it seems like a dupe of Henrik's link. All three table names are aliases for one table. Btw, I guess you've heard it before, but there's no reason to do `DT <- DT[, x := expr]` since the point of `:=` is to modify DT in-place. — Frank, Oct 07 '18 at 22:39
Re DT <- DT[, x := expr] is redundant. Argh. Yes. I love this element of data.table and use DT[, x := expr] everywhere. Except it seems in constructing examples. I'll edit the example. — JerryN, Oct 07 '18 at 22:42

JerryN · Answer 1 · 2018-10-07T23:07:19.530

To put the various explanations in one place and in language that makes sense to me...

dt.2 <- dt.3 <- dt.1 leaves three names (dt.1, dt.2 and dt.3) pointing to a single region of memory

dt.1[, v1 := "xxx"] sets part of that region to xxx

print(unique(dt.1$v1)) displays that region and shows xxx

dt.2[, v1 := "yyy"] changes the same part of that region to yyy

print(unique(dt.2$v1)) displays the same region and now shows yyy

Now that the information in that region of memory has been changed, print(unique(dt.1$v1)) displays yyy instead of xxx
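The sequence above can be checked directly. This is only a minimal sketch: it assumes the data.table package is installed, the two-row table is made up for illustration, and the variable names same.object, after.first and after.second are mine. address() is data.table's helper that reports where an object lives in memory.

```r
library(data.table)

# Made-up stand-in for the filtered table in the question
dt.1 <- data.table(v1 = c("cat1", "cat1"), v2 = c(1, 2))

# Plain assignment binds new names to the same object; nothing is copied
dt.2 <- dt.3 <- dt.1

# All three names report the same memory address
same.object <- address(dt.1) == address(dt.2) &&
  address(dt.2) == address(dt.3)

# := modifies the shared table in place, so dt.1 sees the change too
dt.1[, v1 := "xxx"]
after.first <- unique(dt.1$v1)   # "xxx"

# A second := through another name overwrites the same column
dt.2[, v1 := "yyy"]
after.second <- unique(dt.1$v1)  # "yyy", even though dt.1 was not named here
```

If same.object is TRUE, the three names are aliases, and any := through one of them shows up in the others.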

A solution I adopted is to replace

dt.2 <- dt.3  <- dt.1

with

dt.2 <- copy(dt.1)
dt.3 <- copy(dt.1)
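A quick check that the copies really are independent. Again this is a sketch on a made-up two-row table, assuming data.table is installed; the variables v.1, v.2, v.3 and distinct are only there to hold the results.

```r
library(data.table)

# Made-up stand-in for the filtered table in the question
dt.1 <- data.table(v1 = c("cat1", "cat1"), v2 = c(1, 2))

# copy() allocates a new table each time, so the names no longer alias
dt.2 <- copy(dt.1)
dt.3 <- copy(dt.1)

# Each := now touches only its own table
dt.2[, v1 := "xxx"]
dt.3[, v1 := "yyy"]

v.1 <- unique(dt.1$v1)  # "cat1": the original is untouched
v.2 <- unique(dt.2$v1)  # "xxx"
v.3 <- unique(dt.3$v1)  # "yyy"

# The copies sit at different addresses from the original
distinct <- address(dt.1) != address(dt.2) &&
  address(dt.1) != address(dt.3)
```

copy() costs memory for each extra table, so it is only worth using where the tables actually need to diverge.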
