understanding the reference properties of data.table in R

Question

Just to clear some things up for myself, I would like to better understand when copies are made in data.table and when they are not. As the question "Understanding exactly when a data.table is a reference to (vs a copy of) another data.table" points out, if one simply runs the following, the original is modified as well:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12

However, if one does the following (for example), only the new table is modified:

DT = data.table(a=1:10)
DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT = DT[a<11]
newDT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT[1:5,a:=0L]

newDT
     a
 1:  0
 2:  0
 3:  0
 4:  0
 5:  0
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

As I understand it, this happens because when you evaluate an i expression, data.table returns a whole new table rather than a reference to the memory occupied by the selected elements of the old data.table. Is this correct?

EDIT: sorry i meant i not j (changed this above)

Even `newDT <- DT[x < 11]` would have created a copy. Do `newDT[, b := 5]` after creating `newDT` by subsetting. `tracemem` and `.Internal(inspect(.))` are informative tools for understanding this. — Arun, Apr 08 '13 at 22:39
@Arun: i'm sorry i'm not sure i understand your point.. could you please explain what you are referring to? do you mean to say that the first example would work the same as the second? in that case yes - that's true. i just wanted a separate example to make things clear. — Alex, Apr 08 '13 at 22:42
sure, can you explain which `j` statement you're referring to here: `As I understand it, the reason this happens is because when you execute a j statement`, just to be sure. I'll write an answer with what I talked about then. — Arun, Apr 08 '13 at 22:46
First line of mnel's answer is basically what I want to clear with my previous question, forget it now. — Arun, Apr 08 '13 at 22:57

mnel · Accepted Answer · 2013-04-08T23:22:05.797

When you create newDT in the second example, you are evaluating i (not j). := assigns by reference within the j argument. There is no equivalent for i, because a data.table over-allocates its column pointer slots, but not its rows.

A data.table is a list. Its length equals the number of columns, but it is over-allocated so that you can add more columns without copying the entire table (e.g. using := in j).
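A minimal sketch of that over-allocation, assuming a reasonably recent data.table where `truelength()` and `address()` are exported. The exact number of spare slots depends on the `datatable.alloccol` option, so no particular value is assumed here:

```r
library(data.table)

DT <- data.table(a = 1:10)

# length() counts the columns in use; truelength() counts allocated column slots.
n_cols  <- length(DT)
n_slots <- truelength(DT)   # exact value depends on the datatable.alloccol option

addr_before <- address(DT)
DT[, b := 5L]               # add a column by reference in j
addr_after  <- address(DT)

# The list of column pointers was not reallocated: same address, one more slot used.
same_object <- identical(addr_before, addr_after)
print(c(columns = length(DT), slots = truelength(DT)))
print(same_object)
```

If the spare slots ever ran out, data.table would have to reallocate the list of column pointers, which is why it over-allocates up front.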

If we inspect the data.table, we can see the truelength (tl = 100), that is, the number of column pointer slots:

 .Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
  @b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...

Within the data.table each element has length 10, and tl=0. Currently there is no method to increase the truelength of the columns to allow appending extra rows by reference.

From ?truelength

Currently, it's just the list vector of column pointers that is over-allocated (i.e. truelength(DT)), not the column vectors themselves, which would in future allow fast row insert()
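As a rough illustration of the row side of this, here is a sketch that appends a row with `rbind()`. It only shows that the result is a new object and that the original keeps its rows; it is not a statement about how data.table implements `rbind()` internally:

```r
library(data.table)

DT <- data.table(a = 1:10)
addr_DT <- address(DT)

# rbind() builds a new table; DT itself keeps its 10 rows.
bigger <- rbind(DT, data.table(a = 11L))

print(nrow(DT))
print(nrow(bigger))
print(identical(address(bigger), addr_DT))
```

To "grow" a table by rows you therefore reassign the result, e.g. `DT <- rbind(DT, ...)`, rather than updating in place.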

When you evaluate i, data.table doesn't check whether you have simply selected all rows in the same order as in the original (and skip the copy in that case); it always returns a copy.
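One way to check this yourself is to compare memory addresses. The sketch below uses the question's example; the names `ref_dt`, `sub_dt` and `dup_dt` are made up for illustration, and `copy()` is shown as the explicit way to get an independent table without subsetting:

```r
library(data.table)

DT <- data.table(a = 1:10)

ref_dt <- DT            # plain assignment: same object, no copy
sub_dt <- DT[a < 11]    # i is evaluated: new table, even though every row matches
dup_dt <- copy(DT)      # explicit deep copy

# Compare the table addresses and the addresses of the column vectors.
ref_same_table  <- identical(address(ref_dt), address(DT))
sub_same_column <- identical(address(sub_dt$a), address(DT$a))
dup_same_column <- identical(address(dup_dt$a), address(DT$a))
print(c(ref = ref_same_table, sub = sub_same_column, dup = dup_same_column))

# Updating the subset by reference leaves the original untouched.
sub_dt[1:5, a := 0L]
print(DT$a)
```

Only `ref_dt` shares memory with `DT`, so only an update through `ref_dt` would show up in `DT`.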

Excellent answer, as usual! I'll save the embarrassment and discard my answer :). — Arun, Apr 08 '13 at 22:59
@mnel: i think i wasn't entirely clear with my question or i don't completely understand your answer. i meant to understand whether, when you evaluate `i`, a copy is returned and not a reference.. is that true? — Alex, Apr 08 '13 at 23:17
@Alex precisely -- I've reworded my final statement (which on second reading was not particularly clear) — mnel, Apr 08 '13 at 23:22
