I appear to have produced some code where data.table is actually doing a copy on assignment, even when using :=. The toy example below illustrates the point. copy.A() takes two arguments: DT (a data.table passed by reference) and an integer n. It prints the address of DT, copies column A to a new column, and then prints the new address of DT. Nothing is returned; copy.A() is intended to operate purely via side effect.
library(data.table)

DT <- data.table(A = rnorm(10000))

copy.A <- function(DT, n) {
  address.start <- address(DT)           # address before the assignment
  DT[, sprintf('A.copy.%i', n) := A]     # add a new column by reference
  address.final <- address(DT)           # address after the assignment
  cat(sprintf('%.3i) %s --> %s\n', n, address.start, address.final))
}

for (n in 1L:120L)
  copy.A(DT, n)
Output:
001) 0x2d979d0 --> 0x2d979d0
002) 0x2d979d0 --> 0x2d979d0
...
098) 0x2d979d0 --> 0x2d979d0
099) 0x2d979d0 --> 0x2d979d0
100) 0x2d979d0 --> 0x6564820 # Copying starts to occur
101) 0x2d979d0 --> 0x2bcfa30
102) 0x2d979d0 --> 0x456cad0
103) 0x2d979d0 --> 0x4282570
...
At some point, the address starts to change when the assignment occurs. Based on this example, I would say that whenever we modify a data.table that was passed as an argument by reference, we must explicitly return that data.table, or else there is no guarantee that the change will persist. I should add that this behavior is not that surprising; I just hadn't realized it was occurring until now.
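To make the workaround concrete, here is a defensive variant I have been using (copy.A.safe is my own naming, not anything from the package): return the data.table and rebind it in the caller, so the new column persists even if a reallocation moved DT to a new address inside the function.

```r
library(data.table)

## Hypothetical safer variant: return DT so the caller can reassign.
copy.A.safe <- function(DT, n) {
  DT[, sprintf('A.copy.%i', n) := A]  # add column by reference
  DT                                  # return so the caller can rebind
}

DT <- data.table(A = rnorm(10000))
for (n in 1L:120L)
  DT <- copy.A.safe(DT, n)  # rebinding survives any internal reallocation

ncol(DT)  # A plus the 120 copies
```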
This question is really just a request for more information. I haven't found anything in the documentation that really discusses it. Can someone shed some more light on this under-the-hood copying behavior or perhaps point to some documentation that explains this? Does data.table always pre-allocate the same amount of memory, and what are the rules for memory allocation as the size of the data.table increases?
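For reference, here is what I have been poking at while trying to answer this myself, assuming truelength() and alloc.col() behave as I understand them from the help pages (the datatable.alloccol option name is my reading of ?alloc.col, so treat the specifics as unverified):

```r
library(data.table)

DT <- data.table(A = rnorm(10000))
length(DT)      # columns actually in use
truelength(DT)  # column slots allocated (the over-allocation), I believe

## The over-allocation default, if I read ?alloc.col correctly:
getOption("datatable.alloccol")

## Reserving more slots up front, by reference; with enough slots,
## := should not need to reallocate -- my assumption, not confirmed.
alloc.col(DT, 2048L)
truelength(DT)
```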