I appear to have written some code in which data.table copies on assignment even when I use :=. The toy example below illustrates the point. copy.A() takes two arguments: DT (a data.table, passed by reference) and an integer n. It prints the address of DT, copies column A to a new column, and then prints the address of DT again. Nothing is returned; copy.A() is intended to operate via side effect.

library(data.table)

DT <- data.table(A=rnorm(10000))

copy.A <- function(DT, n) {
    address.start <- address(DT) 
    DT[, sprintf('A.copy.%i', n):=A]
    address.final <- address(DT)

    cat(sprintf('%.3i) %s --> %s\n', n, address.start, address.final))
}

for(n in 1L:120L) 
    copy.A(DT, n)

Output:

001) 0x2d979d0 --> 0x2d979d0
002) 0x2d979d0 --> 0x2d979d0
...
098) 0x2d979d0 --> 0x2d979d0
099) 0x2d979d0 --> 0x2d979d0
100) 0x2d979d0 --> 0x6564820 # Copying starts to occur
101) 0x2d979d0 --> 0x2bcfa30
102) 0x2d979d0 --> 0x456cad0
103) 0x2d979d0 --> 0x4282570
...

At some point, the address starts to change when the assignment occurs. Based on this example, it seems that whenever we modify a data.table that was passed to a function by reference, we must explicitly return that data.table, or else there is no guarantee that the change will persist. I should add that this behavior is not that surprising; I just hadn't realized it was occurring until now.
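One way to see when the address starts to move is to log the number of columns in use next to the number of allocated column slots. The sketch below is a rough diagnostic, not a description of the internals: add.col() and res are hypothetical names, and it assumes that truelength() reports the allocated column pointer slots and that 1100 additions exceed the session's default over-allocation (the output above suggests roughly 100 slots for my version; other versions may reserve more).

```r
library(data.table)

DT <- data.table(A = rnorm(10))

# Hypothetical helper: add one column by reference, then report how many
# columns were in use, how many slots were allocated, and whether the
# local binding moved to a new address.
add.col <- function(DT, n) {
    before <- address(DT)
    used   <- length(DT)       # columns in use before the assignment
    slots  <- truelength(DT)   # allocated column pointer slots (assumption)
    DT[, sprintf('A.copy.%i', n) := A]
    data.frame(n = n, used = used, slots = slots,
               moved = !identical(before, address(DT)))
}

res <- do.call(rbind, lapply(1:1100, add.col, DT = DT))

# First call (if any) at which the address changed inside the function
head(res[res$moved, ], 1)

# Columns that actually reached the caller's DT
ncol(DT)
```

If any row has moved set to TRUE, ncol(DT) should come out smaller than 1101, which is the lost-update effect described above.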

This question is really just a request for more information, since I haven't found anything in the documentation that discusses this. Can someone shed more light on this under-the-hood copying behavior, or point to documentation that explains it? Does data.table always pre-allocate the same amount of memory, and what are the rules for memory allocation as the data.table grows?
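For reference, here is a minimal sketch of reserving column slots up front, so that a side-effect-only function keeps working for a known number of new columns. It assumes alloc.col() accepts a slot count as its second argument (later releases also expose this as setalloccol()), and that 1000 slots is ample for 120 extra columns. add.col(), start, addrs and moved are hypothetical names.

```r
library(data.table)

DT <- data.table(A = rnorm(10))

# Session default for over-allocation; the value differs across versions
getOption("datatable.alloccol")

# Reserve column slots before adding columns by reference
# (1000 is an arbitrary choice, assumed large enough for this loop)
DT <- alloc.col(DT, 1000L)

start <- address(DT)

# Hypothetical helper: add a column by reference, return the address seen
# inside the function after the assignment.
add.col <- function(DT, n) {
    DT[, sprintf('A.copy.%i', n) := A]
    address(DT)
}

addrs <- vapply(1:120, add.col, character(1), DT = DT)
moved <- addrs != start   # TRUE wherever the local binding was reallocated

any(moved)
ncol(DT)
```

With enough slots reserved, I would expect no address change and all 120 new columns to be visible in the caller's DT.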

andrew
  • This is by design. A good place to read up on it is `?alloc.col` – Frank Oct 05 '15 at 19:07
  • Thanks Frank - Off the top of your head, are there any other functions that you would recommend reading about to better understand the data.table internals? Seems like the data.table internals aren't explicitly documented yet based on this open issue: https://github.com/Rdatatable/data.table/issues/944 – andrew Oct 05 '15 at 19:52
  • Generally, only the user interface needs to be documented, not internals. For this package the true internals are all written in C and so will not get R docs, I guess. The github issue you're pointing toward is about tutorials that would be nice to have, but there really isn't much missing by way of docs at the moment. There are a lot of good references among the top-voted questions and answers on SO, though, like http://stackoverflow.com/a/10226454/1191259 (Matt Dowle and Arun are coauthors of the package) – Frank Oct 05 '15 at 19:58
