
I seem to have found a situation where data.table's update by reference does not work as expected, contrary to what is described in "Understanding exactly when a data.table is a reference to (vs a copy of) another data.table".

If you load a data.table from an .RData file and then update it by reference via :=, the data.table is copied implicitly (i.e. its memory address changes). That still works as long as you do the update in the same environment.

But, if you do the update inside a function (e.g. f(dt)), the data.table dt is not changed outside the function in the calling environment, because it was copied inside the function.

Here is a little example:

library(data.table)

# function definition
f1 <- function(dt,dt.j){
    # add columns C and D to dt by reference, taken from the join of dt with dt.j
    dt[,c("C","D"):=(dt[dt.j,nomatch=0][,list(C,D)])]
}


# create, save, delete and then load from file
dt <- data.table(A=c("A","A","B"),B=1:3,key=c("A"))
dt.j <- data.table(A=c("A","B","C"),C=5:7,D=c("a","a","b"))
save(dt,file="~/test.Rdata")
rm(dt)
load(file="~/test.Rdata")
address(dt)
f1(dt,dt.j)
address(dt)
dt

After the call to f1, the address of dt is unchanged, and so is the data.table itself: columns C and D have not been added.

There is nothing wrong with the code of the function. If I omit the function and do the update directly, it works, but it changes the address of the data.table:

address(dt)
dt[,c("C","D"):=(dt[dt.j,nomatch=0][,list(C,D)])]
address(dt)
dt
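
One way to inspect what differs between a table built in the session and one restored by load() is to compare the column count with the number of allocated column slots. This is only a diagnostic sketch: the temp file path and the variable name fresh are assumptions made for illustration, and the reported numbers may vary with the data.table version, so no expected output is shown.

```r
library(data.table)

# Hypothetical temp file; the example above used "~/test.Rdata".
path <- tempfile(fileext = ".RData")

# A table created in this session, kept for comparison.
fresh <- data.table(A = c("A", "A", "B"), B = 1:3, key = "A")

# The same data, saved, removed and loaded back from disk.
dt <- data.table(A = c("A", "A", "B"), B = 1:3, key = "A")
save(dt, file = path)
rm(dt)
load(file = path)

# length() is the number of columns, truelength() the number of allocated
# column slots. If the loaded table reports no spare slots, := has to
# re-allocate it before adding columns, which would explain the new address.
c(ncol = length(fresh), slots = truelength(fresh))
c(ncol = length(dt), slots = truelength(dt))
```

If the second line shows fewer slots than the columns to be added would need, the re-allocation happens wherever := is called, which inside f1 is the function's own frame.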

I can cope with this by copying the data.table after loading.
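
A minimal sketch of that workaround, assuming the same f1 and dt.j as in the example (with dt.j passed as an argument) and a hypothetical temp file in place of "~/test.Rdata". It also tries alloc.col(), which was suggested in the comments as a lighter alternative to a full copy. The helper variables cols_after_copy and cols_after_alloc are only there for illustration.

```r
library(data.table)

# Same update function as in the example, with dt.j passed in explicitly.
f1 <- function(dt, dt.j) {
  dt[, c("C", "D") := (dt[dt.j, nomatch = 0][, list(C, D)])]
}

dt.j <- data.table(A = c("A", "B", "C"), C = 5:7, D = c("a", "a", "b"))
dt <- data.table(A = c("A", "A", "B"), B = 1:3, key = "A")

# Hypothetical temp file used instead of "~/test.Rdata".
path <- tempfile(fileext = ".RData")
save(dt, file = path)
rm(dt)

# Workaround 1: take a copy right after loading, then update inside f1.
load(file = path)
dt <- copy(dt)
f1(dt, dt.j)
cols_after_copy <- names(dt)

# Workaround 2 (from the comments): re-allocate the column slots after
# loading instead of copying the whole table.
load(file = path)              # reload the unmodified table
invisible(alloc.col(dt))
f1(dt, dt.j)
cols_after_alloc <- names(dt)

cols_after_copy
cols_after_alloc
```

In both cases the table is prepared in the calling environment before it is handed to the function, so the update inside f1 should not need to re-allocate dt.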

What I'd like to know is whether there are other situations, apart from the one described above, where data.table shows this behavior.

Here's the information about my R session (I could also replicate this behavior on a Windows machine):

R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Generic 22 (Generic)

locale:
 [1] LC_CTYPE=de_DE.UTF-8      LC_NUMERIC=C              LC_TIME=de_DE.utf8       
 [4] LC_COLLATE=de_DE.UTF-8    LC_MONETARY=de_DE.utf8    LC_MESSAGES=de_DE.UTF-8  
 [7] LC_PAPER=de_DE.utf8       LC_NAME=C                 LC_ADDRESS=C             
[10] LC_TELEPHONE=C            LC_MEASUREMENT=de_DE.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] tools_3.2.2  chron_2.3-47 
Thomas B.
  • I don't get what you're saying. When I run your 'little example' code, the `address` calls return the same value... (R 3.2.2, data.table 1.9.7, win 10) – Frank Dec 05 '15 at 22:34
  • @frank The second part of example is problematic, so not the copy-paste part. @Thomas after `load` function you can use `alloc.col(dt)`. – jangorecki Dec 05 '15 at 22:59
  • @frank you're right, the address hasn't changed, but neither has the data.table itself. But it should have done so because of an update by reference inside the function. – Thomas B. Dec 06 '15 at 07:31
  • But if I skip the function (second example) and do the update by reference directly, the address changes and the data.table gets updated (i.e. the result is correct, I can access the result via `dt`). The change of address happens in the function, too. But because it happens inside the scope of the function, it's not visible outside the function in the calling environment. – Thomas B. Dec 06 '15 at 07:40
  • @jangorecki thanks for your hint, that my problem was answered and explained already by another post (I should have searched more thoroughly beforehand). It helped me a lot understanding the reason for this behaviour of data.table. – Thomas B. Dec 07 '15 at 20:40

0 Answers