I'd like to create a data.table from large pre-existing vectors without copying those vectors. That is, I'd like to create a data.table that does only a shallow copy of the pointers to the underlying vectors, rather than a full copy of the data within the vectors.
I'd think this would be a common desire, but I haven't found any way to do it. Once the vectors are columns within another data.table I can inexpensively make further copies, but I haven't seen instructions for how to create that initial table by reference.
Is this possible? Here's what I'm trying with a single vector, although my actual goal is to create a data.table from several large vectors:
nate@ubuntu:~/R/byreference$ cat dt.R
library(data.table)
# Some large vector that needs to be created anyway
largeVector = rnorm(1000*1000)
# I'd like to see no large memory allocations
Rprofmem("allocations.txt")
# I want to create dt without copying largeVector
dt = as.data.table(list(x = largeVector))
# This variation doesn't work either:
# dt = data.table(list(x = largeVector))
# This one comes closest to working, but acts as copy-on-write
# dt = setDT(list(x = largeVector))
# Currently, I see lots and lots of memory allocations
# (some may be https://github.com/Rdatatable/data.table/issues/1062)
Rprofmem(NULL)
# The addresses of the vectors should be identical if no copy occurred
identical(address(largeVector), address(dt$x)) # FIXME: should be TRUE
# For comparison, the addresses are identical if I copy 'dt'
dtCopy = dt
identical(address(dtCopy$x), address(dt$x))
# I'm not looking for copy-on-write semantics. I'd like a simple
# reference, the same as would occur with a shallow copy of a data.table
dt[, x := 2.0*x]
# But this works! (see second edit at bottom)
# dt[1:.N, x := 2.0*x]
# All of these should be true (currently only the last two are)
identical(dt$x, largeVector) # FIXME: should be TRUE
identical(address(largeVector), address(dt$x)) # FIXME: should be TRUE
identical(dt$x, dtCopy$x)
identical(address(dtCopy$x), address(dt$x))
And here's what I see with R 3.1.2 and data.table 1.9.4:
nate@ubuntu:~/R/byreference$ Rscript dt.R
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
nate@ubuntu:~/R/byreference$ cat allocations.txt
1480 :"as.data.table"
6320 :"as.data.table"
6320 :"as.data.table"
1064 :"as.data.table"
344 :"as.data.table"
928 :"as.data.table"
1808 :"as.data.table"
600 :"as.data.table"
192 :"as.data.table"
408 :"as.data.table.list" "as.data.table"
1256 :"as.data.table.list" "as.data.table"
1248 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
240 :"as.data.table.list" "as.data.table"
432 :"as.data.table.list" "as.data.table"
184 :"as.data.table.list" "as.data.table"
8000040 :"copy" "as.data.table.list" "as.data.table"
216 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
1064 :"copy" "as.data.table.list" "as.data.table"
536 :"as.data.table.list" "as.data.table"
1816 :"as.data.table.list" "as.data.table"
1808 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
384 :"as.data.table.list" "as.data.table"
720 :"as.data.table.list" "as.data.table"
256 :"as.data.table.list" "as.data.table"
1024 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
208 :"as.data.table.list" "as.data.table"
656 :"as.data.table.list" "as.data.table"
1264 :"as.data.table.list" "as.data.table"
416 :"as.data.table.list" "as.data.table"
184 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
304 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
208 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
368 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"
Wow, constructing 'dt' has many more allocations than I would have expected! While most are small, I'd really like to be able to avoid the large ones, as my vectors might each be several GB.
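To pick the large allocations out of a log like the one above, something along these lines might help. It is only a minimal sketch: the two sample lines are copied from the output above, while the 1 MB cutoff and the variable names are my own assumptions.

```r
# Two sample lines copied from the allocations.txt output above
allocLines <- c(
  '1480 :"as.data.table"',
  '8000040 :"copy" "as.data.table.list" "as.data.table"'
)
# In practice: allocLines <- readLines("allocations.txt")

# The leading field on each line is the allocation size in bytes
bytes <- as.numeric(sub(" .*$", "", allocLines))

# Assumed cutoff: anything of 1 MB or more counts as "large"
cutoff <- 1e6
largeAllocs <- allocLines[bytes >= cutoff]
print(largeAllocs)
```

Rprofmem() also accepts a threshold argument, so a call such as Rprofmem("allocations.txt", threshold = 1e6) should keep the small allocations out of the log in the first place, provided R was built with memory profiling enabled.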
Edit: Eddi initially marked this as a duplicate of "Sub-assign by reference on vector in R". It's not. My goal is not to modify a vector in place; my goal is to create a data.table from a vector without copying that vector. I only used the modification because most readers will not have compiled R in a way that allows use of Rprofmem, and checking for the side effect is a way to confirm that no copy happened. I've changed the example to try to make this clearer.
Edit: That said, I think Eddi is right that my problems are really due to the bug that he just filed (UPDATE: now fixed): https://github.com/Rdatatable/data.table/issues/1248. The combination of "dt = setDT(list(x = largeVector))" and then "dt[1:.N, x := 2.0*x]" works as I would expect: modifications happen in place, with no large allocations. So while I don't think this is actually a duplicate, it's probably fine to let this question die.
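For anyone who wants to try that combination on their own setup, here is a minimal sketch of the check. The names sameAddress and sameValues are mine, and I'm assuming results may differ between R and data.table versions, so the script prints what it finds rather than asserting it.

```r
library(data.table)

# Some large vector that needs to be created anyway
largeVector <- rnorm(1000 * 1000)

# setDT() converts the list to a data.table by reference
dt <- setDT(list(x = largeVector))

# TRUE here would mean the column still points at largeVector's memory
sameAddress <- identical(address(largeVector), address(dt$x))

# Sub-assign with an explicit row index instead of replacing the whole column
dt[1:.N, x := 2.0 * x]

# TRUE here would mean largeVector saw the update too, i.e. no copy was made
sameValues <- identical(dt$x, largeVector)

print(c(sameAddress = sameAddress, sameValues = sameValues))
```

Note that if both come back TRUE, largeVector itself has been doubled, which is exactly the side effect I was using to detect the absence of a copy.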