I'd like to create a data.table from large pre-existing vectors without copying those vectors. That is, I'd like to create a data.table that does only a shallow copy of the pointers to the underlying vectors, rather than a full copy of the data within the vectors.

I'd think this would be a common desire, but I haven't found any way to do it. Once the vectors are columns within another data.table I can inexpensively make further copies, but I haven't seen instructions for how to create that initial table by reference.

Is this possible? Here's how I'm trying for a single vector, although my actual goal is to create a data.table using several large vectors:

nate@ubuntu:~/R/byreference$ cat dt.R

library(data.table)

# Some large vector that needs to be created anyway
largeVector = rnorm(1000*1000)

# I'd like to see no large memory allocations
Rprofmem("allocations.txt")

# I want to create dt without copying largeVector
dt = as.data.table(list(x = largeVector))

# This variation doesn't work either:
# dt = data.table(list(x = largeVector))

# This one comes closest to working, but acts as copy-on-write
# dt = setDT(list(x = largeVector))

# Currently, I see lots and lots of memory allocations
# (some may be https://github.com/Rdatatable/data.table/issues/1062)
Rprofmem(NULL)

# The addresses of the vectors should be identical if no copy occurred
identical(address(largeVector), address(dt$x))  # FIXME: should be TRUE

# For comparison, the addresses are identical if I copy 'dt'
dtCopy = dt
identical(address(dtCopy$x), address(dt$x))

# I'm not looking for copy-on-write semantics.  I'd like a simple
# reference, the same as would occur with a shallow copy of a data.table
dt[, x := 2.0*x]

# But this works! (see second edit at bottom)
# dt[1:.N, x := 2.0*x]    

# All of these should be true (currently only the last two are)
identical(dt$x, largeVector)                   # FIXME: should be TRUE
identical(address(largeVector), address(dt$x)) # FIXME: should be TRUE
identical(dt$x, dtCopy$x)
identical(address(dtCopy$x), address(dt$x))

And here's what I see with R 3.1.2 and data.table 1.9.4:

nate@ubuntu:~/R/byreference$ Rscript dt.R
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE

nate@ubuntu:~/R/byreference$ cat allocations.txt
1480 :"as.data.table"
6320 :"as.data.table"
6320 :"as.data.table"
1064 :"as.data.table"
344 :"as.data.table"
928 :"as.data.table"
1808 :"as.data.table"
600 :"as.data.table"
192 :"as.data.table"
408 :"as.data.table.list" "as.data.table"
1256 :"as.data.table.list" "as.data.table"
1248 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
240 :"as.data.table.list" "as.data.table"
432 :"as.data.table.list" "as.data.table"
184 :"as.data.table.list" "as.data.table"
8000040 :"copy" "as.data.table.list" "as.data.table"
216 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
1064 :"copy" "as.data.table.list" "as.data.table"
536 :"as.data.table.list" "as.data.table"
1816 :"as.data.table.list" "as.data.table"
1808 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
384 :"as.data.table.list" "as.data.table"
720 :"as.data.table.list" "as.data.table"
256 :"as.data.table.list" "as.data.table"
1024 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
208 :"as.data.table.list" "as.data.table"
656 :"as.data.table.list" "as.data.table"
1264 :"as.data.table.list" "as.data.table"
416 :"as.data.table.list" "as.data.table"
184 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
304 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
208 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
368 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"

Wow, constructing 'dt' has many more allocations than I would have expected! While most are small, I'd really like to be able to avoid the large ones, as my vectors might each be several GB.

Edit: Eddi initially marked this as a duplicate of "Sub-assign by reference on vector in R". It's not. My goal is not to modify a vector in place; my goal is to create a data.table from a vector without copying that vector. I only used the modification because most readers will not have compiled R in a way that allows use of Rprofmem, and checking for the side effect guarantees that no copy happened. I've changed the example to try to make this clearer.

Edit: That said, I think Eddi is right that my problems are really due to the bug that he just filed (UPDATE: now fixed): https://github.com/Rdatatable/data.table/issues/1248. The combination of "dt = setDT(list(x = largeVector))" and then "dt[1:.N, x := 2.0*x]" works as I would expect: modifications in place, and no large allocations. So while I don't think this is actually a duplicate, it's probably fine to let this question die.

Nathan Kurz
  • btw also relevant to this question is [this bug](https://github.com/Rdatatable/data.table/issues/1248) – eddi Jul 29 '15 at 23:42
  • Eddi has convinced me. This is simply a bug in data.table, and not a particularly useful question. Switching to the combination of "dt = setDT(list(x = largeVector))" and then "dt[1:.N, x := 2.0*x]" works as I would expect: modifications in place, and no large allocations. – Nathan Kurz Jul 30 '15 at 00:46
  • @NathanKurz while your and the other question's goal may not be the same, the answer is still the same - wrap it in a `list`, then do `setDT`, thus why I marked it as a duplicate – eddi Jul 30 '15 at 15:13
  • @Arun I understand *why* it happens and the motivation behind it, but I still believe it's a bug, as it results in undesired/unexpected behavior. – eddi Jul 30 '15 at 15:17
  • The unexpected behavior is that it's not modifying by reference. – eddi Jul 30 '15 at 15:22
  • I'd rather the behavior was consistent with definitions at the cost of a `memcpy`. My brain doesn't compute `1:.N` in `i` being different from nothing in there or how the address changes or doesn't change depending on rhs of `:=` (without changing type of column) - as a user it just makes no sense. – eddi Jul 30 '15 at 15:37
  • Most users don't need to know, that's true, but as this question clearly shows some do. And as another user myself - if you want it to mean smth else, that's fine, but then the documentation needs to be changed to reflect that. – eddi Jul 30 '15 at 15:46
  • @Arun ok, I'll take a look if I get time – eddi Jul 30 '15 at 15:47

1 Answer

The open data.table issue #1248 [UPDATE: now resolved] notwithstanding, the way to convert a set of vectors to a data.table without copying data is:

a = 1:5
b = 5:1
address(a)
#[1] "000000000FFE6AE0"
address(b)
#[1] "000000000FFE6A50"

dt = setDT(list(a, b))
sapply(dt, address)
#                V1                 V2 
#"000000000FFE6AE0" "000000000FFE6A50"
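
For the multi-vector case in the question, the same approach can be sketched with a named list so the columns keep meaningful names. This is only a minimal sketch: `x`, `y`, `addr_before` and `addr_after` are hypothetical names standing in for the real large vectors, and it assumes a data.table version in which `setDT` on a list does not copy the columns, as shown above.

```r
library(data.table)

# Hypothetical stand-ins for the large pre-existing vectors
x <- rnorm(1e6)
y <- runif(1e6)

# Record where each vector lives before building the table
addr_before <- c(x = address(x), y = address(y))

# setDT() converts the list to a data.table in place; naming the
# list elements gives the columns those names instead of V1, V2
dt <- setDT(list(x = x, y = y))

# Addresses of the columns now held by the data.table
addr_after <- vapply(dt, address, character(1))

# One logical per column: TRUE where the address is unchanged,
# i.e. the column was not copied
print(addr_before == addr_after)
```

Comparing addresses this way avoids needing an R build with memory profiling enabled for `Rprofmem`.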
Matt Dowle
eddi
  • +1 @nathan-kurz I'll answer on the linked issue but please accept this answer here. It seems to me to answer this question, if you agree. – Matt Dowle Aug 05 '15 at 09:04