
While writing a package that relies on data.table, I've discovered some odd behavior. I have a function that removes and reorders some columns by reference, and it works just fine: the data.table I pass in is modified without my assigning the function's output. I have another function that adds new columns, but those changes do not always persist in the data.table that was passed in.

Here's a small example:

library(data.table)  # I'm using 1.9.4
test <- data.table(id = letters[1:2], val=1:2)
foobar <- function(dt, col) {
    dt[, (col) := 1]
    invisible(dt)
}

test
#  id val
#1: a   1
#2: b   2
saveRDS(test, "test.rds")
test2 <- readRDS("test.rds")
all.equal(test, test2)
#[1] TRUE
foobar(test, "new")
test
#  id val new
#1: a   1   1
#2: b   2   1
foobar(test2, "new")
test2
#  id val
#1: a   1
#2: b   2

What happened? What's different about test2? I can modify existing columns in-place on either:

foobar(test, "val")
test
#  id val new
#1: a   1   1
#2: b   1   1
foobar(test2, "val")
test2
#  id val
#1: a   1
#2: b   1

But adding to test2 still doesn't work:

foobar(test2, "someothercol")
.Last.value
#  id val someothercol
#1: a   1            1
#2: b   1            1
test2
#  id val
#1: a   1
#2: b   1

I can't pin down all the cases where I see this behavior, but saving to and reading from RDS is the first case I can reliably replicate. Writing to and reading from a CSV doesn't seem to have the same problem.

Is this a pointer issue ala this issue, like serializing a data.table destroys the over-allocated pointers? Is there a simple way to restore them? How could I check for them inside my function, so I could restore the pointers or error if the operation isn't going to work?

I know I can assign the function output as a workaround, but that's not very data.table-y. Wouldn't that also create a temporary copy in memory?

Response to Arun's solution

Arun has explained that it is indeed a pointer issue, which can be diagnosed with truelength and fixed with setDT or alloc.col. I ran into a problem encapsulating his solution in a function (continuing from the code above):

func <- function(dt) {if (!truelength(dt)) setDT(dt)}
func2 <- function(dt) {if (!truelength(dt)) alloc.col(dt)}
test2 <- readRDS("test.rds")
truelength(test2)
#[1] 0
truelength(func(test2))
#[1] 100
truelength(test2)
#[1] 0
truelength(func2(test2))
#[1] 100
truelength(test2)
#[1] 0

So it looks like the local copy inside the function is being properly modified, but the reference version is not. Why not?

ClaytonJY
  • Fabulous question/explanations/reproducible example/motivation. Kudos. I can reproduce this even on the development version. Btw, you don't need the `invisible(dt)` part in your function IMO. – David Arenburg Jan 22 '15 at 08:45
  • @DavidArenburg Thanks! I know I could have left it out, but for some reason I wanted to suppress printing. Thinking about it again, it may have been more clear to show the difference between function output and the original data.table. Oh well. – ClaytonJY Jan 22 '15 at 16:19

1 Answer


Is this a pointer issue ala this issue, like serializing a data.table destroys the over-allocated pointers?

Yes, loading from disk sets the external pointer to NULL, so we have to over-allocate again.

Is there a simple way to restore them?

Yes. You can test for truelength() of the data.table, and if it's 0, then use setDT() or alloc.col() on it.

truelength(test2) # [1] 0
if (!truelength(test2))
    setDT(test2)
truelength(test2) # [1] 100

foobar(test2, "new")
test2[]
#    id val new
# 1:  a   1   1
# 2:  b   2   1
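The question also asks how to raise an error when the by-reference add is not going to work. A minimal sketch of such a guard, assuming the truelength() check above is the right signal; check_allocated is a hypothetical helper name, not part of data.table:

```r
library(data.table)

# Hypothetical guard (not a data.table function): stop early when `dt`
# has no over-allocated column slots, i.e. truelength() reports 0.
check_allocated <- function(dt) {
    stopifnot(is.data.table(dt))
    if (truelength(dt) == 0L) {
        stop("data.table has no over-allocated column slots; ",
             "call setDT() on it before adding columns by reference")
    }
    invisible(TRUE)
}

# A freshly created data.table is over-allocated, so the guard passes.
fresh <- data.table(id = letters[1:2], val = 1:2)
check_allocated(fresh)
```

Calling the guard at the top of a function like foobar() would turn the silent no-op into an explicit error, at the cost of forcing callers to run setDT() themselves.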

This should probably go in as a FAQ (can't remember seeing it there).
Update: this is already in the FAQ, in the Warning Messages section.
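When the check lives inside a function, setDT() appears to re-bind only the function's local variable, so the caller's table keeps a truelength of 0 unless the result is assigned back. A minimal sketch of that re-assignment pattern, assuming this behavior; add_col is a hypothetical helper modeled on foobar() from the question:

```r
library(data.table)

# Hypothetical helper (modeled on foobar() from the question).
add_col <- function(dt, col) {
    if (truelength(dt) == 0L) {
        # Re-allocates column slots, but only the local `dt` is re-bound;
        # the caller's variable is not touched by this step.
        setDT(dt)
    }
    dt[, (col) := 1]
    invisible(dt)
}

# Round-trip through RDS, as in the question, using a temporary file.
path <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), path)
test2 <- readRDS(path)

# Assign the result back so the caller sees the new column.
test2 <- add_col(test2, "new")
names(test2)
```

The re-assignment should not duplicate the column data, since the copy made when the slots are re-allocated is shallow.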

jangorecki
Arun
  • Is there a reason to choose `setDT` over `alloc.col`? – ClaytonJY Jan 22 '15 at 16:33
  • No reason. I just chose `setDT()` because `alloc.col` doesn't return the result invisibly, so I'll have to wrap it with `invisible()`. `setDT()` seemed therefore shorter. – Arun Jan 22 '15 at 16:58
  • Oh okay that makes sense. Since it'll be in a function, invisibility at this step doesn't matter, so I think `alloc.col` communicates intention better. Thanks for the quick and accurate reply @Arun! – ClaytonJY Jan 22 '15 at 17:15
  • Actually, this doesn't seem to work inside of a function, with either `setDT` or `alloc.col`. I can see the block trigger and bump up `truelength` using `browser`, but the table I pass in still isn't modified, and its `truelength` is still 0 after the call, e.g. `func <- function(dt) {if (!truelength(dt)) setDT(dt)}`. Any ideas? – ClaytonJY Jan 22 '15 at 17:45
  • it does! It brings me to a conclusion I was hoping to avoid (gotta re-assign somewhere), but at least the copy is shallow. – ClaytonJY Jan 22 '15 at 18:44
  • Great. We'll try and see if this case can be improved. Could you file this as an issue please (not really an issue, but could be improved if possible)? – Arun Jan 22 '15 at 18:47