
(This is a follow up question to this.)

Check this toy code:

> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z } 
> y <- foo(x)
> 
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2

It looks like setDT did change x's class, but the addition of data did not apply to x.
What happened here?

Ofek Shilon
  • At least some elements of the same question have been discussed here: https://github.com/Rdatatable/data.table/issues/4589 – s_baldur Jul 07 '20 at 12:40
  • `z` is a reference to `x` until `setDT`. So `setDT` is applied to `x`. If you change `z` like in `foo <- function(z) {z$b <- 3:4; setDT(z); z }` `z` is no longer a reference to `x` and `setDT` does not change `x`. See output of: `foo <- function(z) {print(address(z)); z}; address(x); y <- foo(x); address(y)` – GKi Jul 07 '20 at 12:41
  • Or try: `x <- data.frame(a = 1:2); y <- x; setDT(y); class(x)` – GKi Jul 07 '20 at 12:48
  • @GKi Would be interesting if you expanded that answer to include the relevant vocabulary and logic (why this works like this). – s_baldur Jul 07 '20 at 12:48
  • @sindri_baldur It was my post on github. I didn't get a satisfactory answer there (yet), so thought I'd try here. – Ofek Shilon Jul 07 '20 at 12:50
  • @sindri_baldur It works like this, as R is not *call by value*, it is *call by reference* as long as the value is *not* changed. – GKi Jul 07 '20 at 12:50
  • @GKi So why won't setDF change the class back? Check `x <- data.frame(a = 1:2); foo <- function(z) { setDT(z) ; z[, b:=3:4] ; setDF(z) }; y<-foo(x); class(x)` – Ofek Shilon Jul 07 '20 at 12:55
  • @OfekShilon At that point, when you use `setDF`, `z` was already changed. So it is no longer a reference to `x`. – GKi Jul 07 '20 at 12:59
  • @GKi `:=` operates by reference, it doesn't make a copy of `z`. Checking with `address` it looks like `setDT` is the one making a copy of `z` (which makes the behaviour even stranger) – Ofek Shilon Jul 07 '20 at 13:05
  • @OfekShilon Yes, `setDT` makes the copy. I have added a wiki "Answer" showing that. – GKi Jul 07 '20 at 13:08
  • This seems relevant https://stackoverflow.com/questions/26069219/using-setdt-inside-a-function?noredirect=1&lq=1 – Frank Jul 09 '20 at 06:02

3 Answers


In your function z is a reference to x until setDT.

library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))} 
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"

Inside setDT, execution reaches the following line while z still points to the same address as x:

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

setattr does not make a copy. So x and z still point to the same address, and both are now of class data.table:

x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

Then setalloccol is called, which in this case calls:

assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))

which now makes x and z point to different addresses:

address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"

And both have the class data.table:

class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"

I think that if they had used

class(z) <- data.table:::.resetclass(z, "data.frame")

instead of

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

the problem would not occur.

x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"

But after class(z) <- value, z no longer points to the address it pointed to before:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"

After setDT, however, z also no longer points to the address it pointed to before:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"

As @Matt-dowle pointed out, it is also possible to change the data in x through z:

x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
#   a b
#1: 1 3
#2: 7 4
x
#   a
#1: 1
#2: 7
R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’
GKi
  • Thanks! This seems like the correct answer - a discussion with data.table's author continues here: https://github.com/Rdatatable/data.table/issues/4589 . Will update. – Ofek Shilon Jul 12 '20 at 07:18

A supplement to GKi's answer:

setalloccol's placement is indeed the direct culprit: it performs a shallow copy (i.e., it generates a new vector of pointers to the existing data columns) and additionally allocates 1024 (by default) extra slots for further columns. If the class is set to data.table after this shallow copy (either by class(z)<- or by setattr), the change is applied to this new vector and not to the original argument.
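
A minimal sketch of what that shallow copy looks like from the outside, assuming data.table is installed and behaves like the 1.12.8 release used in GKi's answer. The variable names are only for illustration, addresses differ per session, and other versions may behave differently:

```r
library(data.table)

x <- data.frame(a = c(1, 3))
z <- x                      # z and x refer to the same list of columns
setDT(z)                    # z is rebound to an over-allocated shallow copy

# Compare the top-level objects by address, then the column vectors.
# On the version used in the answer above, the first comparison is expected
# to be FALSE and the second TRUE, because only the vector of column
# pointers is copied, not the column data.
address(x) == address(z)
address(x$a) == address(z$a)

# Over-allocation: the number of allocated column slots exceeds the
# number of columns actually in use.
length(z)
truelength(z)
```

The shared column vector is why an in-place update of an existing column can show up in the caller's object, while a newly added column only lands in the new vector of pointers.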

However.

Even with a fixed version of setDT (called setDT.fixed below, with setattr called after setalloccol), there seems to be no way to get consistent behaviour. Some operations apply to the caller's copy, and some don't.

df <- data.frame(a=1:2, b=3:4)

foo1 <- function(z) { 
  setDT.fixed(z)
  z[, b:=5]   # will apply to the caller copy
  data.table::setDF(z)
}

foo1(df)
#    a b
# 1: 1 5
# 2: 2 5
class(df)
# [1] "data.frame"
df
#   a b
# 1 1 5
# 2 2 5

foo2 <- function(z) { 
  setDT.fixed(z)
  z[, c:=5]   # will NOT apply to the caller copy
  data.table::setDF(z)
}
foo2(df)
#    a b c
# 1: 1 3 5
# 2: 2 4 5
# Warning message:
# In `[.data.table`(z, , `:=`(c, 5)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
class(df)
# [1] "data.table" "data.frame"
df
#    a b
# 1: 1 3
# 2: 2 4

(Using the i argument, e.g., z[!is.na(a), b:=6], gives an extra dimension of weirdness which I won't go into here.)

Bottom line, the data.table package took on the brave task of punching a hole in R's all-value semantics. It was pretty successful until setDT came along (BTW, in response to a SO question here). Using setDT within a function on an argument will probably never have consistent semantics and is almost guaranteed to get you nasty surprises.
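
One way to sidestep the problem, sketched here under the assumption that an extra copy of the argument is acceptable: convert with as.data.table(), which returns a new object rather than modifying its input, instead of calling setDT() on the argument. The function name foo_copy is made up for this illustration.

```r
library(data.table)

# Hypothetical variant of foo from the question: it works on a copy, so the
# caller's data.frame is left alone.
foo_copy <- function(z) {
  z <- as.data.table(z)   # new data.table; the argument itself is not modified
  z[, b := 3:4]           # adds the column to the local object by reference
  z
}

x <- data.frame(a = 1:2)
y <- foo_copy(x)

class(x)   # still "data.frame"
names(x)   # still "a"
names(y)   # "a" "b"
```

The price is the copy itself. If modifying the caller's object in place is the actual goal, it is probably safer for the caller to run setDT() at top level before passing the object in.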

Ofek Shilon
library(data.table)

x <- data.frame(a = 1:2)
y <- x                # y refers to the same object as x
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e31a1e8"
setDT(y)              # setattr adds "data.table" to the class of the shared object (so x too); y is then rebound to a shallow copy
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e7b1300"
class(x)
#[1] "data.table" "data.frame"

x[, b:=3:4]
#Warning message:
#In `[.data.table`(x, , `:=`(b, 3:4)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

z <- data.frame(a = 1:2)
class(z) <- c("data.table", "data.frame")
z[, b:=3:4]
#Warning message:
#In `[.data.table`(z, , `:=`(b, 3:4)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
marc_s
GKi
  • Note that even this seems to contradict the docs: https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/setDT says "In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all,". I suspect (from the GH discussion) that only a *shallow* copy is made, but can't verify. Anyway, does this explain the behaviour in the question? – Ofek Shilon Jul 07 '20 at 13:26
  • @OfekShilon The copy is made by *R* and not by *data.table*. But the copy is made *after* data.table makes `setattr` - so in our case both `x` and `y` get the `data.table` class. – GKi Jul 07 '20 at 13:36
  • @OfekShilon Actually I think it is a **bug** in DT, because `x` claims only to be a DT, but it is not! – GKi Jul 07 '20 at 13:43
  • you mean that its `class` contains both data.table and data.frame? This is the expected behaviour: `y<-data.table(a=1:2);class(y)`. R's classes and inheritance are nasty - the class attr is a list of 'parents', where R should look for implementations (in order). – Ofek Shilon Jul 07 '20 at 13:44
  • @OfekShilon The class attr is OK. But the internal data structure is not like a DT and so I get the warning. – GKi Jul 07 '20 at 13:51
  • Oliver discussed this at his answer to the linked question: https://stackoverflow.com/a/62742393/89706 . I don't think this is a bug by itself. – Ofek Shilon Jul 07 '20 at 14:05
  • @GKi, to shed a little light on what `setDT()` is doing: I believe `setDT()` modifies the class by reference but only over-allocates columns for the object passed to `setDT()`. That is why you get the .internal.selfref message even though the class is data.table. – Andrew Jul 07 '20 at 14:10