I am seeing strange behaviour from the data.table package. Try the code below: why does the ordering of x change after calling setkey()?

# R version 3.1.0 (2014-04-10)
# data.table 1.9.2 (same behaviour with data.table 1.9.4)

require(data.table)

#dummy data
dat <- fread("A,B
6,7
4,5
1,2
3,4
0,2")

#get x and y
x <- dat$A
y <- dat[,A]

#compare - x and y, same.
x # [1] 6 4 1 3 0
y # [1] 6 4 1 3 0
all(x==y) # [1] TRUE

#Set key on column A
setkey(dat,A)

#compare - x is not same as y anymore!
x # [1] 0 1 3 4 6
y # [1] 6 4 1 3 0
all(x==y) # [1] FALSE
zx8754
  • This is due to R's copy-on-modify semantics + data.table's update by reference. Use `copy()` explicitly. Check `?copy` from data.table 1.9.4. – Arun Oct 09 '14 at 10:57
  • OK `x <- copy(dat$A)` works fine. Why doesn't `y` need explicit `copy()`? – zx8754 Oct 09 '14 at 11:07
  • Due to R's copy-on-modify semantics. Use `.Internal(inspect(.))` and note where the addresses are identical. – Arun Oct 09 '14 at 11:23

1 Answer

To expand on my comments:

After doing:

require(data.table)
dat <- fread("A,B
6,7
4,5
1,2
3,4
0,2")

# get x and y
x <- dat$A
y <- dat[,A]

If you do:

.Internal(inspect(x))
# @7fa677439e40 13 INTSXP g0c3 [NAM(2)] (len=5, tl=5) 6,4,1,3,0
.Internal(inspect(dat$A))
# @7fa677439e40 13 INTSXP g0c3 [NAM(2)] (len=5, tl=5) 6,4,1,3,0

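As you can see, the address @7fa677439e40 is identical in both cases (the value itself will differ on your machine). This is because R does not copy the data when we use the $ operator to extract an entire column and assign it to a variable; it copies only when absolutely necessary. x therefore points to the same memory as column A inside dat, and since setkey() reorders the columns of dat by reference, x is reordered along with it.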
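The shared memory can also be checked without `.Internal()`. The following is a minimal sketch, assuming a reasonably recent data.table that exports `address()`; the data is built with `data.table()` instead of `fread()` purely to keep the example short, and the name `shares_memory` is made up for illustration.

```r
library(data.table)

# Same dummy data as in the question
dat <- data.table(A = c(6L, 4L, 1L, 3L, 0L),
                  B = c(7L, 5L, 2L, 4L, 2L))

x <- dat$A  # `$` hands back the column itself, no copy is made

# address() returns the memory address of an object as a string
shares_memory <- address(x) == address(dat$A)
shares_memory

setkey(dat, A)  # sorts the rows of dat by reference

# x still refers to the column's memory, so it follows the new order
# (the question reports 0 1 3 4 6 at this point)
x
identical(x, dat$A)
```

If `shares_memory` is `TRUE`, any by-reference operation on `dat` (`setkey()`, `:=`, `set()`) can show through in `x`.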

Doing the same for the second case:

.Internal(inspect(y))
# @7fa677455248 13 INTSXP g0c3 [NAM(2)] (len=5, tl=5) 6,4,1,3,0
.Internal(inspect(dat)) # pasting the first 3 lines of output here
# @7fa674a0be00 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @7fa677439e40 13 INTSXP g0c3 [NAM(2)] (len=5, tl=5) 6,4,1,3,0 <~~~~~~~ 
#   @7fa677439e88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=5) 7,5,2,4,2

The address of y and that of column A inside dat (see the arrow) are not identical. This is because the data.table subset dat[, A] already created a copy, so y is unaffected by setkey(). In R, neither dat$A nor dat[["A"]] makes a copy under these circumstances (also good to know when you want to avoid unnecessary copies!).
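To keep a snapshot of a column that later by-reference operations cannot touch, the `copy()` suggested in the comments can be applied at extraction time. This is a small sketch under the same assumptions as the question's data; the variable names `x_ref` and `x_copy` are made up for illustration.

```r
library(data.table)

dat <- data.table(A = c(6L, 4L, 1L, 3L, 0L),
                  B = c(7L, 5L, 2L, 4L, 2L))

x_ref  <- dat$A        # refers to the column's own memory
x_copy <- copy(dat$A)  # explicit copy, independent of dat
y      <- dat[, A]     # data.table subset; reported above to be a copy already

setkey(dat, A)         # reorders dat by reference

x_copy  # [1] 6 4 1 3 0  (original order preserved)
x_ref   # follows the column, as described in the question
```

The extra copy costs memory proportional to the column, so it is only worth making when the original order (or original values) must survive later modifications of `dat`.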

Please write back if you have more questions.

HTH

More info on copy-on-modify.

Arun