I have a data.frame orig, which is subset and assigned to cpy.
library(data.table)
orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)
print(orig)
# id col1 col2 col3
# b 300 1 1
# a 46 2 2
# b 89 3 3
# a 2 4 4
cpy <- orig[,c("id","col1","col2")]
cpy is a shallow copy of orig and references parts of orig (all but the omitted columns).
Because cpy is only a subset of orig, it references just the shared columns, and the by-reference conversion done by setDT(cpy) does not carry over to orig. This leaves orig and cpy in a potentially dangerous state in which they share pointers to only a subset of their columns.
setDT(cpy)
.Internal(inspect(orig))
.Internal(inspect(cpy))
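As a more compact check than reading the inspect() output, the column addresses can be compared directly. This is only a rough sketch: it uses data.table's address() on each column, the variable name shared is my own, and whether the columns turn out to be shared may depend on the R version, so no output is shown.

```r
library(data.table)

# Rebuild the example objects so the snippet runs on its own
orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)
cpy <- orig[, c("id", "col1", "col2")]

# TRUE for a column means orig and cpy point at the same vector in memory
shared <- sapply(names(cpy), function(nm) {
  address(orig[[nm]]) == address(cpy[[nm]])
})
print(shared)
```

col3 does not appear in the result because it does not exist in cpy.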
If setkey is now applied to cpy, its columns, and therefore the same columns in orig, get sorted (here update by reference comes into play). The omitted column (col3) is not affected by the sorting because it is unknown to cpy. It is then out of sync with the rest of the object.
setkey(cpy,id,col1)
print(cpy)
# id col1 col2
# a 2 4
# a 46 2
# b 89 3
# b 300 1
print(orig)
# id col1 col2 col3
# a 2 4 1
# a 46 2 2
# b 89 3 3
# b 300 1 4
To avoid this behaviour, any action which forces a deep instead of a shallow copy when assigning cpy (e.g. copy()) breaks the reference to orig and thus prevents the unwanted reordering there.
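For illustration, a minimal sketch of that workaround, assuming the deep copy is made with data.table's copy() at the point where cpy is assigned:

```r
library(data.table)

orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)

# copy() forces a deep copy, so cpy no longer shares any column with orig
cpy <- copy(orig[, c("id", "col1", "col2")])
setDT(cpy)
setkey(cpy, id, col1)

print(cpy)   # sorted by id, then col1
print(orig)  # rows still in their original order, col3 still aligned
```

The price is that cpy is now fully detached from orig, which is exactly what prompts the question below.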
Is there any way for cpy not to lose the reference to the object orig itself and its omitted columns?