setkey on a subsetted shallow copy of a data.frame breaks the original

Question

I have a data.frame orig, which is subset and assigned to cpy.

library(data.table)

orig <- data.frame(id=letters[c(2,1,2,1)], col1=c(300,46,89,2), 
                   col2=1:4, col3=1:4)

print(orig)
# id col1 col2 col3
# b  300    1    1
# a   46    2    2
# b   89    3    3
# a    2    4    4

cpy <- orig[,c("id","col1","col2")]

cpy is a shallow copy of orig: it shares the column vectors it kept with orig (all but the omitted col3).

setDT(cpy) converts cpy to a data.table by reference, but because cpy is only a subset of orig, that conversion does not carry over to orig, which stays a data.frame. This leaves orig and cpy in a potentially dangerous state in which they share pointers to only a subset of their columns.

setDT(cpy)

.Internal(inspect(orig))
.Internal(inspect(cpy))
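
A more compact way to check the sharing than reading the .Internal(inspect()) output might be to compare column addresses with data.table's address(). This is only a sketch: it rebuilds orig as in the question, and the names shared, deep and deep_shared are made up for illustration. Whether the first comparison prints TRUE depends on R subsetting the data.frame shallowly, as described above.

```r
library(data.table)

orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)

# Column subset of a data.frame; the question reports this to be a shallow copy.
cpy <- orig[, c("id", "col1", "col2")]

# address() returns the memory address of an object as a string.
# Equal addresses mean both objects point at the same column vector.
shared <- address(orig$col1) == address(cpy$col1)
print(shared)

# copy() duplicates the columns, so the deep copy has its own col1 vector.
deep <- copy(cpy)
deep_shared <- address(orig$col1) == address(deep$col1)
print(deep_shared)
```

If shared is TRUE, any function that modifies cpy's columns in place also modifies those columns in orig.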

If setkey is now applied to cpy, its columns are sorted by reference, and because those columns are shared, the same columns in orig are sorted as well. The omitted column (col3) is not affected by the sorting because cpy does not know about it, so it ends up out of sync with the rest of orig.

setkey(cpy,id,col1)

print(cpy)
# id col1 col2
# a    2    4
# a   46    2
# b   89    3
# b  300    1

print(orig)
# id col1 col2 col3
# a    2    4    1
# a   46    2    2
# b   89    3    3
# b  300    1    4

This behaviour can be avoided by forcing a deep copy instead of a shallow one when assigning cpy (e.g. with copy()). The deep copy breaks the reference to orig and so prevents orig from being scrambled.
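
As a minimal sketch of that workaround, assuming the same orig as above: wrapping the subset in data.table's copy() gives cpy its own column vectors before setDT() and setkey() touch it. The name orig_before is a made-up snapshot used only to compare orig before and after.

```r
library(data.table)

orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)

# Independent snapshot of orig for the comparison at the end.
orig_before <- copy(orig)

# Deep copy of the subset: cpy no longer shares any column with orig.
cpy <- copy(orig[, c("id", "col1", "col2")])

setDT(cpy)             # convert cpy to a data.table by reference
setkey(cpy, id, col1)  # sort cpy by reference; only cpy's own columns move

print(cpy)
print(identical(orig, orig_before))
```

The cost is one extra copy of the selected columns, which may matter for large tables.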

Is there any way that cpy does not lose the reference to the object orig itself and its omitted columns?

Maybe file a bug with the dtplyr package so they can implement select differently: https://github.com/hadley/dtplyr If you are willing to use data.table syntax (which seems likely), just adding `%>% copy` to your pipeline should work. — Frank, Apr 10 '18 at 17:18
Yes, it's an intended function of data.table. To avoid it, you need to make a deep copy of the data.table, not just assign it a new reference, which is what you did. See here: https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another — Ben K, Apr 10 '18 at 17:34
Welp, this is one of the newest features in R, it seems, called a "shallow copy". Basically, R doesn't make a copy of an object if it wasn't altered or was altered in a way that doesn't require a full copy of the object. An easy way to check if a copy was made is to use `tracemem`. After running `tracemem(orig)`, compare: `cpy <- orig %>% select(-1)` vs. `cpy <- orig %>% mutate(col3 = 8)`. In the first case, only a shallow copy was made. When using base R this shouldn't affect you, but because data.table is using these pointers in order to modify in place, you need to be aware of this. — David Arenburg, Apr 10 '18 at 17:44
Thanks @BenK, after reading this question this makes much more sense. I edited the question accordingly. — bebru, Apr 10 '18 at 19:45
Thanks @DavidArenburg, after playing around with `tracemem` and `.Internal(inspect(cpy))` I noticed this doesn't have to do with `dplyr` or a related package and edited the question. — bebru, Apr 10 '18 at 19:48
@Frank, `copy` seems exactly the intended function to prevent this, thanks! — bebru, Apr 10 '18 at 19:50
Also, take a look at this Q/A https://stackoverflow.com/questions/25945392/update-by-reference-vs-shallow-copy?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa — David Arenburg, Apr 11 '18 at 06:00
