
I am seeing unexpected behavior when modifying a column by group in a data.table:

library(data.table)

# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# copying data to data_temp
data_temp <- data

# assigning some random value to data_temp so that it should no longer be a
# copy of "data"
data_temp[1, "random_value"] <- rnorm(1)

# converting data_temp to data.table
setDT(data_temp)

# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]

data_temp comes out as expected with only the "C" sequence entries remaining. However, I would also expect the "data" object to remain unchanged. This is not the case. The "data" object looks as follows:

   sequence trim random_value
1         A    2           NA
2         A    2           NA
3         B    2           NA
4         B    2           NA
5         B    2           NA
6         C    0           NA
7         C    0           NA
8         C    0           NA
9         D    1           NA
10        D    1           NA

So the assignment by reference of the "trim" variable also happened in the original data.frame.

I am using data.table_1.11.4 and R version 3.4.3 for compatibility reasons.

Is the error a result of using old versions or am I doing something wrong / do I need to change the code to avoid that error?

Phil
  • Read `help("copy")`. – Roland Jul 28 '19 at 19:00
  • Ah thanks. Good to know that it's also necessary to use copy() if the objects that I copy are not actually data.table objects but data.frames, only one of which will become a data.table later. – Phil Jul 28 '19 at 19:08
  • @Roland I was surprised to see that `data_temp[1, "random_value"] <- rnorm(1)` does not copy the entire data.frame, but only the "random_value" vector. So, after this line, the sequence and trim variables of the separate data.frames still point to the same objects in memory. I verified this with `.Internal(inspect(.))`. I wonder how long this behavior has been the default in base R. Maybe since lists were allowed to hold pointers? – lmo Jul 28 '19 at 19:19
  • @David It is unclear that this is a duplicate question. Although the advice to "create a copy before doing anything" will solve both issues, the copying behavior of `<-` differs for data.frame and data.table objects. You can see this by repeating Matt Dowle's example with data.frames and inspecting the memory location of the vectors. This would more accurately mirror the above situation. – lmo Jul 29 '19 at 14:38

1 Answer

As @Roland kindly pointed out in his comment on the question, you need to use copy() to copy an object explicitly before modifying it by reference with data.table. A plain `data_temp <- data` does not duplicate the columns, and `setDT()` and `:=` change their target in place, so a column that both objects still share is modified in both. As @lmo checked, only columns that are changed in just one of the two data.frames by ordinary (non-reference) assignment, such as "random_value" in the example, are actually copied / unlinked.
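
To see which columns the two objects still share, one option is to compare memory addresses with data.table's `address()` function. This is only a rough sketch of that check; the name `shared` is made up for illustration, and which columns report as shared may depend on the R version in use.

```r
library(data.table)

# same setup as in the question
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# plain assignment followed by an ordinary (non-reference) modification
data_temp <- data
data_temp[1, "random_value"] <- rnorm(1)

# "shared" is an illustrative name: TRUE means both objects point to the
# same column vector in memory, FALSE means the column has been unlinked
shared <- sapply(names(data), function(col) {
  address(data[[col]]) == address(data_temp[[col]])
})
print(shared)
```

The "random_value" entry should be FALSE, because that column was rewritten in data_temp. On the setup described in the comments, "sequence" and "trim" would be expected to come out as TRUE, which is why a later `:=` on "trim" also shows up in "data".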

The issue can be easily fixed by using the copy() function:

library(data.table)

# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# copying data to data_temp explicitly
data_temp <- copy(data)

# assigning some random value to data_temp so that it should no longer be a
# copy of "data" - if the copy() function isn't used, that just unlinks the 
# "random_value" column, but not the others
data_temp[1, "random_value"] <- rnorm(1)

# converting data_temp to data.table
setDT(data_temp)

# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]

So it's necessary to use the copy() function whenever you don't want data.table modifications by reference on the copied table to affect the original table (or vice versa) - even if the objects are not (yet) data.table class objects at the time you copy them.
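
A quick way to convince yourself that the original is protected is to keep an independent snapshot of the column and compare it afterwards. This is a minimal sketch under the same setup as above; `trim_before` and `result` are names made up for this check.

```r
library(data.table)

data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# independent snapshot of the column (copy() so it cannot share memory with data)
trim_before <- copy(data$trim)

# explicit copy, then modify the copy by reference
data_temp <- copy(data)
setDT(data_temp)
data_temp[, trim := sum(trim), by = sequence]
result <- data_temp[trim == 0]

# the original data.frame still has its old values and its old class
stopifnot(identical(data$trim, trim_before))
stopifnot(!is.data.table(data))
```

If `copy(data)` is replaced by a plain `data_temp <- data`, the first check is the one to watch, since that is the situation described in the question.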

Phil