I am trying to write to a subset of rows of a data.table by reference in order to deal with training, testing, and excluded rows of data for a model.

However, when I define this subset of rows and attempt to write to it, it breaks the reference without warning. Conceptually, I know that this works:

library('data.table')

a <- data.table(a1=c(0,1), a2=c(2,3))
a
#    a1 a2
# 1:  0  2
# 2:  1  3

b <- a

b[,b1:=4]
b
#    a1 a2 b1
# 1:  0  2  4
# 2:  1  3  4

a
#    a1 a2 b1
# 1:  0  2  4
# 2:  1  3  4

But what I am trying to do is something like:

a <- data.table(a1=c(0,1), a2=c(2,3))
a
b <- a[1,]
b
#    a1 a2
# 1:  0  2

b[,b1:=4]
b
#    a1 a2 b1
# 1:  0  2  4
a
#    a1 a2
# 1:  0  2
# 2:  1  3

# What I would really like is
#>a
#    a1 a2 b1
# 1:  0  2  4
# 2:  1  3 NA

I am having a hard time reconciling this behavior with the explanation here, which suggests that assigning with `:=` shouldn't break the reference the way `<-` would.

I have a key for every row, so merging the scores back is not a big deal. I'm just curious whether there's a way to pass a subset of rows by reference. Basically, I am trying to call createDataPartition() around some excluded rows, and I find the bookkeeping kind of annoying.
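
For reference, the "merge back by key" fallback mentioned above might look like the sketch below. It assumes a key column, here given the hypothetical name `id`, and uses data.table's update-join form (`on=` together with `:=`). It illustrates the idea only and is not the actual modelling code.

```r
library(data.table)

# 'id' is a hypothetical key column added for illustration
a <- data.table(id = 1:2, a1 = c(0, 1), a2 = c(2, 3))

# Subsetting rows returns a new data.table, so b is independent of a
b <- a[1, ]
b[, b1 := 4]

# Update join: copy b's b1 into the matching rows of a, by reference.
# Rows of a with no match in b get NA in the new column.
a[b, on = "id", b1 := i.b1]
```

After the join, `a` has a `b1` column holding 4 in the row that matches `b` and `NA` in the other row, which is the shape asked for above.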

  • I'm not sure if I'm misunderstanding, but do you just want `a[1, b1 := 4]`? – thelatemail Sep 24 '15 at 04:21
  • @thelatemail I'm trying to understand if I can achieve the same result as `a[1, b1 := 4]` by reference by working on `b`. I would prefer to work on `b` because (in real life) there is a more complicated row selection process to get the rows of `b` than just `a[1,]`, and I need to index against selected rows only for `createDataPartition`, then write some stuff back to the original data.table but using that index. – C8H10N4O2 Sep 24 '15 at 04:26
  • You should be able to operate on `b` and get the same result, e.g. `a <- data.table(a1=c(0,1), a2=c(2,3)); b <- a; b[1, b1 := 4]`. I'm not sure why increasing the complexity of the selection procedure would break this logic. – thelatemail Sep 24 '15 at 04:35
  • @thelatemail thanks for your time. What I am doing now is basically what you suggest -- `a[1, b1 := 4]` -- but then I have to keep track of those indices every time I want to write to the data.table. I was curious whether I could define the indices once and for all by creating a new data.table `b` just pointing to those rows, and still be able to write back to the corresponding rows of the original. If not, there is no point in my creating the data.table `b`. – C8H10N4O2 Sep 24 '15 at 05:11
  • It's funny that you linked that post but didn't follow the steps in the answer. After `b <- a`, running `tracemem(a); tracemem(b)` shows they are the same object, while after `b <- a[1,]`, `tracemem(a); tracemem(b)` shows they aren't. This is called [copy on modify](http://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source). You should really check what `a` and `b` are before jumping to `b[,b1:=4]`. In other words, as long as `a` wasn't changed when it was assigned to `b`, they are the same object; if it was, they are not. – David Arenburg Sep 24 '15 at 05:54
  • @DavidArenburg in my defense, the post was kind of long... :) – C8H10N4O2 Sep 24 '15 at 12:16
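
The check described in the comments can be sketched as follows. This version swaps `tracemem()` for data.table's `address()`, which makes the comparison a single logical value. It then shows the index-based alternative from the thread, where the row selection is stored once in a vector. The names `idx`, `same_after_assign`, and `same_after_subset` are hypothetical, and the selection condition is only a stand-in for the more complicated real one.

```r
library(data.table)

a <- data.table(a1 = c(0, 1), a2 = c(2, 3))

# Plain assignment: b is a second name for the same object
b <- a
same_after_assign <- identical(address(a), address(b))   # TRUE

# Row subset: a new data.table is allocated, so := on b cannot reach a
b <- a[1, ]
same_after_subset <- identical(address(a), address(b))   # FALSE

# Alternative: keep the selection as an index vector and reuse it.
# 'idx' is a hypothetical name; the condition stands in for the real selection.
idx <- which(a$a1 == 0)
a[idx, b1 := 4]   # rows outside idx get NA in the new column
```

Storing `idx` once means the selection logic is written in a single place, while every write still goes through `a`, so the reference is never broken.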
