This is from an observation during my answering this question from @sds here.
First, let me switch on the trace messages for data.table
:
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")
Now, suppose one wants to get the sum of all columns grouped by column a
, then, we could do:
dt.out <- dt[, lapply(.SD, sum), by = a]
Now, suppose I'd want to add also the number of entries that belong to each group to dt.out
, then I normally assign it by reference as follows:
dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]
In this assignment by reference, one of the messages data.table
produces is:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
This is a message from a file in data.table's source directory assign.C
. I dont want to paste the relevant snippet here as it's about 18 lines. If necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]]
just gives [1] 5 5
. So, it's not a named vector
. And I don't understand what this recycled list
in RHS is..
But if I do:
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]
Then, I get the message:
Direct plonk of unnamed RHS, no copy.
As I understand this, the RHS has been duplicated in the first case, meaning it's making a copy (shallow/deep, this I don't know). If so, why is this happening?
Even if not, why the changes in assignment by reference between two internally? Any ideas?
To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]]
(compared to the second way of doing it)?