understanding optimisation messages on assignment by reference in a data.table

Question

This came up while I was answering this question from @sds here.

First, let me switch on the trace messages for data.table:

library(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")

Now, suppose one wants the sum of all columns grouped by column a. We could do:

dt.out <- dt[, lapply(.SD, sum), by = a]

Now, suppose I also want to add the number of entries that belong to each group to dt.out. I normally assign it by reference as follows:

dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]

In this assignment by reference, one of the messages data.table produces is:

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

This message comes from assign.c in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines; if necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5, so it's not a named vector. And I don't understand what the "recycled list RHS" refers to.

But if I do:

dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]

Then, I get the message:

Direct plonk of unnamed RHS, no copy.

As I understand this, the RHS has been duplicated in the first case, meaning it's making a copy (shallow/deep, this I don't know). If so, why is this happening?

Even if not, why does assignment by reference behave differently internally between the two forms? Any ideas?

To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?

Happy to answer but what's the question in the S.O. sense? I'll answer the first part for now ... — Matt Dowle, Apr 22 '13 at 16:54
I'll edit the question to make clear that, in addition to the points here, I'm asking whether it is *inefficient* to assign using `a := .`. — Arun, Apr 22 '13 at 17:04
At the top I'm fairly sure it should be `options(datatable.verbose=TRUE)`. There isn't a `datatable.warnings` option but nothing would complain that setting it was ineffective. — Matt Dowle, May 03 '13 at 10:17

score 7 · Accepted Answer · edited May 25 '14 at 17:45

Update: The expression,

DT[, c(..., lapply(.SD, .), ...), by=.]

has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).

For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]

But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet. This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.

Where it says NAMED vector it means that in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute or not. The NAMED value in the SEXP structure takes value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.
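To make the copy-on-modify idea concrete, here is a minimal sketch in plain R. It does not inspect the NAMED field itself (how R tracks this internally differs between R versions); it only shows the observable behaviour that the field exists to support. `address()` is data.table's helper for reporting an object's memory address.

```r
library(data.table)  # only needed for address()

x <- c(1, 2, 3)
y <- x                                    # a second symbol bound to the same vector
same_before <- address(x) == address(y)   # no copy has been made yet

y[1] <- 10                                # modifying via y forces R to copy first
same_after <- address(x) == address(y)    # y now has its own copy

same_before   # TRUE
same_after    # FALSE
x             # still 1 2 3
```

Because two symbols referred to the vector, R could not modify it in place. An unnamed temporary has no such constraint, which is why it can be used directly without a copy.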

What would be better is if optimization of j in data.table could handle :

DT[, c(lapply(.SD,sum),.N), by=a]

That works but may be slow. Currently only the simpler form is optimized :

DT[, lapply(.SD,sum), by=a]
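As a sketch, the two forms can be run side by side on the example table from the question. The name given to the `.N` column in the combined form may depend on the data.table version, so it is read by position here rather than by name.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Simpler form: per-group sums only
sums <- dt[, lapply(.SD, sum), by = a]

# Combined form: per-group sums plus the group size in a single grouping pass
both <- dt[, c(lapply(.SD, sum), .N), by = a]

# The group size is the last column of the combined result
group_size <- both[[ncol(both)]]
```

The combined form avoids a second grouping pass and a separate assignment, which is what makes optimising it worthwhile.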

To answer the main question: yes, the following :

Direct plonk of unnamed RHS, no copy.

is desirable compared to :

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

Another way to achieve this is :

dt.out[, count := dt[, .N, by=a]$N]

I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
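The three spellings of the right-hand side can be compared directly. This sketch only checks that they produce the same column; which trace message each one triggers is an internal detail that may differ between data.table and R versions, so none is assumed here. The column names `count1` to `count3` are illustrative.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")
dt.out <- dt[, lapply(.SD, sum), by = a]

options(datatable.verbose = TRUE)    # print trace messages for the assignments below

dt.out[, count1 := dt[, .N, by = a][, N]]
dt.out[, count2 := dt[, .N, by = a][["N"]]]
dt.out[, count3 := dt[, .N, by = a]$N]

options(datatable.verbose = FALSE)   # switch tracing off again
```

Reading the verbose output for each of the three assignments shows which forms are plonked directly and which are duplicated on a given installation.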

edited May 25 '14 at 17:45

Arun

answered Apr 22 '13 at 17:03

Matt Dowle

Thanks Matthew for your patience. Yes, I already benchmarked @djhurio's answer [**here**](http://stackoverflow.com/a/16134117/559784) (in the same post I've linked on top) and found it a tad slower. Sorry, I've not yet familiarised myself with the R-C integration. My question is basically between the two methods of assigning by reference: Do the different messages in each case have anything to do with efficiency (even though I understand that the answer may very well have much to do with reading section 1.1.2)? – Arun Apr 22 '13 at 17:14
Thanks Matthew. Indeed it would be nice if `DT <- DT[, c(lapply(.SD,sum),.N), by=a]` were optimized because then I would be able to discard the old `DT` right away. – sds Apr 23 '13 at 14:06
@sds Ok. Please could you file a feature request. Thanks. – Matt Dowle Apr 23 '13 at 14:32
@MatthewDowle: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2722&group_id=240&atid=978 – sds Apr 23 '13 at 14:47
My html version of `?R-ints` does not have section numbers, but I think I get the idea. Am I right in thinking that I cannot avoid this warning in many cases? My current case is `dt[,bah:=ifelse(!is.na(a),a,b)]`, for example. – Frank May 02 '13 at 17:25
@Frank Sorry I don't follow. – Matt Dowle May 02 '13 at 18:53
@MatthewDowle I'm getting this warning, and was wondering if there is some guidance written up somewhere on how to avoid it (or explaining that it is frequently not possible). It's not really a problem with my current data set (which is small), but I was curious, especially because I can't really see why the assignment I was working with should give this warning. Anyway, if my query still doesn't make sense, that's ok. :) – Frank May 02 '13 at 20:03
@Frank First confusion now fixed, arising from the top of the question. This isn't a warning but a trace message that's only visible when `datatable.verbose` is turned on. – Matt Dowle May 03 '13 at 10:24
@Frank But yes good point. I'm not sure what's happening with that `ifelse`. Have now raised [#2763](https://r-forge.r-project.org/tracker/?group_id=240&atid=978&func=detail&aid=2763) to investigate and potentially fix that. Thanks. – Matt Dowle May 03 '13 at 10:31
@MatthewDowle OK thanks. I tried posting a reply on R-Forge, but it's somewhat lacking in line breaks. I just wanted to say thanks and that the "warning" (extra trace message) is not there for `DT = data.table(DUMMYCOL=1:3); DT[,foo:=ifelse(T,1,2)]`; but comes back if there's only one row: `DT1=data.table(DUMMYCOL=1); DT1[,foo:=ifelse(T,1,2)]`. – Frank May 03 '13 at 12:56