`data.table` objects now have a `:=` operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?

1 Answer
Here is an example showing 10 minutes reduced to 1 second (from NEWS on the homepage). It's like subassigning to a `data.frame`, but it doesn't copy the entire table each time.

    library(data.table)
    m = matrix(1,nrow=100000,ncol=100)
    DF = as.data.frame(m)
    DT = as.data.table(m)
    system.time(for (i in 1:1000) DF[i,1] <- i)
    #    user  system elapsed
    # 287.062 302.627 591.984
    system.time(for (i in 1:1000) DT[i,V1:=i])
    #    user  system elapsed
    #   1.148   0.000   1.158    (511 times faster)

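The "doesn't copy" point can be seen on a tiny table. This is only a minimal sketch: the object names (`DF2`, `DT2`, `DT3`) and columns are made up for illustration, and it assumes a freshly created `data.table` (which is over-allocated, so adding a column needs no reallocation).

```r
library(data.table)

# Toy data; all names here are illustrative, not from the benchmark above.
DF <- data.frame(x = 1:3)
DT <- data.table(x = 1:3)

DF2 <- DF          # base R: copy-on-modify, DF2 is protected from later changes to DF
DT2 <- DT          # no copy is made; DT2 is another name for the same table

DF$y <- 0L         # DF gets the new column, DF2 does not
DT[, y := 0L]      # column added by reference, so it is visible through DT2 as well

names(DF2)
names(DT2)

# copy() gives an independent data.table when that is what you want.
DT3 <- copy(DT)
DT[, z := 1L]
names(DT3)
```

The flip side of the speed is that `DT2 <- DT` is not a backup; use `copy()` for that.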
Putting the `:=` in `j` like that allows more idioms:

    DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
    DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
    DT[,col:=NULL]       # remove a column by reference

and:

    DT[,newcol:=sum(v),by=group]   # like a fast transform() by group

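For anyone who wants to run these idioms, here is a small self-contained sketch. The table and the column names (`group`, `v`, `done`, `newcol`, `total`) are hypothetical, and it assumes the table is keyed on `group` so that `DT["a", ...]` can use the key.

```r
library(data.table)

# Hypothetical toy table for illustration only.
DT <- data.table(group = c("a", "a", "b"), v = c(1, 2, 10))
setkey(DT, group)                    # key on 'group' so DT["a", ...] joins on it

DT["a", done := TRUE]                # flag rows of group 'a'; other rows get NA
DT[, newcol := 42]                   # add a constant column by reference
DT[, total := sum(v), by = group]    # group sum written onto every row of its group
DT[, newcol := NULL]                 # remove the column again, by reference

print(DT)
```

Each line modifies `DT` in place, so there is no need to assign the result back with `DT <- ...`.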
I can't think of any reason to avoid `:=`, other than inside a `for` loop. Since `:=` appears inside `DT[...]`, it comes with the small overhead of the `[.data.table` method, e.g. S3 dispatch and checking for the presence and type of arguments such as `i`, `by`, `nomatch`, etc. So, for use inside `for` loops, there is a low-overhead, direct version of `:=` called `set`. See `?set` for more details and examples. The disadvantages of `set` are that `i` must be row numbers (no binary search) and that you can't combine it with `by`. By imposing those restrictions, `set` can reduce the overhead dramatically.

    system.time(for (i in 1:1000) set(DT,i,"V1",i))
    #    user  system elapsed
    #   0.016   0.000   0.018
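
A typical place where `set()` pays off is a loop over columns. The following is a sketch under assumed data (a hypothetical two-column table with some `NA`s), not a benchmark; it only shows the calling pattern with `i` given as row numbers and `j` as a column name.

```r
library(data.table)

# Hypothetical table with missing values, for illustration only.
DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# Overwrite NAs with 0, one column at a time, by reference.
for (col in names(DT)) {
  na_rows <- which(is.na(DT[[col]]))          # set() needs row numbers in i
  set(DT, i = na_rows, j = col, value = 0)    # no [.data.table overhead per iteration
}

print(DT)
```

The same update could be written with `:=` inside `DT[...]`; `set()` just skips the per-call overhead, which matters when the loop runs many times.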

- Thanks for developing this package. I have a feeling I'm going to be revising a ***lot*** of my code to use this package. – Iterator Aug 11 '11 at 17:48
- Great. Happy to help further. People who revise their code to use data.table often find the amount of code collapses down considerably (easier to debug and maintain); see reviews on Crantastic. I would have loved to fix `<-` in R (and other things) so you wouldn't need to change any code, and I have posted to r-devel, but I can't see a way to make an omelette without breaking some eggs (sorry!) – Matt Dowle Aug 11 '11 at 18:06
- @Matthew Overloading the `<-` operator would make a great question. – Ari B. Friedman Aug 11 '11 at 19:49
- @gsk3 If the question is why I didn't do that, yes that's a great question. You ask, I'll answer :) – Matt Dowle Aug 11 '11 at 20:56
- On chat I was asked to self ask/answer (which apparently is [encouraged](http://meta.stackexchange.com/questions/17463/should-i-ask-a-question-i-know-the-answer-to)) - that question is [here](http://stackoverflow.com/q/7033106/403310) – Matt Dowle Aug 12 '11 at 07:29
- @MatthewDowle Want to include an explanation of when not to use `:=` and to use `set()` instead? – Ari B. Friedman Jul 22 '12 at 17:42
- @Ari Just saw your comment, not sure how I missed it. Good idea - now added. – Matt Dowle Aug 15 '12 at 13:21
- @MatthewDowle I'd +1 again if I could. – Ari B. Friedman Aug 15 '12 at 14:11
- @MattDowle Why the difference in parentheses-use for referencing a column name between the `set(DT, i, "V1", i)` command (you must use parentheses) and the basic `DT[, V1]` (where you don't use parentheses)? – Dr. Beeblebrox May 08 '14 at 08:18
- @jabberwocky Where you say "parentheses", did you mean "quotes"? i.e. why `V1` in one but `"V1"` in the other? Or are you asking about `(` vs `[`? – Matt Dowle May 14 '14 at 09:25
- @MattDowle My mistake, sorry. I mean quotes. – Dr. Beeblebrox May 15 '14 at 07:00
- @jabberwocky No worries, ok let's see. The 3rd argument of `set()` is a column name only (as defined in `?set`). You might want this to be literal (e.g. `"V1"`) or held in a variable (e.g. `colName` which may then contain `"V1"`, `"colA"` or another column's name). The second argument inside `DT[,]` is always an expression evaluated within the scope of the data.table. `DT[,V1]` is the simplest case, but things like `DT[,V1*V2]` and `DT[,sum(V1)]` are more common. Does that help? – Matt Dowle May 16 '14 at 03:12
- @jabberwocky It may help to consider that `DT[,"V1"]` returns simply `"V1"`. This is explained by the very first FAQ 1.1. There is no point of `DT[,"V1"]` really. The behaviour is like that for consistency (i.e. the 2nd argument is always evaluated within scope of the data.table, even in this case) which users requested. It soon becomes natural to use `DT[,V1]` instead. – Matt Dowle May 16 '14 at 03:19
- @MattDowle Sorry for not being clear. I've read the FAQ and understand why I *don't* use the quotes (even though I understand even more now after your explanation!). I meant for my question to be about why you *do* use quotes in `set(DT, i, "V1", i)`. – Dr. Beeblebrox May 16 '14 at 12:40
- @jabberwocky No problem. `set(DT, i, "V1", i)` sets the `"V1"` column whilst `set(DT, i, colVar, i)` sets the column whose name is contained in the `colVar` variable (e.g. if `colVar = "V1"` was done earlier). The quotes indicate to take the column name literally rather than look up the variable. – Matt Dowle May 17 '14 at 13:55
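
To make the quoted-versus-variable distinction from this thread concrete, here is a small sketch. The table and the `colVar` variable are hypothetical; the parenthesised `(colVar) :=` form is the documented way to get the same variable lookup inside `DT[...]`.

```r
library(data.table)

# Hypothetical table and variable, for illustration only.
DT <- data.table(V1 = 1:3, V2 = 4:6)
colVar <- "V2"                               # a column name held in a variable

set(DT, i = 1L, j = "V1", value = 100L)      # quoted: the literal column V1
set(DT, i = 1L, j = colVar, value = 200L)    # unquoted: whatever name colVar holds (V2)

# Inside DT[...], wrap the variable in parentheses on the left of := for the same lookup.
DT[2L, (colVar) := 300L]

print(DT)
```

Without the parentheses, `DT[2L, colVar := 300L]` would create or update a column literally named `colVar`.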