I have a question about data.table. I love it, but I think I was (and still am) sometimes misusing .SD, and I would appreciate some clarification about when it is worth using in data.table.
Here are two examples where I came to think that I was misusing .SD:
The first one is as discussed here (thanks to Henry's comment):
library(microbenchmark)
library(data.table)
DTlength <- 2000
DT <- data.table(
  id = rep(sapply(combn(LETTERS, 6, simplify = FALSE), function(x) {
    paste(x, collapse = "")
  }), each = 4)[1:DTlength],
  replicate(10, sample(1001, DTlength, replace = TRUE)),
  Answer = sample(c("Yes", "No"), DTlength, TRUE)
)
microbenchmark(
  "without SD" = {
    b <- DT[, Answer[1], by = id][, V1]
  },
  "without SD alternative" = {
    b <- DT[DT[, .I[1], by = id][, V1], Answer]
  },
  "with SD" = {
    b <- DT[, .SD[1, Answer], by = id][, V1]
  }
)
Unit: microseconds
                   expr        min         lq        mean     median         uq        max neval
             without SD    455.795    493.949    569.4979    529.847    558.564   2323.283   100
 without SD alternative    961.231   1010.667   1160.9114   1060.513   1113.641   7783.798   100
                with SD 121217.691 123557.590 131071.5699 127495.437 130340.977 240317.227   100
.SD operations are quite slow compared to the alternatives in grouped operations: here the median is about 127 ms with .SD versus about 0.5 ms without it. Even when you want every column of the data.table for each group, the alternative is slightly faster (although the time difference here is probably not worth the loss of clarity in the syntax):
microbenchmark(
  "with SD" = {b <- DT[, .SD[1], by = id]},
  "Without SD" = {b <- DT[DT[, .I[1], by = id][, V1]]}
)
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
    with SD 1.058872 1.361436 1.560866 1.643078 1.741540 1.960206   100
 Without SD 1.067898 1.169642 1.279443 1.233437 1.348719 1.781334   100
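As a sanity check that the three expressions from the first benchmark really compute the same thing, here is a minimal sketch on a small made-up table (`toy` and its values are my own illustration, not part of the benchmark data):

```r
library(data.table)

# Toy table (hypothetical data, much smaller than the benchmark table)
toy <- data.table(
  id     = c("a", "a", "b", "b", "b"),
  V1     = c(10L, 200L, 5L, 50L, 300L),
  Answer = c("Yes", "No", "No", "Yes", "Yes")
)

# First Answer of each group, computed three ways
without_sd     <- toy[, Answer[1], by = id][, V1]
without_sd_alt <- toy[toy[, .I[1], by = id][, V1], Answer]
with_sd        <- toy[, .SD[1, Answer], by = id][, V1]

print(without_sd)                             # "Yes" "No"
print(identical(without_sd, without_sd_alt))  # TRUE
print(identical(without_sd, with_sd))         # TRUE
```

So the variants differ only in speed, at least on this small example.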
The second example illustrates that you can't really use .SD to assign a new variable conditionally within groups (or I didn't find a way to do it):
DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id] # doesn't assign plouf2
DT[DT[, .I[V1 - V1[1] > 100], by = id][, V1], plouf2 := Answer] # this does
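To make the working `.I` version concrete, here is a minimal sketch on a small made-up table (`toy` and `rows` are hypothetical names used only for this illustration). My understanding, which I have not verified in the data.table source, is that `.SD[...]` returns a copy of the group's rows, so `:=` on it would modify that copy rather than the original table:

```r
library(data.table)

# Toy table (hypothetical data)
toy <- data.table(
  id     = c("a", "a", "b", "b", "b"),
  V1     = c(10L, 200L, 5L, 50L, 300L),
  Answer = c("Yes", "No", "No", "Yes", "Yes")
)

# Row numbers in toy where V1 exceeds the group's first V1 by more than 100
rows <- toy[, .I[V1 - V1[1] > 100], by = id][, V1]
print(rows)  # 2 5

# Assign by reference on those rows only; all other rows get NA
toy[rows, plouf2 := Answer]
print(toy)
```

Only rows 2 and 5 receive a value in `plouf2`, which is the behaviour I was after.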
There are two situations where I found .SD useful: the very convenient DT[, lapply(.SD, fun), .SDcols = ] idiom, and the case where one wants to assign to every row of a group a value taken from a row that meets a particular condition within the group:
DT[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id]
# every row of the group is assigned, which is actually different from
DT[DT[, .I[V1 - V1[1] > 100][1], by = id][, V1], plouf2 := Answer]
# where only the first row in each group that matches V1 - V1[1] > 100 is assigned
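For reference, the two uses of .SD mentioned above can be sketched on a small made-up table (`toy`, its values, and the choice of `mean` are assumptions made only for this illustration):

```r
library(data.table)

# Toy table (hypothetical data)
toy <- data.table(
  id     = c("a", "a", "b", "b", "b"),
  V1     = c(10, 200, 5, 50, 300),
  V2     = c(1, 3, 2, 4, 6),
  Answer = c("Yes", "No", "No", "Yes", "Yes")
)

# Use 1: apply a function to selected columns within each group
means <- toy[, lapply(.SD, mean), by = id, .SDcols = c("V1", "V2")]
print(means)

# Use 2: give every row of a group the Answer of the first row
# in that group where V1 exceeds the group's first V1 by more than 100
toy[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id]
print(toy$plouf3)  # "No" "No" "Yes" "Yes" "Yes"
```

In the second use, a group with no matching row would get NA, because indexing an empty vector with `[1]` returns NA.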
So my question is: are there other situations where using .SD is necessary or worthwhile?
Thank you in advance for the help.