I have a question about data.table. I love it, but I think I have sometimes been misusing .SD, and I would appreciate some clarification about when it is worth using.

Here are two examples that made me think I was misusing .SD:

The first one is discussed here (thanks to Henrik for the comment):

library(microbenchmark)
library(data.table)

DTlength <- 2000
DT <-
  data.table(
    id = rep(sapply(combn(LETTERS, 6, simplify = FALSE), function(x) {
      paste(x, collapse = "")
    }), each = 4)[1:DTlength],
    replicate(10, sample(1001, DTlength, replace = TRUE)),
    Answer = sample(c("Yes", "No"), DTlength, TRUE)
  )

microbenchmark(
  "without SD" = {
    b <- DT[, Answer[1], by = id][, V1]
  },
  "without SD alternative" = {
    b <- DT[DT[, .I[1], by = id][, V1], Answer]
  },
  "with SD" = {
    b <- DT[, .SD[1, Answer], by = id][, V1]
  }
)

Unit: microseconds
                   expr        min         lq        mean     median         uq        max neval
             without SD    455.795    493.949    569.4979    529.847    558.564   2323.283   100
 without SD alternative    961.231   1010.667   1160.9114   1060.513   1113.641   7783.798   100
                with SD 121217.691 123557.590 131071.5699 127495.437 130340.977 240317.227   100

Operations on .SD are quite slow compared to the alternatives in grouped operations. Even if you want to apply the grouping to the entire data.table, the alternative is slightly faster (although the time difference here is probably not worth the loss of syntactic clarity):

microbenchmark(
  "with SD" = {b <- DT[, .SD[1], by = id]},
  "Without SD" = {b <- DT[DT[, .I[1], by = id][, V1]]}
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
    with SD 1.058872 1.361436 1.560866 1.643078 1.741540 1.960206   100
 Without SD 1.067898 1.169642 1.279443 1.233437 1.348719 1.781334   100

The second example illustrates that you can't really use .SD to assign a new variable conditionally within groups (or at least I didn't find a way):

DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id] # doesn't assign plouf2
DT[DT[, .I[V1 - V1[1] > 100], by = id][, V1], plouf2 := Answer] # this does

There are two situations where I found .SD useful: the DT[, lapply(.SD, fun), .SDcols = ] idiom, which is very convenient, and the case where one wants to assign to every row of a group a value taken from the rows that meet a condition within that group:

DT[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id]
# all rows of each group are assigned, which is actually different from
DT[DT[, .I[V1 - V1[1] > 100][1], by = id][, V1], plouf2 := Answer]
# where only the first row per group that matches the condition V1 - V1[1] > 100 is assigned

So my question is: are there other situations where using .SD is necessary or worthwhile?

Thank you in advance for the help.

denis
  • First question, possible duplicate of [Subset by group with data.table](https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table): "the main reason the OP is slow is not just that it has `.SD` in it, but the fact that it uses it in a particular way - by calling `[.data.table`, which at the moment has a huge overhead, so running it in a loop (when one does a `by`) accumulates a very large penalty". See also [Optimize .SD query to keep the elegance but make it faster](https://github.com/Rdatatable/data.table/issues/613) and links therein. – Henrik Nov 28 '17 at 19:27
  • Thank you for the links. – denis Nov 29 '17 at 10:57
  • I edited the question to be more precise. – denis Nov 30 '17 at 08:10
  • You will usually want to use `.SD` when you want to operate over multiple columns (like you already showed) or after a certain operation to get in return few (or all columns back) for instance `DT[, if(any(x > 2)) .SD, by = y]` or `DT[, .SD[1L], by = x]`. You can also use it for a conditional join such as `DT[x > 2, .SD[DT, x, on = .(y)]]`. Other than that, I don't really see a reason to use it and you probably use the actual vectors. – David Arenburg Dec 04 '17 at 22:41
  • this question is too broad, and I'd vote to close it if I could; I use `.SD` when it's useful, which is fairly often - it's not a good answer, but that's mainly because this is not a good question – eddi Dec 08 '17 at 19:05
  • Possible duplicate of [What does .SD stand for in data.table in R](https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r). Especially given the recent, very thorough [answer](https://stackoverflow.com/a/47406952/1851712) by @MichaelChirico. – Henrik Dec 08 '17 at 19:29
  • Thanks for the link @Henrik. Yes the answer by MichaelChirico is a wonderful answer to my question. Thanks a lot. – denis Dec 09 '17 at 19:47
  • @eddi sorry that the question is that broad, but I really come often to this question. I didn't find the post https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r that actually answer most of the question (so there is a good answer, even to a bad question) – denis Dec 09 '17 at 19:50

1 Answer

Regarding your first question

The benchmark is only fair if all three methods generate the same output. The "without SD alternative" method generates a different result, so let's set that one aside.

The "with SD" and "without SD" methods generate the same output, but the latter is more efficient. Here is why: when you write ... .SD[1, Answer] ... you are subsetting all of the columns for the rows of each group, and only then performing the next operation (fetching the first value of the vector Answer) on that subset. In the "without SD" method, by contrast, you subset a single vector (not all of them) and fetch the first value of that one vector. The unnecessary subsetting of the additional, unused columns in the "with SD" method is what makes it slow.
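To make the comparison concrete, here is a minimal sketch on a small toy table (the data are hypothetical, not the benchmark table from the question). It only checks that the two forms return the same values, and shows one way to keep .SD while limiting it to the column actually needed via .SDcols; it makes no claim about timings.

```r
library(data.table)

# Toy data (hypothetical; much smaller than the table in the question)
DT <- data.table(
  id     = rep(c("a", "b", "c"), each = 2),
  V1     = c(5L, 3L, 8L, 1L, 9L, 2L),
  Answer = c("Yes", "No", "No", "No", "Yes", "Yes")
)

# Direct column access: j refers only to the Answer vector
direct <- DT[, Answer[1], by = id]

# Via .SD: a per-group table of all non-grouping columns is subset with `[`
via_sd <- DT[, .SD[1, Answer], by = id]

# Keeping .SD but restricting it to the column of interest
via_sdcols <- DT[, .SD[1], by = id, .SDcols = "Answer"]

# Check that the results agree before comparing speed
all.equal(direct, via_sd)
identical(direct$V1, via_sdcols$Answer)
```

Checking equality first also addresses the fairness point made above: a benchmark is easier to interpret when every method returns the same object.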

Regarding your second question

This command does not assign the values to DT:

DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

The reason is that .SD is one-way: if you change something in the subset that .SD gives you, the change is not applied back to the larger data.table, only to the subset itself. Calling that subset an in-memory copy is not quite fair, because .SD does not actually copy the data (it just points to the portion of memory that holds the subset of interest), but the point stands: assignments made through it reach only that subset and not the original underlying data.
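The behaviour described above can be checked with a small sketch. The toy data are hypothetical, and the outcome of the first call is taken from the question and this answer; it may differ across data.table versions. The second half uses the .I approach from the question, which does modify DT by reference.

```r
library(data.table)

# Toy data (hypothetical), with one row per group meeting the condition
DT <- data.table(
  id     = rep(c("a", "b"), each = 3),
  V1     = c(1, 50, 200, 10, 150, 20),
  Answer = c("Yes", "No", "Yes", "No", "Yes", "No")
)

# := is applied to the per-group subset; the grouped result is a new table
res <- DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]
names(res)  # inspect the columns of the returned table
names(DT)   # inspect the columns of the original table

# Row numbers from .I let := act on DT itself
idx <- DT[, .I[V1 - V1[1] > 100], by = id]$V1
DT[idx, plouf2 := Answer]
names(DT)
```

Comparing `names(DT)` before and after the second call shows where the assignment actually lands.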

Note: You could argue, then, that .SD should not support assignment at all. I don't know what Matt Dowle thinks, but in my humble opinion, the assignment is actually a useful feature! For instance:

DT.2 <- DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

This way I have a very short, highly readable piece of code that generates the output I want and stores it in a new data.table without modifying the original one! Any other way I can think of to generate this exact output without using .SD and without touching the original data.table involves much longer code.

Regarding your last question

.SD is useful when you want to deal with many or all columns of a data.table, not just one or a few. (This is why the "with SD" method you used in the first part is not an appropriate way to do what you want.) The examples provided in What does .SD stand for in data.table in R demonstrate well when .SD can be very handy. In my opinion, the main advantage of .SD is not the efficiency with which the code runs but the efficiency with which you can turn a concept into R code, and the readability of that code.
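As an illustration of the many-columns case, here is a minimal sketch of the lapply(.SD, fun) idiom with .SDcols that the question mentions. The table and the `num_cols` vector are hypothetical names introduced only for this example.

```r
library(data.table)

# Toy data (hypothetical)
DT <- data.table(
  id = c("a", "a", "b", "b"),
  V1 = c(1, 3, 10, 20),
  V2 = c(2, 4, 30, 50)
)

# Columns to operate on; .SDcols limits .SD to these
num_cols <- c("V1", "V2")

# One summary row per group, one column per entry in num_cols
means <- DT[, lapply(.SD, mean), by = id, .SDcols = num_cols]

# Same idiom with := to centre each column within its group, by reference
DT[, (num_cols) := lapply(.SD, function(x) x - mean(x)),
   by = id, .SDcols = num_cols]
```

Writing the same thing without .SD would mean spelling out each column by hand, which is where the readability gain comes from.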

Merik
  • Thanks for your answer. The examples provided in https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r are indeed really nice. Thank you for the explanations – denis Dec 09 '17 at 19:52
  • By the way, the different methods generate the same result. The "without SD alternative" generated a character vector whereas the other two generated data.tables, but the result is the same. I edited the post so the results are 100% identical. – denis Dec 11 '17 at 08:08
  • Well, they are *semantically* the same (to you and I) but *technically* different (the processing and memory footprint of making a data.table is different than that of a vector). A good benchmark is one in which the results of all methods are **exactly** identical. – Merik Dec 11 '17 at 21:12