While constructing expressions to put in the j-slot of a [.data.table call, it would often be helpful to be able to examine and play around with the contents of .SD.

This naive attempt doesn't work...

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

DT[, browser(), by=x]
# Called from: `[.data.table`(DT, , browser(), by = x)
Browse[1]> 
Browse[1]> .SD
# NULL data.table

... even though a variable named .SD and several others related to the current data.table subset are all present in the local environment:

Browse[1]> ls(all.names = TRUE)
#  [1] ".BY"       ".GRP"      ".I"        ".iSD"      ".N"        ".SD"      
#  [7] "Cfastmean" "mean"      "print"     "x"        
Browse[1]> .N
# [1] 3
Browse[1]> .I
# [1] 4 5 6

Using .I, I can view something more or less like .SD, but it would be nice to be able to access its value directly:

Browse[1]> DT[.I]
#    x y v
# 1: b 1 4
# 2: b 3 5
# 3: b 6 6

My questions: Why is the expected value of .SD not directly available from within a browser() call (while .I, .N, .GRP and .BY are)? Is there some alternative way to access the value of .SD?

Josh O'Brien
  • I wonder, at the time `browser()` is called, is `.SD` actually populated with anything? `str(.SD)` shows `Classes ‘data.table’ and 'data.frame': 0 obs. of 0 variables` etc – Gavin Simpson Mar 27 '13 at 19:58
  • @GavinSimpson -- I think you're probably on to something there. The partial answer I just added seems like additional evidence in that direction. I wonder also if delayed evaluation of `.SD` is somehow involved. – Josh O'Brien Mar 27 '13 at 20:04

1 Answer

Updated in light of Matthew Dowle's comments:

It turns out that .SD is, internally, the environment within which all j expressions are evaluated, including those which don't explicitly reference .SD at all. Filling it with all of DT's columns for each subset of DT is not cheap, timewise, so [.data.table() won't do so unless it really needs to.

Instead, making great use of R's lazy-evaluation of arguments, it previews the unevaluated j expression, and only adds to .SD columns that are referenced therein. If .SD itself is mentioned, it adds all of DT's columns.
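The previewing step can be imitated in a few lines of base R. This is only a rough sketch of the idea, not data.table's actual internals: `needed_columns` and `cols` are hypothetical names made up for illustration. It relies on `substitute()` to capture the argument without evaluating it and on `all.vars()` to list the symbols it mentions.

```r
# Hypothetical helper (not part of data.table): decide which columns a
# j-expression would need, without ever evaluating that expression.
needed_columns <- function(j, cols) {
  vars <- all.vars(substitute(j))   # symbols mentioned in the unevaluated j
  if (".SD" %in% vars) {
    cols                            # .SD is mentioned: every column is needed
  } else {
    intersect(cols, vars)           # otherwise only the columns referenced
  }
}

cols <- c("y", "v")                 # assumed non-grouping columns of DT

needed_columns(.N * y, cols)             # "y"
needed_columns({browser(); .SD}, cols)   # "y" "v"
```

Because `j` is never forced, `browser()` is not actually called in the second example; the expression is only inspected, which is also why the position of `.SD` inside it makes no difference.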

So, to view .SD, just include some reference to it in the j-expression. Here is one of many expressions that will work:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

## This works
DT[, if(nrow(.SD)) browser(), by=x]
# Called from: `[.data.table`(DT, , if (nrow(.SD)) browser(), by = x)
Browse[1]> .SD
#    y v
# 1: 1 1
# 2: 3 2
# 3: 6 3

And here are a couple more:

DT[,{.SD; browser()}, by=x]
DT[,{browser(); .SD}, by=x]  ## Notice that order doesn't matter

To see for yourself that .SD just loads columns needed by the j-expression, run these each in turn (typing .SD when entering the browser environment, and Q to leave it and return to the normal command-line):

DT[, {.N * y ; browser()}, by=x]
DT[, {v^2 ; browser()}, by=x]
DT[, {y*v ; browser()}, by=x]
Josh O'Brien
  • Section 2.1 of [the `data.table` FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) refers to the huge slow-down that using `.SD` can entail. – Josh O'Brien Mar 27 '13 at 20:49
  • Josh, not really. Section 2.1 of the FAQ recommends using `.SD`, but *not* with `with = FALSE`: "`.SD` object is efficiently implemented internally and more efficient than passing an argument to a function. Please don't do this though: `DT[,.SD[,"sales",with=FALSE],by=grp]`". – Arun Mar 27 '13 at 21:08
  • @Arun -- Is it just recommending against using `with=FALSE`? I guess I have to admit I've never understood what exactly is being said in that section, so have just avoided `.SD` when I can. – Josh O'Brien Mar 27 '13 at 21:12
  • Yes. `with = FALSE` basically mimics a `data.frame` equivalent. For example, if you have to access the 3rd column, you can do: `DT[, 3, with = FALSE]`. But without that, `.SD` is definitely faster, especially when operated on columns (or column-wise). – Arun Mar 27 '13 at 21:45
  • +1 Yes exactly. The mere existence of the symbol `.SD` in `j` triggers `.SD` to be populated with all columns of the subset. Otherwise, `.SD` is only populated with the columns that `j` needs. Internally, at C level, `.SD` is the static environment in which `j` is evaluated. It's where all the magic happens, regardless of whether `.SD` is used (as a symbol) by `j` or not. This is what FAQ 3.1 point 1 is referring to. And is why we advise not to use `.SD` unless `j` really does need all columns and all rows from it (e.g. `DT[,lapply(.SD,sum),by=...]`). – Matt Dowle Mar 27 '13 at 21:52
  • @MatthewDowle -- Very enlightening comment. Thanks! (1) Following up on what you said, even this will work: `DT[, {browser(); .SD}, by=x]`. Will edit my answer to reflect that when I get a chance. (2) Re: Arun's comment, should these two actually be equally fast: `DT[,.N*.SD[,v], by=x]; DT[,.N*.SD[,"v",with=FALSE], by=x]`? Does `[.data.table` optimize the first `j` expression but not the second? Or, since they both make reference to `.SD`, do they both take the same hit? (3) Again, thanks for all your work on this project! – Josh O'Brien Mar 27 '13 at 22:59
  • (1) Yes the order doesn't matter. The entire `j` expression is inspected first, once, before grouping commences. (2) Both equally slow and equally bad practice (see end of FAQ 2.1) because both use `.SD` wastefully and also both `j`s call `[.data.table` which has overhead when looped or grouped. But let me know if benchmarks prove me wrong! (3) No problem, thanks! – Matt Dowle Mar 27 '13 at 23:08
  • And this is all thanks to lazy argument evaluation in R. Which is one reason why `data.table` is in R, rather than Python or Julia for example. Without R's lazy evaluation it wouldn't be possible to inspect `i` and `j` and optimize them before evaluation, iiuc. – Matt Dowle Mar 27 '13 at 23:18
  • @MatthewDowle Thanks for the insightful comments here. All very interesting. – Gavin Simpson Mar 28 '13 at 01:48
  • @MattDowle Now that data.table is being implemented in Python, I'm curious about your statement that data.table relies on R's unique lazy evaluation framework. What's changed that's allowed data.table to be implemented in Python? Or is it not a full implementation? – Michael Oct 07 '19 at 06:33