Having some trouble with the data.table syntax

Question

I have been trying to work with data.table as much as I can. But I do not always fully understand the syntax. I found this line in my code, but I cannot figure out what it does. Could someone perhaps explain it to me?

df <- setDT(df)[, .SD[1], by = .(ID, year)]

I am especially unsure about the [1] in .SD[1]. Does it have something to do with subsetting to one row per ID-year?

http://franknarf1.github.io/r-tutorial/_book/tables.html#tables https://stackoverflow.com/tags/data.table/info — jogo, Aug 02 '19 at 07:57
Fyi, `setDT(df); unique(df, by=c("cyl", "am"))` does almost the same thing (except doesn't put cyl, am as the leftmost columns) — Frank, Aug 02 '19 at 16:26
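To illustrate the comment above, here is a minimal sketch on mtcars comparing the two approaches. The variable names `first_sd` and `first_unq` are only illustrative, and `as.data.table()` is used instead of `setDT()` to leave the original data untouched.

```r
library(data.table)

df <- as.data.table(mtcars)

# First row per (cyl, am) group; grouping columns come first in the result
first_sd <- df[, .SD[1L], by = .(cyl, am)]

# First occurrence of each (cyl, am) combination; original column order is kept
first_unq <- unique(df, by = c("cyl", "am"))

names(first_sd)
names(first_unq)

# After aligning the column order, the two results can be compared directly
setcolorder(first_unq, names(first_sd))
all.equal(first_sd, first_unq)
```

The only difference that should show up before `setcolorder()` is the position of the cyl and am columns.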

score 3 · Accepted Answer · answered Aug 02 '19 at 10:24

.SD[1] selects the first row of each group. The groups are specified by the by argument, which here is ID and year, so you get one row per ID-year combination.

We can take an example using mtcars dataset

library(data.table)
df <- mtcars
setDT(df)[, .SD[1L], by = .(cyl, am)]

#   cyl am  mpg  disp  hp drat    wt  qsec vs gear carb
#1:   6  1 21.0 160.0 110 3.90 2.620 16.46  0    4    4
#2:   4  1 22.8 108.0  93 3.85 2.320 18.61  1    4    1
#3:   6  0 21.4 258.0 110 3.08 3.215 19.44  1    3    1
#4:   8  0 18.7 360.0 175 3.15 3.440 17.02  0    3    2
#5:   4  0 24.4 146.7  62 3.69 3.190 20.00  1    4    2
#6:   8  1 15.8 351.0 264 4.22 3.170 14.50  0    5    4

So here it selects the first row for each combination of cyl and am.
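It may help to see that .SD is just a data.table holding the rows of the current group (without the grouping columns), so anything inside the brackets after .SD is an ordinary row subset. A small sketch, again on mtcars, with illustrative variable names:

```r
library(data.table)

df <- as.data.table(mtcars)

# How many rows .SD contains for each group
group_sizes <- df[, .(rows_in_SD = nrow(.SD)), by = .(cyl, am)]

# First two rows of each group (head() avoids NA rows for small groups)
first_two <- df[, head(.SD, 2L), by = .(cyl, am)]

# Last row of each group: .N is the number of rows in the current group
last_row <- df[, .SD[.N], by = .(cyl, am)]

group_sizes
first_two
last_row
```

So the 1 in .SD[1] is simply the row number within each group; swapping it for .N or a range changes which rows are kept.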

@Tom see also this answer, which is now a vignette in the development version of `data.table`: https://stackoverflow.com/a/47406952/3576984 — MichaelChirico, Aug 04 '19 at 06:47

akrun · Answer 2 · 2019-08-03T02:01:57.733

We can use .I (the row indices of each group), which can be more efficient than .SD[1]:

library(data.table)
df <- copy(mtcars)
setDT(df)[df[, .I[1L], by = .(cyl, am)]$V1]
#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#2: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#3: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#4: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#5: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#6: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
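To make the one-liner above easier to read, it can be split into two steps. This is only a sketch; `idx` is an illustrative name, and `as.data.table()` is used here instead of `copy()` plus `setDT()`.

```r
library(data.table)

df <- as.data.table(mtcars)

# Step 1: for each (cyl, am) group, keep the row number of its first row.
# The unnamed result column is called V1 by default.
idx <- df[, .I[1L], by = .(cyl, am)]
idx

# Step 2: use those row numbers to subset the original table.
# The columns stay in their original order because this is a plain row subset.
first_rows <- df[idx$V1]
first_rows
```

Because step 2 is an ordinary subset by row number, cyl and am are not moved to the left, which is why the output above differs in column order from the .SD[1] result.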

Benchmarks

set.seed(24)
dt <- data.table(grp = rep(1:1e6, each = 20), 
    as.data.frame(matrix(rnorm( 20000000 * 20), ncol = 20)))

system.time({
    dt[, .SD[1L], by = .(grp)]
})
#   user  system elapsed 
#  2.018   0.309   0.532 

system.time({
    dt[dt[, .I[1L], by = .(grp)]$V1]
})
#   user  system elapsed 
#  1.218   0.327   0.273

According to https://github.com/Rdatatable/data.table/issues/735, .SD[1] should be optimized now. If it's still slower, maybe worth reporting or illustrating with a benchmark — Frank, Aug 02 '19 at 16:28
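For anyone who wants to check this on their own machine, here is a scaled-down sketch of the same comparison. The table size and column names are arbitrary assumptions chosen so it runs quickly; timings will depend on the data.table version and the number of threads, so none are shown.

```r
library(data.table)

set.seed(24)
n_grp <- 1e4  # number of groups (arbitrary, much smaller than the benchmark above)
small <- data.table(
  grp = rep(seq_len(n_grp), each = 5),
  v1  = rnorm(n_grp * 5),
  v2  = rnorm(n_grp * 5)
)

# Time both approaches and keep the results for comparison
t_sd <- system.time(res_sd <- small[, .SD[1L], by = .(grp)])
t_i  <- system.time(res_i  <- small[small[, .I[1L], by = .(grp)]$V1])

t_sd
t_i

# Both approaches should return the same rows
all.equal(res_sd, res_i)
```

Increasing `n_grp` makes any difference between the two easier to see.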
