I have been trying to work with data.table as much as I can. But I do not always fully understand the syntax. I found this line in my code, but I cannot figure out what it does. Could someone perhaps explain it to me?

df <- setDT(df)[, .SD[1], by = .(ID, year)]

I am especially unsure about the [1] in .SD[1]. Does it have something to do with subsetting to one row per ID-year?

Tom
  • http://franknarf1.github.io/r-tutorial/_book/tables.html#tables https://stackoverflow.com/tags/data.table/info – jogo Aug 02 '19 at 07:57
  • It selects the first row for each `(ID, year)`. – Ronak Shah Aug 02 '19 at 08:04
  • Fyi, `setDT(df); unique(df, by=c("cyl", "am"))` does almost the same thing (except doesn't put cyl, am as the leftmost columns) – Frank Aug 02 '19 at 16:26

2 Answers

.SD[1] selects the first row of each group. Here the groups are specified by the by argument, i.e. each combination of ID and year.

We can take an example using the mtcars dataset:

library(data.table)
df <- mtcars
setDT(df)[, .SD[1L], by = .(cyl, am)]

#   cyl am  mpg  disp  hp drat    wt  qsec vs gear carb
#1:   6  1 21.0 160.0 110 3.90 2.620 16.46  0    4    4
#2:   4  1 22.8 108.0  93 3.85 2.320 18.61  1    4    1
#3:   6  0 21.4 258.0 110 3.08 3.215 19.44  1    3    1
#4:   8  0 18.7 360.0 175 3.15 3.440 17.02  0    3    2
#5:   4  0 24.4 146.7  62 3.69 3.190 20.00  1    4    2
#6:   8  1 15.8 351.0 264 4.22 3.170 14.50  0    5    4

So here it selects the first row for each combination of cyl and am.
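
To connect this back to the question's `by = .(ID, year)`, here is a minimal sketch on toy data. The table and its `value` column are made up for illustration; only `ID` and `year` come from the question. It also shows that `head(.SD, 1)` and `unique(..., by = )` pick the same rows.

```r
library(data.table)

# Toy data; the values and the `value` column are illustrative assumptions
dt <- data.table(
  ID    = c(1, 1, 1, 2, 2),
  year  = c(2000, 2000, 2001, 2000, 2000),
  value = c(10, 11, 12, 13, 14)
)

# .SD holds the rows of the current group (without the grouping columns),
# so .SD[1] keeps only the first row of each (ID, year) group
first_rows <- dt[, .SD[1], by = .(ID, year)]
print(first_rows)

# Two other spellings that select the same rows
via_head   <- dt[, head(.SD, 1), by = .(ID, year)]
via_unique <- unique(dt, by = c("ID", "year"))

stopifnot(
  identical(first_rows$value, via_head$value),
  identical(first_rows$value, via_unique$value)
)
```

One row survives per distinct (ID, year) pair, and it is always the one that appears first in the table's current row order.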

Ronak Shah
  • @Tom see also this answer, which is now a vignette in the development version of `data.table`: https://stackoverflow.com/a/47406952/3576984 – MichaelChirico Aug 04 '19 at 06:47

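We can use .I, which would be more efficient. .I[1L] returns the row number of the first row in each group, and we then subset the table by those row numbers: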

library(data.table)
df <- copy(mtcars)
setDT(df)[df[, .I[1L], by = .(cyl, am)]$V1]
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#2: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#3: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#4: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#5: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#6: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
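
The one-liner above does two things at once, so it may help to split it. This is only a sketch of the same idea; the names `idx` and `first_rows` are mine, and `as.data.table()` is used instead of `setDT()` purely to keep the example self-contained.

```r
library(data.table)
df <- as.data.table(mtcars)

# Step 1: the row number of the first row in each (cyl, am) group.
# The result column is unnamed in j, so data.table calls it V1.
idx <- df[, .I[1L], by = .(cyl, am)]
print(idx)

# Step 2: use those row numbers as an ordinary row subset.
# Columns keep their original order because no grouping happens here.
first_rows <- df[idx$V1]
print(first_rows)
```

Because the final step is a plain row subset, the grouping columns are not moved to the left, which is why this output differs in column order from the .SD[1L] version.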

Benchmarks

set.seed(24)
dt <- data.table(grp = rep(1:1e6, each = 20), 
    as.data.frame(matrix(rnorm( 20000000 * 20), ncol = 20)))

system.time({
    dt[, .SD[1L], by = .(grp)]
})
#   user  system elapsed 
#  2.018   0.309   0.532 

system.time({
    dt[dt[, .I[1L], by = .(grp)]$V1]
})
#   user  system elapsed 
#  1.218   0.327   0.273 
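
The benchmark table above needs several GB of memory. For a quick check that both approaches return the same rows, a much smaller table is enough. The sizes and the column names `x` and `y` below are arbitrary assumptions, and timings on your machine and data.table version may differ from the ones shown.

```r
library(data.table)
set.seed(24)

# Scaled-down stand-in for the benchmark table: 1000 groups of 20 rows each
small <- data.table(
  grp = rep(1:1000, each = 20),
  x   = rnorm(20000),
  y   = rnorm(20000)
)

via_sd <- small[, .SD[1L], by = .(grp)]
via_i  <- small[small[, .I[1L], by = .(grp)]$V1]

# Both approaches pick the first row of every group
stopifnot(
  identical(via_sd$grp, via_i$grp),
  identical(via_sd$x, via_i$x),
  identical(via_sd$y, via_i$y)
)

# Rough timing comparison; results depend on the machine and version
print(system.time(small[, .SD[1L], by = .(grp)]))
print(system.time(small[small[, .I[1L], by = .(grp)]$V1]))
```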
akrun
  • According to https://github.com/Rdatatable/data.table/issues/735, .SD[1] should be optimized now. If it's still slower, maybe worth reporting or illustrating with a benchmark – Frank Aug 02 '19 at 16:28
  • @Frank Yes, it improved, still `.I` shows better timing – akrun Aug 03 '19 at 02:02