Any way to use `by` variable in data.table to operate on rows (rather than columns) by group

Question

I am wondering if there is any way to use grouping in data.tables to select rows from within each group. The easiest way to explain what I'm looking for is by way of a minimal example.

1) A toy data.table to work with:

set.seed(123)
mydt = data.table(
  grp  = rep(LETTERS[1:5], each = 20),
  site = rep(letters[1:20], 5),
  V1 = rnorm(100))

2) This doesn't work:

Now suppose we want to extract the row from each group of grp for which V1 is the smallest. The naiive approach below fails for obvious reasons:

mydt[which.min(V1), .(site, V1), by=grp]
#    grp site        V1
# 1:   D    l -2.309169

This only returns a single value (the smallest V1 in the whole table), because the subsetting on i is conducted prior to the looping over by. Whereas what we really want is the smallest row within each group.

3) This does work, but is hacky:

res = data.table()
for (mygrp in unique(mydt[,grp])) {
  smallest = mydt[grp == mygrp, which.min(V1)]
  res = rbind(res, mydt[grp == mygrp][smallest])
  }
#    grp site        V1
# 1:   A    r -1.966617
# 2:   B    f -1.686693
# 3:   C    q -1.548753
# 4:   D    l -2.309169
# 5:   E    t -1.026421

4) Is there a better way?

Is there a more canonical data.table style apporach to doing this?

I think `mydt[mydt[, .I[which.min(V1)], by=grp]$V1]` might be what you want. — Rich Scriven, Feb 08 '17 at 20:02
Thanks @RichScriven. That works. Post as answer? (perhaps with a little explanation about how it works). — dww, Feb 08 '17 at 20:06
This has actually been asked before. Explanation is in the linked duplicate. — Rich Scriven, Feb 08 '17 at 20:06

Any way to use `by` variable in data.table to operate on rows (rather than columns) by group

0 Answers0