Subset by column criteria AND randomly sample rows of a data.table

Question

@gented's answer here demonstrates how to randomly select a subset of rows from a data.table.

What if I wanted to select all rows in a data.table for which the values in a certain column meet a specific condition, AND ADDITIONALLY select a random subset of rows from the data.table for which the values in the same column meet a different condition?

Say, for example, that I wanted a random sample of 5 rows from the mtcars data.table for which cyl == 6, and all rows for which cyl == 8.

Is this achievable in a better way than:

rbind(
    mtcars[ cyl == 8 ],
    mtcars[ cyl == 6 ][ sample(.N, 5) ]
)

That is, can I subset the data.table in a single set of []'s so that I could also, for example, apply a function within that call (in the lapply(.SD, function) format)?

This obviously does not achieve the desired result, but is similar to the syntax I'm looking for:

mtcars[ 
    cyl == 8 | ( cyl == 6 & sample( .N, 5 ) ), 
    lapply(.SD, generic_funciton), 
    .SDcols = (specific_cols) 
]

Jaap · Accepted Answer · 2019-07-06T12:56:29.470

To achieve that, I would utilize the .I special symbol as follows:

DT <- as.data.table(mtcars)

DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]

Now you can do some computations:

set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
   , lapply(.SD, mean)
   , by = am
   , .SDcols = 3:5]

which gives:

   am   disp       hp     drat
1:  0 325.64 179.0667 3.224667
2:  1 243.00 204.7500 3.890000

If you want to reuse that index vector at a later moment, you can store it beforehand:

idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))

DT[idx, lapply(.SD, mean), .SDcols = 3:5]

Alexis · Answer 2 · 2019-07-04T19:14:10.393

As long as i ends up with something that can be used to select rows, you can put any valid expression there, which technically means you can write:

DT[c(sample(which(cyl == 6), 5L), which(cyl == 8))]

But that probably won't benefit from optimizations.

Based on this answer (and secondary indices), I would think something like this would be a lot faster:

sample_if <- function(condition, values, n) {
  if (condition)
    sample(values, n)
  else
    values
}

some_fun <- function(.SD) {
  .SD
}

DT[DT[.(c(6, 8)), sample_if(.BY$cyl == 6, .I, 5L), by = "cyl", on = "cyl"]$V1,
   some_fun(.SD),
   .SDcols = c("cyl", "mpg")]
    cyl  mpg
 1:   6 19.7
 2:   6 19.2
 3:   6 21.4
 4:   6 21.0
 5:   6 18.1
 6:   8 18.7
 7:   8 14.3
 8:   8 16.4
 9:   8 17.3
10:   8 15.2
11:   8 10.4
12:   8 10.4
13:   8 14.7
14:   8 15.5
15:   8 15.2
16:   8 13.3
17:   8 19.2
18:   8 15.8
19:   8 15.0

Subset by column criteria AND randomly sample rows of a data.table

2 Answers2