I have a data.table from which I want to select a random subset, but only for some operations.

Suppose the data is

dat <- data.table(id=1:100, group=sample(1:20,100, replace=TRUE), a=runif(100), b=rnorm(100))

and I want to do two things:

  1. count the number of ids per group
  2. select from each group one id at random and record its value on a and b

I could follow "How do you extract a few random rows from a data.table on the fly" and use

dat[, .(n=.N, a=a[sample(.N,1)], b=b[sample(.N,1)]), group]

but I am afraid this will select a and b independently of one another. Is there a way to take both values from the same randomly selected row?

bumblebee
  • use {} in j to do multiple expressions -- first, select an index by sample()ing from .I, then apply this random index to both vectors – MichaelChirico Jun 25 '19 at 19:26
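The suggestion in this comment could be sketched as follows. This is only an illustration under some assumptions: the seed is arbitrary, and the index is drawn with sample(.N, 1) rather than from .I, because inside a grouped j the vectors a and b contain only the current group's values, so the index has to be a position within the group.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dat <- data.table(id = 1:100, group = sample(1:20, 100, replace = TRUE),
                  a = runif(100), b = rnorm(100))

res <- dat[, {
  i <- sample(.N, 1)  # one random position within the current group
  # reuse the same position for every column so the values come from one row
  .(n = .N, id = id[i], a = a[i], b = b[i])
}, by = group]

# In this example id equals the row number of dat, so dat[res$id] gives the
# original rows that were picked; a and b should match them exactly.
same_row <- identical(dat[res$id, a], res$a) && identical(dat[res$id, b], res$b)
```

Because i is drawn once per group and reused, a and b are no longer sampled independently.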

1 Answer

Part 1

If some ids repeat within groups and you want to count the number of unique ids per group:

dat[, .(n_ids = uniqueN(id)), group]

If ids don't repeat within groups, or you want to count every row rather than unique ids:

dat[, .(n_ids = .N), group]

Part 2

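If ids repeat within groups and you want to return all rows for the randomly selected id in each group:

# id[sample(.N, 1)] instead of sample(id, 1): sample() treats a single number x as 1:x,
# which would give a wrong id for groups that contain only one row
dat[dat[, .(id = id[sample(.N, 1)]), group], on = .(id, group)]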

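If ids do not repeat, or you only want one row per group anyway:

# .I[sample(.N, 1)] instead of sample(.I, 1): for a one-row group, sample(.I, 1)
# would draw from 1:.I and could return a row from a different group
dat[dat[, .I[sample(.N, 1)], group]$V1]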

Thanks to Frank's comment, you can also do the second option for parts 1 and 2 above in one line. It returns one randomly chosen row per group, like the previous snippet, but also adds a column N with the number of rows in the group (assumed here to equal the number of ids).

dat[sample(.N), c(.SD[1], .N), keyby=group]
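To see why the one-liner works, it may help to split it into two steps. This is a rough sketch, not a replacement for the line above: the seed and the names shuffled and res are made up for illustration, and the count column is named N explicitly.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dat <- data.table(id = 1:100, group = sample(1:20, 100, replace = TRUE),
                  a = runif(100), b = rnorm(100))

# Step 1: sample(.N) in i is a random permutation of all row numbers,
# so this shuffles the whole table.
shuffled <- dat[sample(.N)]

# Step 2: within each group of the shuffled table, the first row is a
# uniformly random row of that group; .N is still the group size.
res <- shuffled[, c(.SD[1], .(N = .N)), keyby = group]

# In this example id equals the row number of dat, so the picked rows can be
# compared with the originals, and N with a plain per-group count.
rows_ok   <- identical(dat[res$id, a], res$a) && identical(dat[res$id, b], res$b)
counts_ok <- identical(res$N, dat[, .N, keyby = group]$N)
```

Since a whole row is taken from .SD, the values of a and b always belong together.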
IceCreamToucan