
This post discusses a routine for sampling with different percentages by group.

But what if you just want to sample, say, 50% without replacement by group? And what if you want to sample 50% with replacement by group?

With dplyr, you can use sample_frac for this. What about data.table?
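For reference, the dplyr version being referred to might look like this (a sketch, using a small made-up data frame; in newer dplyr versions sample_frac has been superseded by slice_sample):

```r
library(dplyr)

df <- data.frame(a = rep(1:2, 10), b = sample(1:1000, 20))

# 50% of each group, without replacement
df %>% group_by(a) %>% sample_frac(0.5)

# 50% of each group, with replacement (duplicates possible)
df %>% group_by(a) %>% sample_frac(0.5, replace = TRUE)
```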

divibisan
Union find

2 Answers


You could use sample with .N to take a proportion of each group. Use replace = TRUE to sample with replacement (the default is FALSE):

library(data.table)

DT = data.table(a = sample(1:2), b = sample(1:1000, 20))  # a is recycled to length 20
DT[, .SD[sample(.N, floor(.5 * .N))], by = a]

#     a   b
#  1: 2 552
#  2: 2 246
#  3: 2 979
#  4: 2 611
#  5: 2 469
#  6: 1 703
#  7: 1 909
#  8: 1 274
#  9: 1 279
# 10: 1 316

A faster alternative is (taken from @akrun):

DT[DT[, .I[sample(.N, floor(0.5 * .N))], by = a]$V1]
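The answer mentions replace = TRUE but does not show it; a sketch of the with-replacement variant of the same indexing idiom (with replacement, the sampled rows may contain duplicates):

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rep(1:2, 10), b = sample(1:1000, 20))

# 50% of each group, sampled with replacement (duplicate rows possible)
res <- DT[DT[, .I[sample(.N, floor(0.5 * .N), replace = TRUE)], by = a]$V1]
```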
Maël
  • Are there any optimizations that can be done to this to avoid or improve on .SD? For example: https://stackoverflow.com/questions/15273491/r-data-table-slow-aggregation-when-using-sd – Union find Feb 21 '23 at 17:32
  • I will rely on this function very very heavily.. so think tens of thousands of group resamples in simulations. – Union find Feb 21 '23 at 17:32

If the group ordering of the data.table to be sampled remains stable throughout the simulation, pre-calculating the indices more than doubles the speed for thousands of replications.

library(data.table)

dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))

system.time(for (i in 1:1e4) dt[dt[, .I[sample(.N, .N%/%2)], A][[2]]])
#>    user  system elapsed 
#>    4.83    0.23    5.06
system.time({
  idx <- dt[,.(.(.I)), A][[2]]
  for (i in 1:1e4) dt[unlist(lapply(idx, function(x) sample(x, length(x)%/%2)))]
})
#>    user  system elapsed 
#>    1.78    0.13    1.90
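If many simulations share the same grouping, the pre-computed indices can be packaged in a small closure so the per-group index lists are built once and reused on every draw. This is a sketch; make_group_sampler is a hypothetical helper name, and it is only valid while the data.table's row order stays unchanged:

```r
library(data.table)

# Build a sampler that reuses pre-computed per-group row indices.
# Each call draws a fresh `frac` within-group sample without replacement.
make_group_sampler <- function(dt, by, frac = 0.5) {
  idx <- dt[, .(.(.I)), by = by][[2]]  # list of row-index vectors, one per group
  function() {
    dt[unlist(lapply(idx, function(x) sample(x, floor(length(x) * frac))))]
  }
}

dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))
draw <- make_group_sampler(dt, "A")
s <- draw()  # one 50% within-group sample; call draw() again for the next replication
```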
jblood94