
This post discusses a routine for sampling with different percentages by group.

But what if you just want to sample, say, 50% without replacement by group? And what if you want to sample 50% with replacement by group?

With dplyr, you can use sample_frac for this. What about data.table?
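For reference, the dplyr version being referred to might look like this (a sketch, using a small made-up data frame; in newer dplyr versions sample_frac has been superseded by slice_sample):

```r
library(dplyr)

df <- data.frame(a = rep(1:2, 10), b = sample(1:1000, 20))

# 50% of each group, without replacement
df %>% group_by(a) %>% sample_frac(0.5)

# 50% of each group, with replacement (duplicates possible)
df %>% group_by(a) %>% sample_frac(0.5, replace = TRUE)
```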

divibisan
Union find

2 Answers


You could use sample with .N to take a proportion of each group. Use replace = TRUE to sample with replacement (the default is FALSE):

library(data.table)

DT = data.table(a = sample(1:2), b = sample(1:1000, 20))  # a is recycled to length 20
DT[, .SD[sample(.N, floor(.5 * .N))], by = a]

#     a   b
#  1: 2 552
#  2: 2 246
#  3: 2 979
#  4: 2 611
#  5: 2 469
#  6: 1 703
#  7: 1 909
#  8: 1 274
#  9: 1 279
# 10: 1 316

A faster alternative is (taken from @akrun):

DT[DT[, .I[sample(.N, floor(0.5 * .N))], by = a]$V1]
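The answer mentions replace = TRUE but does not show it; a sketch of the with-replacement variant of the same indexing idiom (with replacement, the sampled rows may contain duplicates):

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rep(1:2, 10), b = sample(1:1000, 20))

# 50% of each group, sampled with replacement (duplicate rows possible)
res <- DT[DT[, .I[sample(.N, floor(0.5 * .N), replace = TRUE)], by = a]$V1]
```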
Maël
  • Are there any optimizations that can be done to this to avoid or improve on .SD? For example: https://stackoverflow.com/questions/15273491/r-data-table-slow-aggregation-when-using-sd – Union find Feb 21 '23 at 17:32
  • I will rely on this function very very heavily.. so think tens of thousands of group resamples in simulations. – Union find Feb 21 '23 at 17:32

If the group ordering of the data.table to be sampled remains stable throughout the simulation, pre-calculating the indices more than doubles the speed for thousands of replications.

library(data.table)

dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))

system.time(for (i in 1:1e4) dt[dt[, .I[sample(.N, .N%/%2)], A][[2]]])
#>    user  system elapsed 
#>    4.83    0.23    5.06
system.time({
  idx <- dt[,.(.(.I)), A][[2]]
  for (i in 1:1e4) dt[unlist(lapply(idx, function(x) sample(x, length(x)%/%2)))]
})
#>    user  system elapsed 
#>    1.78    0.13    1.90
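If many simulations share the same grouping, the pre-computed indices can be packaged in a small closure so the per-group index lists are built once and reused on every draw. This is a sketch; make_group_sampler is a hypothetical helper name, and it is only valid while the data.table's row order stays unchanged:

```r
library(data.table)

# Build a sampler that reuses pre-computed per-group row indices.
# Each call draws a fresh `frac` within-group sample without replacement.
make_group_sampler <- function(dt, by, frac = 0.5) {
  idx <- dt[, .(.(.I)), by = by][[2]]  # list of row-index vectors, one per group
  function() {
    dt[unlist(lapply(idx, function(x) sample(x, floor(length(x) * frac))))]
  }
}

dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))
draw <- make_group_sampler(dt, "A")
s <- draw()  # one 50% within-group sample; call draw() again for the next replication
```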
jblood94