Do you have a faster way to resample (with replacement) groups in a dataset using R?

Edit: Note that I would like to resample groups of rows, not individual rows.

library(dplyr)  # provides n_distinct() and left_join() used below

toydata <- data.frame(
  group = rep(letters[1:3], each = 2),
  rep   = rep(1:2, times = 3),
  value = 1:6)
print(toydata)

  group rep value
1     a   1     1
2     a   2     2
3     b   1     3
4     b   2     4
5     c   1     5
6     c   2     6

ngroups <- n_distinct(toydata$group)
nreps   <- nrow(toydata) / ngroups

s <- sample(unique(toydata$group), replace = TRUE)  # resampling groups with replacement
toydata_resampled <- left_join(
  x  = data.frame(group = rep(s, each = nreps), rep = rep(1:nreps, ngroups)),
  y  = toydata,
  by = c("group", "rep"))

One expected output:

> print(toydata_resampled)
  group rep value
1     a   1     1
2     a   2     2
3     a   1     1
4     a   2     2
5     c   1     5
6     c   2     6
Marco
  • I am not sure if the question you asked is similar to [the post](https://stackoverflow.com/questions/8273313/sample-random-rows-in-dataframe) – Denny Chen Feb 24 '22 at 08:39
  • It is similar, except that I would like to resample groups of rows, not individual rows. – Marco Feb 24 '22 at 09:03
  • Is the second output a possible expected output? Why is your code not enough? – Maël Feb 24 '22 at 09:20
  • Yes, it is one expected output. My code works, but it is pretty slow (this resampling step will be repeated many times in my pipeline). – Marco Feb 24 '22 at 09:23

2 Answers

Split your data frame by group, sample the resulting list with replacement, and bind the pieces back into a single data.frame.

set.seed(1)
do.call(rbind, sample(split(toydata, toydata$group), replace = TRUE))

output

     group rep value
a.1      a   1     1
a.2      a   2     2
c.5      c   1     5
c.6      c   2     6
a.11     a   1     1
a.21     a   2     2
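
Since the resampling step is repeated many times, one possible variant is to split the row indices once, outside the loop, and then draw groups and subset the data frame in a single step per replicate. This is a minimal base-R sketch, not a benchmarked result; the helper name `resample_groups` and the variable `idx_by_group` are made up for illustration.

```r
toydata <- data.frame(
  group = rep(letters[1:3], each = 2),
  rep   = rep(1:2, times = 3),
  value = 1:6)

# Row indices per group, computed once and reused for every replicate
idx_by_group <- split(seq_len(nrow(toydata)), toydata$group)

# Hypothetical helper: draw groups with replacement, then subset once
resample_groups <- function(data, idx) {
  picked <- sample(length(idx), replace = TRUE)
  data[unlist(idx[picked], use.names = FALSE), , drop = FALSE]
}

set.seed(1)
toydata_resampled <- resample_groups(toydata, idx_by_group)
```

Each later replicate only needs another call to `resample_groups(toydata, idx_by_group)`, so the grouping work is not repeated. Whether this is actually faster on real data would need to be timed.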
Maël

Is this faster?

iter <- 10                      # number of groups to draw
uq <- unique(toydata$group)     # candidate groups, computed once
for (i in seq_len(iter)) {
  # draw one group at random and keep all of its rows
  picked <- subset(toydata, group == sample(uq, 1))
  output <- if (i == 1) picked else rbind(output, picked)
}

output

> output
    group rep value
 1:     a   1     1
 2:     a   2     2
 3:     b   1     3
 4:     b   2     4
 5:     c   1     5
 6:     c   2     6
 7:     c   1     5
 8:     c   2     6
 9:     a   1     1
10:     a   2     2
11:     a   1     1
12:     a   2     2
13:     c   1     5
14:     c   2     6
15:     b   1     3
16:     b   2     4
17:     c   1     5
18:     c   2     6
19:     a   1     1
20:     a   2     2
21:     c   1     5
22:     c   2     6
23:     b   1     3
24:     b   2     4
25:     a   1     1
26:     a   2     2
    group rep value
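
Growing a data frame with `rbind` inside a loop copies the accumulated rows on every iteration. A possible variant of the loop above collects the drawn groups in a preallocated list and binds them once at the end. This is a minimal sketch on the same toy data; the name `pieces` is illustrative, and no speed-up has been measured here.

```r
toydata <- data.frame(
  group = rep(letters[1:3], each = 2),
  rep   = rep(1:2, times = 3),
  value = 1:6)

iter <- 10
uq <- unique(toydata$group)

# Preallocate one list slot per draw instead of growing a data frame
pieces <- vector("list", iter)
for (i in seq_len(iter)) {
  g <- sample(uq, 1)                            # one group drawn at random
  pieces[[i]] <- toydata[toydata$group == g, ]  # all rows of that group
}

# Bind all pieces in a single call
output <- do.call(rbind, pieces)
```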
Denny Chen
  • I am not sure whether it is faster than `do.call` with `rbind`, but I think @Maël's answer will be faster. I am offering this as an alternative. – Denny Chen Feb 24 '22 at 09:47