
I want to draw clusters (defined by the variable id) with replacement from a dataset, and in contrast to previously answered questions, I want clusters that are chosen K times to have each observation repeated K times. That is, I'm doing cluster bootstrapping.

For example, the following draws id=1 twice, but the new dataset s contains the observations for id=1 only once. I want all observations from id=1 to appear twice.

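f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
# draw cluster ids with replacement (one draw per unique id)
new.ids <- sample(unique(f$id), replace=TRUE)
# %in% only tests membership, so a cluster drawn twice is still kept once
s <- f[f$id %in% new.ids, ]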
jay.sf
half-pass

2 Answers


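One option is to lapply over new.ids, extracting the rows for each sampled id into its own list element. Because each draw is handled separately, an id drawn twice contributes its rows twice. Then you can stack the list into one table: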

library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
#  id           X
#1:  1  1.20118333
#2:  1 -0.01280538
#3:  1  1.20118333
#4:  1 -0.01280538
#5:  3 -0.07302158
#6:  3 -1.26409125
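
If data.table is not available, the same idea can be sketched in base R by indexing rows instead of stacking data frames. This is only an illustration under assumptions: the X values are arbitrary placeholders, new.ids is hard-coded to the draw shown above (1, 1, 3), and the names rows_by_id and idx are made up for this example.

```r
# Toy data in the same shape as f (X values are arbitrary placeholders)
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = seq_len(7))
new.ids <- c(1, 1, 3)  # stand-in for sample(unique(f$id), replace = TRUE)

# Row numbers of f, grouped by cluster id (list names are "1", "2", "3")
rows_by_id <- split(seq_len(nrow(f)), f$id)

# Look up each drawn id by name; an id drawn twice yields its rows twice
idx <- unlist(rows_by_id[as.character(new.ids)], use.names = FALSE)
s <- f[idx, ]
s$id
# [1] 1 1 1 1 3 3
```

Building the row index first avoids copying a data frame per draw, which may matter when the bootstrap is repeated many times.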
jay.sf
Mike H.

In case you need a "new_id" that corresponds to the index of each draw (i.e. the sample order): I needed "new_id" so that I could fit mixed-effects models without several draws of the same cluster being treated as one cluster because they share the same id.

library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
                      new.ids,
                      seq_along(new.ids),
                      SIMPLIFY=FALSE
))
ss

which gives:

> ss
   id          X new_id
1:  1 -0.3491670      1
2:  1  1.3676636      1
3:  1 -0.3491670      2
4:  1  1.3676636      2
5:  3  0.9051575      3
6:  3 -0.5082386      3

Note that the values of X differ from those in the answer of @Mike H. because set.seed() is called after rnorm(), so X is not reproducible; the sampled ids are the same.
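
To make the reason for "new_id" concrete, here is a small base R sketch that counts clusters both ways. It is an illustration only: X holds arbitrary placeholder values, new.ids is hard-coded to the draw shown above, and Map() with do.call(rbind, ...) is a base R stand-in for the rbindlist(mapply(...)) call, not the code used in this answer.

```r
# Toy data in the same shape as f (X values are arbitrary placeholders)
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = seq_len(7))
new.ids <- c(1, 1, 3)  # the draw shown above, hard-coded for illustration

# One data frame per draw, tagged with the position of the draw
pieces <- Map(function(x, index) cbind(f[f$id == x, ], new_id = index),
              new.ids, seq_along(new.ids))
ss <- do.call(rbind, pieces)

length(unique(ss$id))      # 2: both draws of id 1 collapse into one cluster
length(unique(ss$new_id))  # 3: each draw counts as its own cluster
```

A model that groups on id would see two clusters here, while grouping on new_id gives three, matching the number of draws.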

This link was useful to me in constructing this answer: R lapply statement with index [duplicate]

swihart