I am trying to do bootstrap resampling on a multilevel/hierarchical dataset. The observations are (unique) patients clustered within hospitals.
My strategy is to sample with replacement from the patients within each hospital in turn. This ensures that every hospital is represented in each bootstrap sample and that, when repeated, all the samples have the same size. This is method 2 here.
My code is like this:
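# hospitals to loop over (missing values dropped)
hv <- na.omit(unique(dt$hospital))
samp.out <- NULL
for (hosp in hv) {
  # all patients in this hospital
  ss1 <- dt[dt$hospital == hosp & !is.na(dt$hospital), ]
  # resample them with replacement, keeping the hospital's size
  ss2 <- ss1[sample(seq_len(nrow(ss1)), nrow(ss1), replace = TRUE), ]
  samp.out <- rbind(samp.out, ss2)
}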
This seems to work (though I would be grateful if anyone can point out a problem with it).
The issue is that it is slow, so I would like to know if there are ways to speed this up.
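For reference, here is a minimal sketch of the same within-hospital strategy done on row numbers rather than on data frames, so that `dt` is subset only once at the end. The toy `dt` below is invented purely for illustration, `rows.by.hosp` and `idx` are hypothetical names, and I have not checked how this behaves on the real data:

```r
# Toy data standing in for `dt` (hypothetical values; column names as above)
set.seed(1)
dt <- data.frame(unid = 1:10,
                 hospital = c("A", "A", "A", "B", "B", "B", "B", "C", "C", NA))

# Group row numbers by hospital; split() drops rows whose hospital is NA
rows.by.hosp <- split(seq_len(nrow(dt)), dt$hospital)

# Resample row numbers within each hospital, keeping each hospital's size.
# Indexing r via sample.int() avoids the sample(x) pitfall when length(x) == 1.
idx <- unlist(lapply(rows.by.hosp,
                     function(r) r[sample.int(length(r), replace = TRUE)]),
              use.names = FALSE)

# One subset at the end instead of rbind() inside a loop
samp.out <- dt[idx, ]
```

Each hospital keeps its original number of rows, and a patient drawn more than once appears more than once in `samp.out`.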
Update:
I have tried to implement Ari B. Friedman's answer but without success - so I have modified it slightly, with the aim of constructing a vector which then indexes the original dataframe. Here is my new code:
# this is a vector that will hold the sampled unique IDs
v.samp <- rep(NA, nrow(dt))
# entry to fill next
i <- 1
for (hosp in hv) {
  ss1 <- dt[dt$hospital == hosp & !is.na(dt$hospital), ]
  # column 1 contains a unique ID
  ss2 <- ss1[sample(seq_len(nrow(ss1)), nrow(ss1), replace = TRUE), 1]
  N.fill <- length(ss2)
  v.samp[seq(i, i + N.fill - 1)] <- ss2
  # update entry to fill next
  i <- i + N.fill
}
samp.out <- dt[dt$unid %in% v.samp,]
This is fast! But it does not work properly: the final line only tests whether each row's ID appears in v.samp, so every sampled patient is selected once, whereas the sampling is with replacement and v.samp contains repeated IDs. Any further help will be much appreciated.
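To make the problem concrete, here is a small illustration on made-up data (the three-row `dt` and the hand-written `v.samp` are hypothetical). It also shows `match()`, which returns one row position per element of `v.samp`. That might be one way to keep the repeats, assuming `unid` really is unique in `dt`:

```r
# Toy data frame with a unique ID column, as in the question
dt <- data.frame(unid = c(101, 102, 103), x = c("a", "b", "c"))

# IDs drawn with replacement: 102 appears twice
v.samp <- c(102, 102, 101)

# %in% gives one TRUE/FALSE per row of dt, so each row comes back at most once
by.in <- dt[dt$unid %in% v.samp, ]

# match() gives one row position per element of v.samp, so repeats are kept
by.match <- dt[match(v.samp, dt$unid), ]

nrow(by.in)     # 2 rows: the duplicate draw of 102 is lost
nrow(by.match)  # 3 rows: 102 appears twice
```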