1

I am attempting to do bootstrap resampling on a multilevel/hierarchical dataset. The observations are (unique) patients clustered within hospitals.

My strategy is to sample with replacement from the patients within each hospital in turn, which will ensure that all hospitals are represented in the sample and that when repeated all the samples sizes will be the same. This is method 2 here.

My code is like this:

hv <- na.omit(unique(dt$hospital))

samp.out <- NULL

for (hosp in hv ) {
    ss1 <- dt[dt$hospital==hosp & !is.na(dt$hospital),]
    ss2 <- ss1[sample(1:nrow(ss1),nrow(ss1), replace=T),]
    samp.out <- rbind(samp.out,ss2)
}

This seems to work (though if anyone can see any problem I would be grateful).

The issue is that it is slow, so I would like to know if there are ways to speed this up.

Update:

I have tried to implement Ari B. Friedman's answer but without success - so I have modified it slightly, with the aim of constructing a vector which then indexes the original dataframe. Here is my new code:

# this is a vector that will hold unique IDs
v.samp <- rep(NA, nrow(dt))

#entry to fill next
i <- 1

for (hosp in hv ) {
    ss1 <- dt[dt$hospital==hosp & !is.na(dt$hospital),]

    # column 1 contains a unique ID
    ss2 <- ss1[sample(1:nrow(ss1),nrow(ss1), replace=T),1]
    N.fill <- length(ss2)
    v.samp[ seq(i,i+N.fill-1) ] <- ss2

    # update entry to fill next
    i <- i + N.fill
}

samp.out <- dt[dt$unid %in% v.samp,]

This is fast ! BUT, it fails to work properly because it only selects the unique IDs of v.samp in the final line, but the sampling is with replacement so there are repeated IDs in v.samp. Any further help will be much appreciated

Joe King
  • 2,955
  • 7
  • 29
  • 43
  • Try `samp.out <- dt[v.samp,]` instead. – Ari B. Friedman Oct 20 '12 at 08:40
  • @AriB.Friedman Thanks, but I don't really understand why that would work ? The IDs are essentially random numbers (anonymised patient IDs) rather than consecutive row numbers. I have tried it anyway, and after 10 mins it returned the correct number of rows, but the entire dataframe contained NA everywhere. – Joe King Oct 20 '12 at 08:52

1 Answers1

2

The usual trick to speeding up bootstrapping is to draw the whole sample (all replicates) for each hospital at once, then assign them to replicates. That way you only run ss1<- once per hospital. You can likely improve on that by not subsetting for each hospital. Another huge win might come from pre-allocating rather than rbinding. More suggestions on speed improvements.

To re-allocate, calculate how many entries you need (call it N.out). Then, just before your loop, add:

samp.out <- rep(NA, N.out)

And replace your rbind line with:

samp.out[ seq(i,i+N.iter) ] <- ss2

Where i is your calculation of the first entry not yet filled, and i+N.iter is the last entry you have data to fill on this round.

See the R Inferno for more details and tricks.

Update

You have two approaches and you're mixing them. You can either make v.samp a data.frame and just sample all the rows into it in real-time, or you can sample IDs, and then select a data.frame using the vector of IDs outside of the loop. The key to the latter is that myDF[c(1,1,5,2,3),] will give you a data.frame which repeats the first row--exactly what you want, and exactly what that feature was designed for. Make sure v.samp is an ID that you can select from a data.frame on (either a row number or a row name), then select outside the loop.

Community
  • 1
  • 1
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • Thank you ! I may be misunderstanding, but my code _was_ intended to run `ss1 <- ` once per hospital. Also can you give me a clue how I can pre-allocate `samp.out` (this sounds promising). – Joe King Oct 19 '12 at 22:25
  • Thanks for your updated answer. I tried to implement your suggestion but it didn't quite work, maybe because `ss2` is an extract of a data frame, so the replacement line gave the error about replacement length differing (also I think it should be `seq(i,i+N.iter-1)`. I then had the idea to use your suggesting to construct just a vector which is then used to index into the original data frame but I can't quite get that to work. I'm going to update the question with my new code now, so if you could look at that I would be very grateful ! – Joe King Oct 20 '12 at 08:00
  • Thanks for your new update. I have now added a new (sequential) row ID with `dt$ID <- seq(1:nrow(dt))` and then `samp.out <- dt[v.samp,]` seems to work great. 25 secs now, instead of 25 mins :) – Joe King Oct 20 '12 at 09:26
  • 60-fold speed improvement. Darn, didn't make two full orders of magnitude ;-) – Ari B. Friedman Oct 20 '12 at 09:29