How to bootstrap a function after taking a randomly drawn sample without replacement

Question

I have some code that lets me take two randomly drawn samples from a dataset, apply a function, and repeat the procedure a set number of times (see the code below, taken from the related question "How to bootstrap a function with replacement and return the output").

Example data:

> dput(a)
structure(list(index = 1:30, val = c(14L, 22L, 1L, 25L, 3L, 34L, 
35L, 36L, 24L, 35L, 33L, 31L, 30L, 30L, 29L, 28L, 26L, 12L, 41L, 
36L, 32L, 37L, 56L, 34L, 23L, 24L, 28L, 22L, 10L, 19L), id = c(1L, 
2L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 16L, 17L, 18L, 19L, 20L, 21L, 21L, 22L, 23L, 24L, 
25L)), .Names = c("index", "val", "id"), class = "data.frame", row.names = c(NA, 
-30L))

Code:

    library(plyr)
    extractDiff <- function(P){
      subA <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a random sample of 15 rows
      subB <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a second random sample of 15 rows
      meanA <- mean(subA$val)
      meanB <- mean(subB$val)
      diff <- abs(meanA-meanB)
      outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
      return(outdf)
    }

    set.seed(42)
    fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))

Rather than taking TWO randomly drawn samples of size 15, I would like to take one randomly drawn sample of size 15, then extract the remaining 15 rows in the dataset after the first random draw has been taken (i.e. subA would equal the first randomly drawn sample of 15 obs, subB would equal the remaining 15 obs after subA has been taken). I am really not sure how to go about doing this. Any help would be really appreciated. Thanks!

not sure I totally understand -- your `ddply` step reduces the number of rows from 30 (in `P`) to 25 (in `xA`). So does the 15-row-sample happen before that step? Or does it get replaced? — AndrewMacDonald, Jun 25 '14 at 18:20
@ Andrew: apologies - ddply should not have been in there. I have removed this. All I want to do is to take a random sample of 15 from the original data and store it as subA, then store the remaining 15 obs as subB within the extractDiff function. Many thanks! — jjulip, Jun 25 '14 at 18:50

score 1 · Answer 1 · answered Jun 25 '14 at 18:55

In that case, I would just shuffle up the row numbers of P (stored in index below) and then choose the first 15 for subA and the second 15 for subB:

library(plyr)
extractDiff <- function(P){
  index <- sample(seq_len(nrow(P)), replace = FALSE) # shuffles the row numbers of P
  subA <- P[index[1:15], ] # first 15 shuffled rows, i.e. a random sample of 15 rows
  subB <- P[index[16:30], ] # the remaining 15 rows (not a second independent sample)
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}

set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))
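
The hard-coded `1:15` and `16:30` assume that `P` has exactly 30 rows. As a rough sketch of the same shuffle idea with the sample size passed in, something like the following should work. Note that `extractDiffN`, its `n` argument, and the `toy` data frame are made-up names for illustration and are not part of the original code.

```r
# Hypothetical generalisation of extractDiff(); `n` is an assumed parameter
# giving the size of the first sample (defaults to half the rows).
extractDiffN <- function(P, n = floor(nrow(P) / 2)) {
  stopifnot(n >= 1, n < nrow(P))
  index <- sample(seq_len(nrow(P)))   # shuffle all row numbers
  subA <- P[index[seq_len(n)], ]      # first n shuffled rows
  subB <- P[index[-seq_len(n)], ]     # every remaining row
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

# Toy data standing in for `a` (assumed values, same column names)
toy <- data.frame(index = 1:30, val = (1:30) * 2)

set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiffN(toy), simplify = FALSE))
dim(fin)   # 10 rows (one per replicate), 3 columns
```

Because the two subsets together always contain every row exactly once, with equal-sized halves `mA + mB` is the same in every replicate (twice the overall mean).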

@ Andrew: Many thanks for the answer. Much appreciated. I like its simplicity. — jjulip, Jun 26 '14 at 09:26

Barker · Accepted Answer · 2014-06-25T19:51:05.023

I believe you can do this by making a small change to your code, like so:

extractDiff <- function(P){
  sampleset <- sample(nrow(P), 15, replace=FALSE) # randomly selects 15 row numbers, note replace=FALSE
  subA <- P[sampleset, ] # takes the 15 selected rows
  subB <- P[-sampleset, ] # takes the remaining rows in the set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}
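
To see that the negative index really returns the complement of the random draw, here is a small check on made-up data (the data frame `P` below is an assumed stand-in, not the asker's dataset):

```r
# Assumed toy data with the same column names as the question's data
P <- data.frame(index = 1:30, val = 101:130)

set.seed(1)
sampleset <- sample(nrow(P), 15, replace = FALSE)  # 15 distinct random row numbers
subA <- P[sampleset, ]    # the sampled rows
subB <- P[-sampleset, ]   # all rows NOT in sampleset

nrow(subA)                                   # 15
nrow(subB)                                   # 15
length(intersect(subA$index, subB$index))    # 0, the two subsets share no rows
```

This only works cleanly because `replace=FALSE` guarantees 15 distinct row numbers, so dropping them leaves exactly 15 rows.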

However, please note that this is not compatible with bootstrapping, as bootstrapping requires sampling with replacement. If, on the other hand, you want to sample with replacement from the data set, and then sample with replacement from the rows not selected in the first sampling, you could do the following.

extractDiff <- function(P){
  sampleset1 <- sample(nrow(P), 15, replace=TRUE) # randomly selects 15 row numbers, note replace=TRUE
  sampleset2 <- sample((1:nrow(P))[-unique(sampleset1)], 15, replace=TRUE) # selects only from rows not used in sampleset1
  subA <- P[sampleset1, ] # takes the 15 selected rows
  subB <- P[sampleset2, ] # takes the 15 rows selected from the remaining set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}

However, this still may not be ideal depending on your application, as the second dataset is more likely to have multiple instances of a value than the first, because it draws its 15 rows from a smaller pool (only the rows left over after the first draw). If you were selecting a smaller proportion of the total set, it would be much less of a problem. You may be better off dividing the set into two with a shuffle and sampling with replacement from both halves so the two sets are more even, but then the first set is again not a true bootstrap sample.
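
A minimal sketch of that last idea, assuming the split is redone on every call: shuffle the rows, cut them into two halves, then sample with replacement within each half. The names `extractDiffSplit`, `n`, and `toy` are made up for illustration.

```r
# Hypothetical variant: split by shuffling, then resample each half
# with replacement. `n` is the assumed size of the first half and of
# both resamples.
extractDiffSplit <- function(P, n = 15) {
  stopifnot(n >= 1, n < nrow(P))
  index <- sample(seq_len(nrow(P)))     # shuffle all row numbers
  halfA <- P[index[seq_len(n)], ]       # first half of the shuffle
  halfB <- P[index[-seq_len(n)], ]      # the rest
  subA <- halfA[sample(nrow(halfA), n, replace = TRUE), ]  # resample within half A
  subB <- halfB[sample(nrow(halfB), n, replace = TRUE), ]  # resample within half B
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

# Toy data standing in for `a` (assumed values)
toy <- data.frame(index = 1:30, val = (1:30) * 2)

set.seed(42)
finSplit <- do.call(rbind, replicate(10, extractDiffSplit(toy), simplify = FALSE))
```

Both resamples now come from pools of the same size, so neither is more prone to repeated rows than the other, and no row can appear in both `subA` and `subB`.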

@ Many thanks for your suggestions. Really helpful and appreciated. I didn't realise that sample() took the 'first' number of rows that you request (as suggested above). I thought it sampled randomly? Thanks. — jjulip, Jun 26 '14 at 09:24
Sample does select randomly. When I said "the first 15 rows" I meant it selects rows the first time, for the first data set. They are randomly selected. The key here is the use of replacement so that when you sample, you can take some elements twice. — Barker, Jun 26 '14 at 15:05
