Randomly subset data set multiple times and calculate means and variances

Question

I never came to any conclusions re: this question, so I thought I would rephrase it and ask again.

I would like to subsample my dataset 10,000 times to generate means and 95% CIs for each of my responses.

Here is an example of how the data set is structured:

x <- read.table(tc <- textConnection("
study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         3         B       1.2      2.9
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  2         2         A       1.9      0.3 
  3         1         A       0.2      0.5"), header = TRUE); close(tc)

I would like to subsample each study/variable combination only once. So, for example, the subsetted dataset would look like this:

study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  3         1         A       0.2      0.5

Notice rows 3 and 6 are gone, because both measured a variable twice (B in the first case, A in the second case).

I want to draw subsampled data sets again and again so I may derive overall means of value1 and value2 with 95% CIs for each variable. So the output I would like after the whole subsampling routine would be:

variable   mean_value1   lower_value1  upper_value1  mean_value2  etc....
   A            2.3           2.0          2.6           2.1
   B            2.5           2.0          3.0           2.5
   C            2.1           1.9          2.3           2.6

Here is some code I have to grab the subset:

 subsample<-function(x, B){
samps<-ddply(x, .(study,variable), nrow)[,3] #for each study/variable combination, 
                                                  #how many experiments are there
expIdx<-which(!duplicated(x$study)) #what is the first row of each study
n<-length(samps) #how many studies are there

sapply(1:B, function(a) { #use sapply for the looping, as it's more efficient than for
    idx<-floor(runif(n, rep(0,n), samps)) #get the experiment number-1 for each study
    x$value[idx+expIdx] #now get a vector of values
})

Any help is appreciated. I recognize this is complicated so please let me know if you need clarification!

Providing a minimal reproducible example is encouraged. See : http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Joris Meys, Jul 25 '11 at 14:50
Sorry, thanks for updating my question--still new to StackOverflow! — jslefche, Jul 25 '11 at 14:55
@jslefche: There's [a decent blog article about writing questions](http://tinyurl.com/so-hints). — Lightness Races in Orbit, Jul 25 '11 at 15:22
Thanks all, you will see from my next question that I took your comments to heart: http://stackoverflow.com/questions/6819047/r-ggplot2-offset-scatterplot-points — jslefche, Jul 25 '11 at 16:07
Ok, so what you're describing isn't bootstrapping, which involves resampling your data (with replacement); what you're describing is random *subsetting* of your data. I can write some code that will do this, but are you sure it's what you want? — joran, Jul 25 '11 at 19:26
@Joran--you're correct, I want to randomly subsample my data set over and over to get some estimate of the mean and variance (as 95% CIs). My statistical ignorance reigns its head again, as apparently the bootstrapping procedure can only be performed with replacement. — jslefche, Jul 26 '11 at 13:26
Unfortunately, as you state, the use of ddply is really inefficient in this case, and so it takes >1 hr to run the script. I was probing to see if anyone had a more efficient solution. I meant no offense! — jslefche, Aug 25 '11 at 13:06

score 3 · Answer 1 · answered Jul 25 '11 at 15:12

3

Split your data by Study, Experiment and Variable, then apply the bootstrap to each subset. There are many ways to do this, including:

sdfr <- with(dfr, split(dfr, list(Study, Experiment, Variable)))
sdfr <- Filter(nrow, sdfr)   #to remove empty data frames

lapply(sdfr, function(x) 
{
  boot(x$Response1, statistic = mean, R = 10000, sim = "parametric")
})

answered Jul 25 '11 at 15:12

Richie Cotton

118,240
47
247
360

The OP's description of the bootstrapping is a little murky, but it's possible they meant to bootstrap over the response estimates across Experiments for each Variable *within* Study, in which case you'd simply omit Experiment when splitting. With the OP's example data I think this code will bootstrap a single value in each case. – joran Jul 25 '11 at 16:54
@jslefche - That doesn't sound like what Richie's code is doing. Additionally, this description of your bootstrapping procedure makes even less sense to me that the one in your OP. Perhaps you could edit your question to show us what you think the output ought to look like? – joran Jul 25 '11 at 18:52
@Joran--sorry for the confusion, I'm still a bit new at this. I edited my post above and tried to be as explicit as possible. Please let me know if you need any more information! – jslefche Jul 25 '11 at 19:08

score 2 · Answer 2 · answered Jul 26 '11 at 14:53

Here's a solution, although fair warning, it's not going to scale terribly well and I'm unaware of the statistical validity of this kind of scheme:

#Replicate your example data
set.seed(1)
dat <- expand.grid(Study = 1:4,Experiment = 1:3, Response = LETTERS[1:4])
dat$Value1 <- runif(48)
dat$Value2 <- runif(48)

#Function to apply to each Response level
#Note the rather inefficient use of ddply 
# in a for loop to do the 'stratified' 
# subsampling you describe
myFun <- function(x,B){
    rs <- matrix(NA,B,2)
    for (i in 1:B){
        temp <- ddply(x,.(Study), .fun = function(x) x[sample(1:nrow(x),1),])
        rs[i,] <- colMeans(temp[,4:5])
    }
    c(Value1 = mean(x$Value1), quantile(rs[,1],probs=c(0.025,0.975)),
            Value2 = mean(x$Value2), quantile(rs[,2],probs=c(0.025,0.975)))
}

ddply(dat,.(Response),.fun = myFun,B=50)

Example output

  Response    Value1      2.5%     97.5%    Value2      2.5%     97.5%
1        A 0.4914725 0.2721876 0.8311799 0.4600546 0.2596446 0.6909686
2        B 0.5941457 0.4018281 0.8047503 0.5241470 0.2865285 0.7099486
3        C 0.4596998 0.2752685 0.6340614 0.5761497 0.3546133 0.8115933
4        D 0.5550651 0.2717772 0.7298913 0.4645609 0.1868757 0.7985816

Unfortunately this appears to still select studies randomly from the data set. — jslefche, Aug 01 '11 at 13:01
@jslefche - That is incorrect. There is no random selection by Study in this code. — joran, Aug 01 '11 at 13:46

Randomly subset data set multiple times and calculate means and variances

2 Answers2