1

I never came to any conclusions re: this question, so I thought I would rephrase it and ask again.

I would like to subsample my dataset 10,000 times to generate means and 95% CIs for each of my responses.

Here is an example of how the data set is structured:

x <- read.table(tc <- textConnection("
study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         3         B       1.2      2.9
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  2         2         A       1.9      0.3 
  3         1         A       0.2      0.5"), header = TRUE); close(tc)

I would like to subsample each study/variable combination only once. So, for example, the subsetted dataset would look like this:

study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  3         1         A       0.2      0.5

Notice rows 3 and 6 are gone, because both measured a variable twice (B in the first case, A in the second case).

I want to draw subsampled data sets again and again so I may derive overall means of value1 and value2 with 95% CIs for each variable. So the output I would like after the whole subsampling routine would be:

variable   mean_value1   lower_value1  upper_value1  mean_value2  etc....
   A            2.3           2.0          2.6           2.1
   B            2.5           2.0          3.0           2.5
   C            2.1           1.9          2.3           2.6

Here is some code I have to grab the subset:

 subsample<-function(x, B){
samps<-ddply(x, .(study,variable), nrow)[,3] #for each study/variable combination, 
                                                  #how many experiments are there
expIdx<-which(!duplicated(x$study)) #what is the first row of each study
n<-length(samps) #how many studies are there

sapply(1:B, function(a) { #use sapply for the looping, as it's more efficient than for
    idx<-floor(runif(n, rep(0,n), samps)) #get the experiment number-1 for each study
    x$value[idx+expIdx] #now get a vector of values
})

Any help is appreciated. I recognize this is complicated so please let me know if you need clarification!

jslefche
  • 4,379
  • 7
  • 39
  • 50
  • Providing a minimal reproducible example is encouraged. See : http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Joris Meys Jul 25 '11 at 14:50
  • 1
    Sorry, thanks for updating my question--still new to StackOverflow! – jslefche Jul 25 '11 at 14:55
  • @jslefche: There's [a decent blog article about writing questions](http://tinyurl.com/so-hints). – Lightness Races in Orbit Jul 25 '11 at 15:22
  • Thanks all, you will see from my next question that I took your comments to heart: http://stackoverflow.com/questions/6819047/r-ggplot2-offset-scatterplot-points – jslefche Jul 25 '11 at 16:07
  • Ok, so what you're describing isn't bootstrapping, which involves resampling your data (with replacement); what you're describing is random *subsetting* of your data. I can write some code that will do this, but are you sure it's what you want? – joran Jul 25 '11 at 19:26
  • @Joran--you're correct, I want to randomly subsample my data set over and over to get some estimate of the mean and variance (as 95% CIs). My statistical ignorance reigns its head again, as apparently the bootstrapping procedure can only be performed with replacement. – jslefche Jul 26 '11 at 13:26
  • My code still does exactly what you describe. – joran Aug 24 '11 at 19:13
  • Unfortunately, as you state, the use of ddply is really inefficient in this case, and so it takes >1 hr to run the script. I was probing to see if anyone had a more efficient solution. I meant no offense! – jslefche Aug 25 '11 at 13:06

2 Answers2

3

Split your data by Study, Experiment and Variable, then apply the bootstrap to each subset. There are many ways to do this, including:

sdfr <- with(dfr, split(dfr, list(Study, Experiment, Variable)))
sdfr <- Filter(nrow, sdfr)   #to remove empty data frames

lapply(sdfr, function(x) 
{
  boot(x$Response1, statistic = mean, R = 10000, sim = "parametric")
})
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
  • The OP's description of the bootstrapping is a little murky, but it's possible they meant to bootstrap over the response estimates across Experiments for each Variable *within* Study, in which case you'd simply omit Experiment when splitting. With the OP's example data I think this code will bootstrap a single value in each case. – joran Jul 25 '11 at 16:54
  • @jslefche - That doesn't sound like what Richie's code is doing. Additionally, this description of your bootstrapping procedure makes even less sense to me that the one in your OP. Perhaps you could edit your question to show us what you think the output ought to look like? – joran Jul 25 '11 at 18:52
  • @Joran--sorry for the confusion, I'm still a bit new at this. I edited my post above and tried to be as explicit as possible. Please let me know if you need any more information! – jslefche Jul 25 '11 at 19:08
2

Here's a solution, although fair warning, it's not going to scale terribly well and I'm unaware of the statistical validity of this kind of scheme:

#Replicate your example data
set.seed(1)
dat <- expand.grid(Study = 1:4,Experiment = 1:3, Response = LETTERS[1:4])
dat$Value1 <- runif(48)
dat$Value2 <- runif(48)

#Function to apply to each Response level
#Note the rather inefficient use of ddply 
# in a for loop to do the 'stratified' 
# subsampling you describe
myFun <- function(x,B){
    rs <- matrix(NA,B,2)
    for (i in 1:B){
        temp <- ddply(x,.(Study), .fun = function(x) x[sample(1:nrow(x),1),])
        rs[i,] <- colMeans(temp[,4:5])
    }
    c(Value1 = mean(x$Value1), quantile(rs[,1],probs=c(0.025,0.975)),
            Value2 = mean(x$Value2), quantile(rs[,2],probs=c(0.025,0.975)))
}

ddply(dat,.(Response),.fun = myFun,B=50)

Example output

  Response    Value1      2.5%     97.5%    Value2      2.5%     97.5%
1        A 0.4914725 0.2721876 0.8311799 0.4600546 0.2596446 0.6909686
2        B 0.5941457 0.4018281 0.8047503 0.5241470 0.2865285 0.7099486
3        C 0.4596998 0.2752685 0.6340614 0.5761497 0.3546133 0.8115933
4        D 0.5550651 0.2717772 0.7298913 0.4645609 0.1868757 0.7985816
joran
  • 169,992
  • 32
  • 429
  • 468