I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:

m=10 I get .17 sec

m=100 I get 3.95 sec

m=1000 I get 246.26 sec

m=2000 I get 1003.55 sec

I was hoping to set the number of times to sample, m, upwards of 100K, but I'm afraid to even set this at 10K.

Here is the code:

m = 1000                        # number of times we are going to  take samples
popmean=120                     # set population mean at 120
popvar=225                      # set known/established population variance at 225
newvar=144                      # variance of new methodology
alpha=.01                       # set alpha
teststatvect = matrix(nrow=m,ncol=1)    # empty vector to populate with test statistics
power = matrix(nrow=200,ncol=1)     # empty vector to populate with power

system.time(                    # not needed - using to gauge how long this takes
    for (n in 1:length(power)) {        # begin for loop for different sample sizes
      for(i in 1:m){                # begin for loop to take "m" samples
      y=rnorm(n,popmean,sqrt(newvar))   # sample of size n with mean 120 and var=144
      ts=sum((y-popmean)^2/popvar)      # calculate test statistic for each sample
      teststatvect[i]=ts            # loop and populate the vector to hold test statistics
      vecpvals=pchisq(teststatvect,n)   # calculate the pval of each statistic
      power[n]=length(which(vecpvals<=alpha))/length(vecpvals) # populate power vector; power is the proportion less than or equal to alpha
        }
    }
 )
Will Phillips
  • [Some suggestions](http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941). In general, you want to vectorize. – Ari B. Friedman Oct 22 '12 at 22:42
  • The code doesn't run, i think you are missing a `{` after the first `for`. – Ryogi Oct 22 '12 at 22:50
  • You are computing `vecpvals` and `power` for each value of `i`; take those calculations outside your inner loop since they only need to be done one per outer loop. – Brian Diggs Oct 22 '12 at 23:34
  • you're right. just a copy-n-paste problem. It was in my original code in R. Thank you for pointing it out – Will Phillips Oct 22 '12 at 23:45
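
The restructuring the comments suggest can be sketched like this (parameters copied from the question; still loop-based, just with the p-value and power calculations hoisted out of the inner loop so they run once per sample size instead of once per sample):

```r
## Sketch of the comment suggestions, using the question's parameters
m <- 1000; popmean <- 120; popvar <- 225; newvar <- 144; alpha <- 0.01
sdnew <- sqrt(newvar)                 # square root computed once, not m times
power <- numeric(200)
teststatvect <- numeric(m)
for (n in 1:length(power)) {
  for (i in 1:m) {
    y <- rnorm(n, popmean, sdnew)
    teststatvect[i] <- sum((y - popmean)^2) / popvar
  }
  vecpvals <- pchisq(teststatvect, n)  # once per n, not once per i
  power[n] <- mean(vecpvals <= alpha)  # proportion of p-values <= alpha
}
```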

2 Answers

I reorganized your code a bit and got rid of the inner loop.

  • Sampling one long vector of random numbers (and then collapsing it into a matrix) is much faster than repeatedly sampling short vectors (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling random numbers in a block)
  • colSums is faster than summing inside a for loop or using apply.
  • it's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals)
  • I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).

Code:

powfun <- function(ssize=100,
                   m=1000,      ## samples per trial
                   popmean=120, ## pop mean
                   popvar=225,  ## known/established pop variance
                   newvar=144,  ## variance of new methodology
                   alpha=0.01,
                   sampchisq=FALSE)  ## sample directly from chi-squared distrib?
{
    if (!sampchisq) {
      ymat <- matrix(rnorm(ssize*m,popmean,sd=sqrt(newvar)),ncol=m)
      ts <- colSums((ymat-popmean)^2/popvar)          ## test statistic
    } else {
      ts <- rchisq(m,df=ssize)*newvar/popvar
    }
    pvals <- pchisq(ts,df=ssize)                    ## pval
    mean(pvals<=alpha)                              ## power
}

Do you really need the power for every integer value of sample size, or would a more widely spaced sample be OK (if you need exact values, interpolation would probably be pretty accurate)?

ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000))  ## 13 secs elapsed

This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...
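
To recover power at every integer sample size from a coarse grid, you can invert the power curve by interpolation. A sketch (the analytical curve below is a stand-in for the simulated powvec, so the block is self-contained; the 80% target is just an example):

```r
## Find the sample size that reaches 80% power by inverse interpolation
## of a coarse power curve (analytical stand-in for the simulated values)
ssizevec <- seq(10, 250, by = 5)
powvec <- pchisq(qchisq(0.01, df = ssizevec) * 225 / 144, df = ssizevec)
n80 <- approx(x = powvec, y = ssizevec, xout = 0.8)$y  # linear inverse interpolation
n80
```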

If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)

Finally, I think you can speed the whole thing up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...

system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed

(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
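
The analytical calculation hinted at above can be written down in closed form: under the alternative the statistic is (newvar/popvar) times a chi-squared variable with n degrees of freedom, so the rejection probability needs no simulation at all. This derivation is mine, so treat it as a sketch:

```r
## Exact power: reject when pchisq(ts, n) <= alpha, i.e. when the
## chi-squared_n variable falls below qchisq(alpha, n) * popvar / newvar
powfun_exact <- function(ssize, popvar = 225, newvar = 144, alpha = 0.01) {
  pchisq(qchisq(alpha, df = ssize) * popvar / newvar, df = ssize)
}
powfun_exact(100)  # exact power at n = 100 with the question's parameters
```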

A picture:

par(bty="l",las=1)
plot(ssizevec,powvec,type="l",xlab="sample size",ylab="power",
     xlim=c(0,250),ylim=c(0,1))
lines(ssizevec,powvec2,col="red")

(Figure: power versus sample size; the black curve is the simulated power, the red curve uses direct chi-squared sampling.)

Ben Bolker
  • Wow, 5000 in 13 seconds. Nice. As I mentioned in my original post, I'm new to R and programming in general. I'll need to digest what you've written but I think I'm following what you say. I did want all the points just so I can easily tell at what point power reaches 80%, 90% etc... – Will Phillips Oct 22 '12 at 23:56
  • Well, it's 5000 in 0.24 seconds if you sample directly with `rchisq` ... to do interpolation and inversion, try `library(splines); ?interpSpline; ?predict` ... – Ben Bolker Oct 23 '12 at 00:02
  • Where you have sapply(ssizevec,powfun,m=5000) is ssizevec being used in the powfun function as "ssize"? – Will Phillips Oct 23 '12 at 00:13
  • yes. it gets used as the first argument that isn't otherwise specified. – Ben Bolker Oct 23 '12 at 00:27
  • These RVs are not chisquare. I'm calculating power in a comparison of variances, which rejects Ho given Ha is true. Under the assumption of Ho, this is Chisquare. But under Ha it isn't. I haven't learned enough about non-central chisquare to know if that might be the case. – Will Phillips Oct 23 '12 at 02:34
  • I think they *are* chi-squared, scaled by an appropriate factor, under the assumption of different variances; under the assumption that the *mean* was different they would be non-central chi-square. (Notice that my simulation with chi-squared samples works almost perfectly ...) – Ben Bolker Oct 23 '12 at 11:56

In general, you want as far as possible to take advantage of vectorization, not so much for speed as readability/comprehension.

Why is the write to power[n] inside the inner loop (and, I guess, the calculation of vecpvals as well)? Shouldn't those be in the outer loop, after the inner loop finishes? You may also want to move the calculation of the square root outside both loops.

Why are teststatvect and power initialized as matrices (which are explicitly two dimensional arrays) and not vectors (or rather, as one dimensional arrays, using array)? Is `variance at 225` just the end of the comment from the previous line? You may want to check your formatting. (Is this homework?)

For what it looks like you're trying to do here, you may want to take advantage of the very handy function replicate, perhaps by writing a specific function to call it on.
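
A sketch of that replicate-based approach, wrapped in a function as suggested (parameters copied from the question; the function name is my own):

```r
## One power estimate per sample size, using replicate for the m samples
one_power <- function(n, m = 1000, popmean = 120, popvar = 225,
                      newvar = 144, alpha = 0.01) {
  sdnew <- sqrt(newvar)              # square root computed once
  ts <- replicate(m, sum((rnorm(n, popmean, sdnew) - popmean)^2) / popvar)
  mean(pchisq(ts, df = n) <= alpha)  # proportion of p-values <= alpha
}
```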

Glen_b
  • Thank you for the tip about replicate. I need to research that. As to your "why" questions, the answer is in my original post. I'm new to R and programming. I'm just trying to figure things out. I wrote code that worked but not efficiently. Just want to see if it can be better. As to your question about homework, no. – Will Phillips Oct 22 '12 at 23:59