
I am writing a simulation function to calculate the power of a t test in R. However, writing loops in R is supposedly inefficient. Is there any other way I can achieve my goal without a loop?

# Define a simulation function
simulation <- function(N,alpha,sigma,diff,mu1){
  p_values = c()
  for(i in 1:10000){
    group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
    group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
    p_values[i] <- t.test(group1,group2)$p.value
  }
  
  prop.table(table(p_values<alpha))[["TRUE"]]
}
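
For reference, a hypothetical call to this function (the parameter values are illustrative, not taken from the question) would look like:

## estimate power for total sample size 1000, mean difference 0.1, sd 1
simulation(N = 1000, alpha = 0.05, sigma = 1, diff = 0.1, mu1 = 1)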
Dianafreedom
  • I believe that loops are not necessarily slow in R, and it's also not necessarily faster to use `lapply` variants. Loops are slow in comparison to vectorization! If there is no good way to vectorize your problem, then choosing either a loop or an `lapply` solution (which is also just a loop over the elements of the vector/list you pass in as an argument) is perfectly fine. – Mario Niepel Dec 21 '20 at 18:18
  • https://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized – Ben Bolker Dec 21 '20 at 18:20
  • Thanks @BenBolker for that rabbit hole. So in the end it seems to all come down to the details of which part is written efficiently in C and which part is not? I tend to write lapply solutions myself since they don't clutter up the code and the functions can be hidden/reused more easily. But the reflexive 'don't write loops in R' advice seems to be somewhat overblown. – Mario Niepel Dec 21 '20 at 18:24
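
(Illustrative aside, not part of the original thread: the distinction made in these comments can be seen with a toy example. The explicit loop and the vapply() call take broadly similar time, while the truly vectorized form is much faster, because the whole computation happens in compiled code.)

x <- rnorm(1e6)
system.time({ out <- numeric(length(x)); for (i in seq_along(x)) out[i] <- x[i]^2 })  ## explicit loop
system.time(out2 <- vapply(x, function(z) z^2, numeric(1)))  ## apply-family "loop"
system.time(out3 <- x^2)  ## vectorized: one call into compiled code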

4 Answers


tl;dr The loops are fine. The only way I found to speed this up significantly was to write a bespoke version that strips stats:::t.test.default down to only the essential code required to compute the p-value (skipping tests for different options, computation of the confidence interval, etc.). That gives about a 2x speedup; I don't see an easy way to speed things up further without coding in C++, e.g. with the Rcpp package (a minimal sketch follows the timing results below).

More notes:

  • pre-allocating the p_values vector is the first thing I tried, but it makes little overall difference (the t.test() function is the bottleneck)
  • replacing prop.table(...) with mean(p_values<alpha) also made little difference
  • power.t.test() solves the same problem (I think: I'm not sure whether it assumes equal variances or not) and is much faster (but that may not be the point of your question)
  • another possible way to speed this up (although I doubt it will do much) is to pick all of the normal deviates at once and stick them in appropriately dimensioned matrices, then index the matrices (rather than calling rnorm() every time). This seems annoying and my guess is that it won't make too much difference in this case.
  • you can actually vectorize the entire calculation, although this might not work if you wanted to do something more specialized/complicated. I wrote it (sim3 below): it doesn't give exactly the same answer (0.3498 vs 0.35215 for nsim=1e5) as the other sims, but I think this is because the random numbers are drawn in a slightly different order. To my surprise, it's not much faster than the second version ...
## rewrite sim1 function slightly: add convenient default values
sim1 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
    p_values = c()
    set.seed(12345)
    for(i in seq(nsim)) {
        group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
        group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
        p_values[i] <- t.test(group1,group2)$p.value
    }
    prop.table(table(p_values<alpha))[["TRUE"]]
}
## stripped-down function to compute t-test p value, based on stats:::t.test.default
my_t <- function(x,y) {
    vx <- var(x)
    vy <- var(y)
    nx <- length(x)
    ny <- length(y)
    stderrx <- sqrt(vx/nx)
    stderry <- sqrt(vy/ny)
    stderr <- sqrt(stderrx^2 + stderry^2)
    df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
    tstat <- (mean(x) - mean(y))/stderr
    pval <- 2 * pt(-abs(tstat), df)
    return(pval)
}
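## (sanity check, not in the original answer) my_t should reproduce the
## Welch p-value that t.test() computes by default, e.g.:
##   set.seed(101); x <- rnorm(20); y <- rnorm(20, 0.5)
##   all.equal(my_t(x, y), t.test(x, y)$p.value)  ## TRUE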
## faster sim function
sim2 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
    p_values <- numeric(nsim)  ## pre-allocate loop
    set.seed(12345)
    for(i in seq(nsim)) {
        group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
        group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
        p_values[i] <- my_t(group1, group2)
    }
    mean(p_values<alpha) ## replace prop.table with cheaper alternative
}
## vectorized sim function
sim3 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
    set.seed(12345)
    group1 <- matrix(rnorm(n=N/2*nsim, mean=mu1,sd=sigma),
                     nrow=nsim)
    group2 <- matrix(rnorm(n=N/2*nsim, mean=mu1+diff,sd=sigma),
                     nrow=nsim)
    vx <- apply(group1, 1, var)
    vy <- apply(group2, 1, var)
    nx <- ny <- N/2
    stderrx <- sqrt(vx/nx)
    stderry <- sqrt(vy/ny)
    stderr <- sqrt(stderrx^2 + stderry^2)
    df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
    tstat <- (rowMeans(group1) - rowMeans(group2))/stderr
    p_values <- 2 * pt(-abs(tstat), df)
    mean(p_values<alpha) ## replace prop.table with cheaper alternative
}

identical(sim1(), sim2())  ## TRUE (value= 0.3553)
system.time(sim1(nsim=1e5))  ## 11.6 seconds
system.time(sim2(nsim=1e5))  ## 6 seconds
power.t.test(n=500,delta=0.1,sd=1)  ## value=0.3518
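
Going further would mean dropping into C++. Purely as an illustration (not part of the original answer), a minimal Rcpp version of the same stripped-down Welch p-value calculation might look like the sketch below; my_t_cpp is a hypothetical name, and the Rcpp package is assumed to be installed. It could be substituted for my_t() inside sim2().

library(Rcpp)
cppFunction('
double my_t_cpp(NumericVector x, NumericVector y) {
    int nx = x.size(), ny = y.size();
    double sex2 = var(x)/nx;   // squared standard errors
    double sey2 = var(y)/ny;
    double se = sqrt(sex2 + sey2);
    double df = pow(sex2 + sey2, 2) /
        (pow(sex2, 2)/(nx - 1) + pow(sey2, 2)/(ny - 1));
    double tstat = (mean(x) - mean(y))/se;
    return 2.0 * R::pt(-fabs(tstat), df, 1, 0);  // lower tail, not logged
}')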
Ben Bolker

You can actually do away with the for loop altogether by generating single streams of group1 and group2 values, dimensioning them as matrices with ncol = bigNum, and then using sapply() to run the t tests column by column:

## example parameter values (assumed; not defined in the original snippet)
N <- 1000; mu1 <- 1; diff <- 0.1; sigma <- 1
bigNum <- 100000
iters <- bigNum * (N/2)
group1 <- rnorm(n=iters, mean = mu1, sd = sigma)
group2 <- rnorm(n=iters, mean = mu1+diff, sd = sigma)
m1 <- matrix(group1, ncol = bigNum)
m2 <- matrix(group2, ncol = bigNum)
pvalues <- sapply(1:bigNum, function(x) t.test(m1[ , x], m2[ , x])$p.value)
SteveM
  • Yes, but I don't think this will actually speed things up very much. Did you try a timing test ...? – Ben Bolker Dec 21 '20 at 19:39
  • No test. But you could try it at bigNum = 1000 first. Let us know which solution works for you if any. Good luck. – SteveM Dec 21 '20 at 19:44
  • I get this solution as taking **longer** than the original example on my system (13+ vs 11+ seconds for `bigNum=1e5`), possibly because of time taken for memory allocation ... vs about 6 seconds for both of my solutions – Ben Bolker Dec 21 '20 at 19:52

Based on my stats methods courses, you can use lists with lapply() as well as mapply():

simulation <- function(N,alpha,sigma,diff,mu1){
  set.seed(12345)
  ## lapply() replaces the loop: each list element holds one simulated sample
  Lgroup1 <- lapply(1:10000, function(x) rnorm(n=N/2, mean = mu1, sd = sigma))
  Lgroup2 <- lapply(1:10000, function(x) rnorm(n=N/2, mean = mu1+diff, sd = sigma))
  ## mapply() walks the two lists in parallel and collects the p-values
  p_values <- mapply(function(x,y) t.test(x,y)$p.value,
                     x=Lgroup1, y=Lgroup2)
  prop.table(table(p_values<alpha))[["TRUE"]]
}

Explanation:

lapply() replaces the loop that creates the group 1 and group 2 samples. Then mapply() walks the two lists in parallel, computing the p-values and storing them in a vector for later use.
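
As a standalone illustration of the mapply() pattern (not from the answer itself): it applies the function to one element of each of its list/vector arguments at a time.

## two pairs of samples in, two p-values out
mapply(function(x, y) t.test(x, y)$p.value,
       x = list(rnorm(100), rnorm(100)),
       y = list(rnorm(100, 1), rnorm(100, 1)))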


On my machine, the ttest2 function from the Rfast package is a bit faster than @BenBolker's sim2() function (a serial sketch using ttest2 is appended after the reprex below). If you can live with a slightly different random seed initialization, you might be able to speed the function up further using the parallel package (given multiple cores and sufficient memory) on a Linux/macOS system:

library(parallel)
library(Rfast)
#> Loading required package: Rcpp
#> Loading required package: RcppZiggurat

my_t <- function(x,y) {
  vx <- var(x)
  vy <- var(y)
  nx <- length(x)
  ny <- length(y)
  stderrx <- sqrt(vx/nx)
  stderry <- sqrt(vy/ny)
  stderr <- sqrt(stderrx^2 + stderry^2)
  df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
  tstat <- (mean(x) - mean(y))/stderr
  pval <- 2 * pt(-abs(tstat), df)
  return(pval)
}

sim2 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
  p_values <- numeric(nsim)  ## pre-allocate loop
  set.seed(12345)
  for(i in seq(nsim)) {
    group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
    group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
    p_values[i] <- my_t(group1, group2)
  }
  mean(p_values<alpha) ## replace prop.table with cheaper alternative
}

sim5 <- function(N=1000, alpha=0.05, sigma=1, diff=0.1, mu1=1, nsim=1e4, 
                 ncores=detectCores() - 1){
  ok <- RNGkind()
  RNGkind("L'Ecuyer-CMRG")
  set.seed(12345)
  y <- mclapply(seq(nsim), function(i){
    group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
    group2 <- rnorm(n=N/2, mean = mu1 + diff, sd = sigma)
    ttest2(group1, group2)[2]
  }, mc.cores = ncores, mc.set.seed = TRUE)
  RNGkind(ok[1])
  mean(unlist(y, use.names = FALSE) < alpha)
}

system.time({ s2 <- sim2(nsim=1e5)})
#>    user  system elapsed 
#>   8.214   0.196   8.074
s2
#> [1] 0.34898
system.time({ s5 <- sim5(nsim=1e5)})
#>    user  system elapsed 
#>  17.196   0.573   1.712
s5
#> [1] 0.35056

Created on 2020-12-21 by the reprex package (v0.3.0)
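
For reference, a serial counterpart of sim2() built on ttest2() might look like the sketch below (not part of the original reprex; sim4 is a hypothetical name, and element [2] of the ttest2() result is taken to be the p-value, as in sim5() above):

sim4 <- function(N=1000, alpha=0.05, sigma=1, diff=0.1, mu1=1, nsim=1e4) {
  p_values <- numeric(nsim)
  set.seed(12345)
  for (i in seq(nsim)) {
    group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
    group2 <- rnorm(n=N/2, mean = mu1 + diff, sd = sigma)
    p_values[i] <- ttest2(group1, group2)[2]  ## Rfast's two-sample t test
  }
  mean(p_values < alpha)
}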

user12728748