tl;dr the loops are fine. The only way I found to speed this up significantly was to write a bespoke version that strips stats:::t.test.default
down to only the essential code required to compute the p-value (skipping tests for different options, computation of confidence interval, etc.). That gets a factor of about 2x speedup; I don't see an easy way to speed up further without coding in C++ (e.g. with the Rcpp
package).
More notes:
- pre-allocating the
p_values
vector is the first thing I tried, but it makes little overall difference (the t.test()
function is the bottleneck)
- replacing
prop_table(...)
with mean(p_values<alpha)
also made little difference
power.t.test()
solves the same problem (I think: I'm not sure whether it assumes equal variances or not) and is much faster (but that may not be the point of your question)
- another possible way to speed this up (although I doubt it will do much) is to pick all of the normal deviates at once and stick them in appropriately dimensioned matrices, then index the matrices (rather than calling
rnorm()
every time). This seems annoying and my guess is that it won't make too much difference in this case.
- you can actually be able to vectorize the entire calculation — but this might not work if you wanted to do something more specialized/complicated. I wrote it: it doesn't give exactly the same answer (0.3498 vs 0.0.35215 for nsim=1e5) as the other sims, but I think? this is because the random numbers are assigned in a slightly different order. To my surprise, it's not much faster than the second version ...
## rewrite sim1 function slightly: add convenient default values
sim1 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
p_values = c()
set.seed(12345)
for(i in seq(nsim)) {
group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
p_values[i] <- t.test(group1,group2)$p.value
}
prop.table(table(p_values<alpha))[["TRUE"]]
}
## stripped-down function to compute t-test p value, based on stats:::t.test.default
my_t <- function(x,y) {
vx <- var(x)
vy <- var(y)
nx <- length(x)
ny <- length(y)
stderrx <- sqrt(vx/nx)
stderry <- sqrt(vy/ny)
stderr <- sqrt(stderrx^2 + stderry^2)
df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
tstat <- (mean(x) - mean(y))/stderr
pval <- 2 * pt(-abs(tstat), df)
return(pval)
}
## faster sim function
sim2 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
p_values <- numeric(nsim) ## pre-allocate loop
set.seed(12345)
for(i in seq(nsim)) {
group1 <- rnorm(n=N/2, mean = mu1, sd = sigma)
group2 <- rnorm(n=N/2, mean = mu1+diff, sd = sigma)
p_values[i] <- my_t(group1, group2)
}
mean(p_values<alpha) ## replace prop.table with cheaper alternative
}
## vectorized sim function
sim3 <- function(N=1000,alpha=0.05,sigma=1,diff=0.1,mu1=1, nsim=1e4) {
set.seed(12345)
group1 <- matrix(rnorm(n=N/2*nsim, mean=mu1,sd=sigma),
nrow=nsim)
group2 <- matrix(rnorm(n=N/2*nsim, mean=mu1+diff,sd=sigma),
nrow=nsim)
vx <- apply(group1, 1, var)
vy <- apply(group2, 1, var)
nx <- ny <- N/2
stderrx <- sqrt(vx/nx)
stderry <- sqrt(vy/ny)
stderr <- sqrt(stderrx^2 + stderry^2)
df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
tstat <- (rowMeans(group1) - rowMeans(group2))/stderr
p_values <- 2 * pt(-abs(tstat), df)
mean(p_values<alpha) ## replace prop.table with cheaper alternative
}
identical(sim1(), sim2()) ## TRUE (value= 0.3553)
system.time(sim1(nsim=1e5)) ## 11.6 seconds
system.time(sim2(nsim=1e5)) ## 6 seconds
power.t.test(n=500,delta=0.1,sd=1) ## value=0.3518