
Consider the first example: it calculates the running mean inside the loop.

st <- Sys.time()  # starting time
set.seed(123456789)
vara  <- c()  # per-sample variances
sda   <- c()  # per-sample standard deviations
mvara <- c()  # running mean of the variances
msda  <- c()  # running mean of the standard deviations

K <- 100000

for (i in 1:K) {
  a <- rnorm(30)
  vara[i] <- var(a)
  sda[i]  <- sd(a)
  mvara[i] <- mean(vara)  # mean over a vector that grows each iteration
  msda[i]  <- mean(sda)
}

et <- Sys.time()

et - st  # time taken by this code (roughly a minute or more)

Now consider the same code, except that the running means are calculated outside the loop.

st <- Sys.time()  # starting time
set.seed(123456789)
vara <- c()  # per-sample variances
sda  <- c()  # per-sample standard deviations

K <- 100000

for (i in 1:K) {
  a <- rnorm(30)
  vara[i] <- var(a)
  sda[i]  <- sd(a)
}

mvara <- cumsum(vara) / (1:K)  # running mean of the variances, in one pass
msda  <- cumsum(sda)  / (1:K)  # running mean of the standard deviations
et <- Sys.time()

et - st  # time taken by this version (under five seconds)

I just want to know: why is there such a large difference in performance between the two versions? What should one watch out for when using loops?

Neeraj
  • Possible duplicate of [Why are loops slow in R?](https://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r) – NelsonGon Feb 21 '19 at 16:18
  • Not a duplicate. My question is very specific: both versions use a loop, but the first is very slow while the second is not. – Neeraj Feb 21 '19 at 16:55
  • The duplicate explains it: in the first case, you call the `mean` function 100,000 separate times, while in the second case you call the `cumsum` function, which is optimized to perform this operation without all the overhead of calling `mean` so many times. – divibisan Feb 21 '19 at 20:07
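
To make the overhead described in that last comment concrete, here is a minimal base-R sketch (not from the original thread; timings are machine-dependent) comparing one `mean()` call per growing prefix against a single `cumsum()`:

set.seed(1)
x <- rnorm(20000)

st <- Sys.time()
m1 <- sapply(seq_along(x), function(i) mean(x[1:i]))  # one mean() call per prefix
Sys.time() - st  # slow: each call re-scans a growing prefix, O(K^2) work overall

st <- Sys.time()
m2 <- cumsum(x) / seq_along(x)  # one vectorized pass, O(K) work
Sys.time() - st  # near-instant

all.equal(m1, m2)  # TRUE: same result, very different cost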

1 Answer


R is fastest when you use its internal optimized code to execute loops. My understanding of the reasons behind that is limited (the thread linked in the comments above has explanations from more knowledgeable people), but I believe some of it has to do with memory pre-allocation, and some with the way R transforms the problem into more efficient pieces.
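
As a rough illustration of the pre-allocation point, a minimal sketch in base R (recent R versions mitigate the cost of growing a vector, so the gap may be smaller than it once was):

K <- 100000

st <- Sys.time()
grow <- c()                  # starts empty and is extended on every iteration
for (i in 1:K) grow[i] <- i^2
Sys.time() - st

st <- Sys.time()
prealloc <- numeric(K)       # all K slots allocated up front, filled in place
for (i in 1:K) prealloc[i] <- i^2
Sys.time() - st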

Your code "outside the loop" could be made yet about 20x faster (on my system, went from 7.17 sec to 0.43 sec) by creating all your random numbers first, and then solving the whole table at once, instead of swapping between those two tasks in your loop. And that's using dplyr; I presume a data.table solution could be another 5-10x faster, especially given the large number of groups.

library(dplyr)
set.seed(123456789)
K <- 100000
n <- 30

# one long table: K trials of n draws each
a_df <- data.frame(trial = rep(1:K, each = n),
                   val   = rnorm(K * n))

results <- a_df %>%
  group_by(trial) %>%
  summarize(vara = var(val),   # per-trial variance
            sda  = sd(val)) %>%
  mutate(mvara = cumsum(vara) / trial,   # running means across trials
         msda  = cumsum(sda)  / trial)
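
A hypothetical sketch of what that data.table solution might look like (not part of the original benchmark, and not timed here; it assumes the same set-up as above):

library(data.table)
set.seed(123456789)
K <- 100000
n <- 30

a_dt <- data.table(trial = rep(1:K, each = n),
                   val   = rnorm(K * n))

# per-trial summaries, then running means across trials
results <- a_dt[, .(vara = var(val), sda = sd(val)), by = trial]
results[, `:=`(mvara = cumsum(vara) / trial,
               msda  = cumsum(sda)  / trial)]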
Jon Spring