
I've been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was helpful. I would like to know if there is a better alternative for the following process (and why the alternative would be better):

I needed to run a series of repeated-measures ANOVAs and approached the problem similarly to the reproducible example you see below.

[I am aware that there are other issues with running multiple ANOVA models and that there are other options for these sorts of analyses, but for now I'd simply like to hear about the use of the for loop.]

As an example, here are four repeated-measures ANOVA models: four dependent variables, each measured on three occasions:

set.seed(1976)
code <- seq(1:60)
time <- rep(c(0,1,2), each = 20)
DV1 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat <- data.frame(code, time, DV1, DV2, DV3, DV4)

outANOVA <- list()

for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[i] <- summary(aov(y ~ factor(time) + Error(factor(code)), 
                                  data = dat))
}

outANOVA
Peter Miksza
  • You could just use `lapply`, like `lapply(dat, function(x) summary(aov(x ~ factor(time) + Error(factor(code)), data = dat)))`. If you're using a `for` loop, it's best to preallocate if possible. – A5C1D2H2I1M1N2O1R2T1 Feb 14 '18 at 17:49
  • Don't fall for such advice. It comes from a time when the main interpreter was S-Plus, and its loops were so bad it had two: `for` and `For`. These days the first rule is to write _clear_ and _correct_ code. – Dirk Eddelbuettel Feb 14 '18 at 18:04
  • To echo Dirk, read the (short) article on page 46: ['How can I avoid this loop or make it faster?'](https://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf) by Uwe Ligges and John Fox, who know R as well as anyone. Among other things, they advise *'Do not avoid loops simply for the sake of avoiding loops'*. – Hugh Mar 26 '18 at 12:07

2 Answers


You could write it this way; it's more compact:

outANOVA <-
  lapply(dat, function(y)
    summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))

for loops are not necessarily slower than apply functions, but many people find them harder to read. To some extent it is a matter of taste.

The real crime is using a for loop when a vectorized function is available. These vectorized functions usually contain loops written in C that are much faster (or they call functions that do).
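
As a minimal illustration of that point (a sketch, with timings left to the reader), squaring a vector with the vectorized `^` and with an explicit R loop gives the same result, but the vectorized call runs its loop in C:

x <- rnorm(1e6)

# vectorized: a single call whose loop runs in C
y1 <- x^2

# explicit R loop over a preallocated vector: same result, much slower
y2 <- numeric(length(x))
for (i in seq_along(x)) y2[i] <- x[i]^2

all.equal(y1, y2)
# [1] TRUE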

Notice that in this case we also avoid creating a global variable y, and we don't have to initialize the list outANOVA.

Another point, directly from this relevant post: For loops in R and computational speed (answer by Glen_b):

For loops in R are not always slower than other approaches, like apply, but there's one huge bugbear: never grow an array inside a loop.

Instead, make your arrays full-size before you loop and then fill them up.

In your case you're growing outANOVA; for big loops that could become problematic.
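
For reference, here is a sketch of a preallocated version of your loop. It assumes DV1 to DV4 are the only outcome columns (your original loop also runs over code and time), and uses `[[` rather than `[` so the full summary object is stored for each variable:

# preallocate a named list of the right length, then fill it
dvs <- c("DV1", "DV2", "DV3", "DV4")
outANOVA <- vector(mode = "list", length = length(dvs))
names(outANOVA) <- dvs
for (i in dvs) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}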

Here is a microbenchmark of different methods on a simple example:

n <- 100000
microbenchmark::microbenchmark(
preallocated_vec  = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec   = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list  = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply            = sapply(1:n, function(i) i^2),
lapply            = lapply(1:n, function(i) i^2),
times=20)

# Unit: milliseconds
# expr                     min         lq       mean     median         uq        max neval
# preallocated_vec    9.784237  10.100880  10.686141  10.367717  10.755598  12.839584    20
# preallocated_vec2   9.953877  10.315044  10.979043  10.514266  11.792158  12.789175    20
# incremented_vec    74.511906  79.318298  81.277439  81.640597  83.344403  85.982590    20
# preallocated_list  10.680134  11.197962  12.382082  11.416352  13.528562  18.620355    20
# incremented_list  196.759920 201.418857 212.716685 203.485940 205.441188 393.522857    20
# sapply              6.557739   6.729191   7.244242   7.063643   7.186044   9.098730    20
# lapply              6.019838   6.298750   6.835941   6.571775   6.844650   8.812273    20
moodymudskipper
  • Yes, those are clear benefits. Thank you. Would the same logic hold if instead of getting the singular ANOVA output, I wanted to extract - say - F statistics and p values for each analysis to be converted into a table later? Then I'd be storing multiple separate items rather than just `outANOVA`. Would a `lapply` statement work in that situation? – Peter Miksza Feb 14 '18 at 18:00
  • Yes, you'd store each set of results in a sublist, and then you'd reformat the data afterward (see the sketch after these comments). If you want to include an example to clarify, I'll sort it out for you. – moodymudskipper Feb 14 '18 at 18:05
  • Pretty sure this code is looking for the variable `y` in `dat`, which will not work; the usual method is to build the formula using `paste` first. `do.call` can also be helpful. See for example https://stackoverflow.com/a/17026386/210673 – Aaron left Stack Overflow Feb 14 '18 at 18:05
  • "for loops are not much slower than apply functions but they're less easy to read." Actually, for loops are much faster than sapply/lapply loops, because they don't have the overhead of using lists, which are very unperformant. Benchmark the following: `x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}` and `sapply(1:n, function(i) i^2)` – thc Feb 14 '18 at 18:55
  • Alright, but they can be less performant if incrementing a list, for example. Would it be fair to say "for loops are not necessarily slower"? Note that in the OP's case he is using a list anyway, and in your example sapply is coercing a list to a vector. Still, my answer should be general. – moodymudskipper Feb 14 '18 at 18:57
  • @Moody_Mudskipper sure, makes sense. But I find for my own use, if you use `for`, you don't need lists. For example, if your iteration output is a string and a number, you have no choice but to use lists with `lapply`, but with `for` it is natural. – thc Feb 14 '18 at 19:23
  • I've added a microbenchmark ;). Waiting for some comments before describing the results. – moodymudskipper Feb 14 '18 at 19:24
  • @Moody_Mudskipper wasn't expecting that tbh. `sapply` always takes longer than `lapply` due to simplification, ok, but why does `incremented_list` take that long? Growing such a long list really takes its toll in R... – catastrophic-failure Feb 14 '18 at 19:37
  • A bit of simplification attained a bit more performance on my end (using list indexing, and pre-compiled functions directly): `[...] incremented_list2 = {x <- list(); for(i in 1:n) {x[[i]] <- i^2}}, sapply2 = sapply(1:n, "^", 2), lapply2 = lapply(1:n, "^", 2), [...]` – catastrophic-failure Feb 14 '18 at 19:43
  • Maybe `x[[i]] <-` would have given a different result? I can't update now – moodymudskipper Feb 14 '18 at 19:44
  • @Moody_Mudskipper No, it doesn't, with `"[["` you're accessing the list element and allocating, with `"["` the resulting query is also a list. Also optimization flags don't change results that much from what I gathered. – catastrophic-failure Feb 14 '18 at 19:45
  • I would love to add to this post an example where for loops are indeed faster, so far I say they can be but I don't show any evidence. – moodymudskipper Mar 25 '18 at 21:34
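
Following up on the exchange above about extracting F statistics and p-values: a rough, hypothetical sketch (the extract_time helper is mine, not from the answer). It assumes outANOVA was built as above, and it searches every stratum of each summary.aovlist object, since which stratum holds the factor(time) effect depends on the Error() structure:

extract_time <- function(s) {
  # each stratum of a summary.aovlist holds one ANOVA table;
  # row names carry trailing spaces, hence trimws()
  for (stratum in s) {
    tab <- stratum[[1]]
    hit <- trimws(rownames(tab)) == "factor(time)"
    if (any(hit)) return(data.frame(F = tab[hit, "F value"],
                                    p = tab[hit, "Pr(>F)"]))
  }
  NULL
}

do.call(rbind, lapply(outANOVA, extract_time))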

For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.

Here's an example below, where I ran microbenchmark::microbenchmark on your solution as presented in the OP, Moody's solution as in his post, and a third solution of mine with even more vectorization (a triple nested lapply).

Microbenchmark

set.seed(1976)
code = seq(1:60)
time = rep(c(0,1,2), each = 20)
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)

library(microbenchmark)

microbenchmark(
    `Peter Miksza` = {
        outANOVA1 = list()
        for (i in names(dat)) {
            y = dat[[i]]
            outANOVA1[i] = summary(aov(y ~ factor(time) + Error(factor(code)), 
                data = dat))
        }
    },
    Moody_Mudskipper = {
        outANOVA2 =
            lapply(dat,function(y)
                summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
    },
    `catastrophic_failure` = {
        outANOVA3 = 
            lapply(lapply(lapply(dat,
                function(y) y ~ factor(time) + Error(factor(code))),
                aov, data = dat), summary)
    },
    times = 1000L)

Results

#Unit: milliseconds
#                 expr      min       lq     mean   median       uq       max neval cld
#         Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448  1000   b
#     Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811  1000  a 
# catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666  68.87781  1000  a 

Fiddling with JIT compilation, i.e. running compiler::setCompilerOptions(optimize = 0) and compiler::enableJIT(0) to disable it, the following results ensue as well:

#Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval cld
#         Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731  1000   a
#     Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284  1000   a
# catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036  1000   a

Conclusion

As alluded to in Dirk's comment, there isn't a difference in performance, but readability is greatly impaired by the vectorization.

On growing lists

Experimenting with Moody's solutions, it seems growing lists can be a bad idea if the resulting list is moderately long. Using byte-compiled functions directly can also provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application, though.
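
For reference, the variants suggested in the comments to the first answer can be benchmarked like this (a sketch; n is as in Moody's benchmark, and the *2 names are just labels):

n <- 100000
microbenchmark::microbenchmark(
  incremented_list2 = {x <- list(); for (i in 1:n) {x[[i]] <- i^2}},
  sapply2           = sapply(1:n, "^", 2),
  lapply2           = lapply(1:n, "^", 2),
  times = 20)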

catastrophic-failure
  • As I noted in my answer, this is really a matter of taste; I personally much prefer the `lapply` version visually. And it doesn't clutter the workspace. – moodymudskipper Feb 14 '18 at 19:36
  • @Moody_Mudskipper Yeah, I'm really used to it as well, makes it easier to think in terms of how we apply functions, as we use elements in a list instead of indices pointing to said elements. – catastrophic-failure Feb 14 '18 at 19:38
  • I'd be curious to poll R users; I feel the SO community doesn't like loops that much overall, but maybe it's not so clear. – moodymudskipper Feb 14 '18 at 19:39