
Say I have something like:

# Create some data:
treatment <- round(runif(20, min = 0, max = 1),0)
d2 <- round(runif(20, min = 0, max = 1),0)
bxd2 <- treatment * d2
infection <- round(runif(20, min = 0, max = 100),0) 
lung <- round(runif(20, min = 0, max = 100),0) 
head <- round(runif(20, min = 0, max = 100),0) 

df <- data.frame(treatment, d2, bxd2, infection, lung, head)

rm(treatment, d2, bxd2, infection, lung, head)


reg_func <- function(i, data) {
  # build the formula from the data argument rather than the global df
  form <- paste(colnames(data)[i + 3], "treatment + d2 + bxd2", sep = " ~ ")
  form <- as.formula(form)
  lm(form, data = data)
}

for (i in 1:3) {
  name <- paste0("reg", i)
  assign(name, reg_func(i, df))
}

Now this works the way I would like: I end up with reg1, ..., regN assigned in the workspace (a bad habit, but it works well for econometrics).

My question is: why would I want to turn something like the above into an apply instance? The for loop seems so easy, yet I constantly hear people say "... you should really use [X]apply".
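For concreteness, I assume the rewrite people have in mind is something like the sketch below (regs is just a name I made up):

# Sketch of an [X]apply version: one list instead of reg1..reg3
regs <- lapply(1:3, function(i) {
  form <- as.formula(paste(colnames(df)[i + 3], "~ treatment + d2 + bxd2"))
  lm(form, data = df)
})
names(regs) <- paste0("reg", 1:3)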

Repmat
    to limit the amount of crud in your workspace; to make your code more transparent to whoever might need to maintain it; for the sake of efficiency; so that you know which outcome you are predicting in your model – Russ Hyde Mar 18 '15 at 19:17
    http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized and http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – Metrics Mar 18 '15 at 19:47
    http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – Rich Scriven Mar 18 '15 at 19:49

3 Answers


It's not much faster:

> a <- seq(300)
> system.time(replicate(1000, sapply(a, mean)))
   user  system elapsed 
  2.215   0.000   2.216 
> v <- c()
> system.time(replicate(1000, for(i in a){v <- c(v,mean(i))}))
   user  system elapsed 
  2.315   0.000   2.315 

But it prevents this from happening:

> i <- 1
> for(i in a) mean(i)
> i
[1] 300

And it keeps things nice and clean:

> sapply(a, mean)
> ls()
[1] "a"  
> for(i in a) mean(i)
> ls()
[1] "a"    "i"    
Mehdi Nellen
  • The time comparison in this case is inaccurate because `sapply` allocates a new vector and stores the results in it, so the two aren't doing the same thing. – mrip Mar 18 '15 at 19:27
  • But your code doesn't produce the same result in both cases, so comparing the timing of both versions doesn't tell you anything about the speed of one approach versus the other. – mrip Mar 18 '15 at 19:30

The short answer is don't worry about using a loop. The main reasons for using apply are, first, in some cases it can be more efficient, and second, it makes the code cleaner.

But sometimes it doesn't make the code cleaner. For example, if you really want a bunch of variables named reg1, reg2, etc., as opposed to a single list called reg, then your version is cleaner. And since you are doing a regression each time through the loop, the performance difference will be tiny regardless of how you code it, because most of the work is in the regression, not the looping.

Now, I would argue that a list named reg is a more useful way to store these results, for a lot of reasons. For example, if you want to iterate through them, you can just do reg[[i]] as opposed to pasting strings. But it sounds like you've thought about this and decided on naming the variables like this for a reason.
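For example, with your data it could look something like this (a sketch; outcomes and reg are just names I picked):

outcomes <- colnames(df)[4:6]   # "infection", "lung", "head"
reg <- lapply(outcomes, function(y) {
  lm(as.formula(paste(y, "~ treatment + d2 + bxd2")), data = df)
})
names(reg) <- outcomes

# Iterating over the fits is now just list indexing:
lapply(reg, summary)                            # summaries of all three fits
sapply(reg, function(m) coef(m)["treatment"])   # treatment coefficient per outcome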

mrip

Yes, apply() really isn't any faster than a well-written loop:

> mat <- matrix(rnorm(1000 * 1000), nrow = 1000)
> system.time({
+     v <- numeric(1000)
+     for (i in 1:1000) v[i] <- mean(mat[i, ])
+ })
## user  system elapsed 
## 0.021   0.001   0.023 
>
> system.time(apply(mat, 1, mean))
## user  system elapsed 
## 0.021   0.001   0.022

For matrices, if you want to take row means, this is better:

> system.time(rowMeans(mat))
## user  system elapsed 
## 0.003   0.000   0.003 

But for lists and data frames, lapply() and sapply() can be faster. Here df is the matrix above converted to a 1000-column data frame:

> df <- as.data.frame(mat)
> system.time({
+     v <- numeric(1000)
+     for (i in 1:1000) v[i] <- mean(df[[i]])
+ })
## user  system elapsed 
## 0.015   0.000   0.016 
>
> system.time(sapply(df, mean))
## user  system elapsed 
## 0.008   0.000   0.008

Because R is a scripting language, it is usually slower than lower-level languages like C++. So if possible, use functions whose work is done in compiled or byte-compiled code (like rowMeans()); it will save you time.
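For example, for the 1000-column data frame above, colMeans() (which does its work in compiled code) should be faster still than sapply(df, mean); exact timings will vary by machine, so this is only a sketch:

# colMeans() coerces the data frame to a matrix once, then runs in compiled code
system.time(colMeans(df))

# Same values as the sapply() version, up to floating-point error
all.equal(colMeans(df), sapply(df, mean))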

Lytze
  • Downvote rationale: In general `apply` would not be used in this instance but rather either `sapply` or `lapply`. Also misses the point of the question which is not focused on speed but rather code clarity and maintainability when looping over parameters. – IRTFM Mar 18 '15 at 20:22