22

As a matter of best practice, I'm trying to determine whether it's better to write a function and apply() it across a matrix, or to simply loop over the matrix inside the function. I tried it both ways and was surprised to find that apply() is slower. The task is to take a vector, test whether each element is positive or negative, and return a vector with 1 where the element is positive and -1 where it is not. The mash() function loops, and the squish() function is passed to apply().

million <- as.matrix(rnorm(100000))  # note: 100,000 values, despite the name

mash <- function(x){
  # overwrite each element in place with 1 (positive) or -1 (otherwise)
  for(i in seq_len(NROW(x))) {
    if(x[i] > 0) {
      x[i] <- 1
    } else {
      x[i] <- -1
    }
  }
  return(x)
}

squish <- function(x){
  if(x > 0) {
    return(1)
  } else {
    return(-1)
  }
}


ptm <- proc.time()
loop_million <- mash(million)
proc.time() - ptm


ptm <- proc.time()
apply_million <- apply(million,1, squish)
proc.time() - ptm

loop_million results:

user  system elapsed 
0.468   0.008   0.483 

apply_million results:

user  system elapsed 
1.401   0.021   1.423 

What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:

> class(apply_million)
[1] "numeric"
> class(loop_million)
[1] "matrix"

This only deepens the mystery. The apply() function cannot accept a simple numeric vector, which is why I cast it with as.matrix() at the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.

Navy Cheng
Milktrader
  • Use `system.time()` instead of `proc.time`, it's better suited for the task. Or better yet, follow some of the examples in this post and get better results by replicating the test multiple times and taking the mean of that: http://stats.stackexchange.com/questions/3235/timing-functions-in-r – Chase Apr 03 '11 at 23:45
  • Thanks for the timing link. Just getting started in benchmarking. – Milktrader Apr 03 '11 at 23:52
  • You should also check `microbenchmark` package for more accurate measures. – aL3xa Apr 04 '11 at 02:12
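The timing advice in these comments can be sketched in base R roughly as follows. This is a minimal sketch rather than a rigorous benchmark: the helper name `time_mean`, the seed, and the repetition count are assumptions made for illustration, and `microbenchmark` would give finer-grained numbers.

```r
# Hypothetical helper: call f() several times and return the
# mean elapsed time in seconds.
time_mean <- function(f, reps = 3) {
  elapsed <- replicate(reps, system.time(f())[["elapsed"]])
  mean(elapsed)
}

set.seed(42)                  # arbitrary seed, for reproducibility only
x <- as.matrix(rnorm(1e5))    # same shape as `million` in the question

loop_time <- time_mean(function() {
  out <- x
  for (i in seq_len(NROW(out))) out[i] <- if (out[i] > 0) 1 else -1
  out
})
apply_time <- time_mean(function() {
  apply(x, 1, function(v) if (v > 0) 1 else -1)
})

c(loop = loop_time, apply = apply_time)
```

Averaging over several runs smooths out one-off effects such as garbage collection, which a single `proc.time()` difference cannot do.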

5 Answers

42

The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.

Lately, answers on stackoverflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.

In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast: `(million > 0) * 2 - 1`.
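To see why that one-liner works, here is a small sketch on a ten-element stand-in for `million` (the seed and the names `signs` and `by_element` are assumptions for illustration). The comparison yields a logical matrix, and arithmetic on it treats TRUE as 1 and FALSE as 0.

```r
set.seed(1)                   # arbitrary seed, for reproducibility only
x <- as.matrix(rnorm(10))     # small stand-in for `million`

signs <- (x > 0) * 2 - 1      # TRUE/FALSE -> 1/0 -> 2/0 -> 1/-1

# Element-by-element version, for comparison
by_element <- x
for (i in seq_len(NROW(x))) by_element[i] <- if (x[i] > 0) 1 else -1

stopifnot(isTRUE(all.equal(signs, by_element)))
dim(signs)                    # the matrix dimensions are preserved
```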

hadley
  • this echoes what I found in R Inferno by Burns, that the apply family of functions are basically R loops and their benefit is not speed. He calls it loop-hiding. – Milktrader Apr 05 '11 at 00:18
  • I would like to point out that this solution (which should be the default one to consider in this and similar cases) is not just very fast, it is ten times faster than `ifelse`, eleven times faster than OP's `mash` (using `for`) and 162 times faster than `apply`'ing the `squish` function from OP. (Timings using `library(microbenchmark)` with `times=100` and the `million` from OP as data.) – lebatsnok Oct 03 '18 at 08:35
  • Sorry to bring this up, but although I agree with you that expressiveness and intention are important, I don't agree with the attitude of waiting for PCs to become faster. I needed my results yesterday; I cannot wait days for something that, written properly and not even optimized, would take no more than a few minutes – Net_Raider Apr 09 '20 at 13:56
12

As Chase said: Use the power of vectorization. You're comparing two bad solutions here.

To clarify why your apply solution is slower:

Within the for loop, you index the matrix as if it were a plain vector, so no type conversion takes place. I'm glossing over the details here, but basically the internal calculation ignores the dimensions: they are just kept as an attribute and returned with the vector that represents the matrix. To illustrate:

> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE

Now, when you use the apply, the matrix is split up internally in 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
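A tiny sketch of that behaviour, using a three-row stand-in for `million` (the values are made up for illustration): the input carries dimensions, but the combined result of `apply` does not, which matches the class difference noted in the question.

```r
m <- as.matrix(c(-1.5, 0.2, 3))   # 3 x 1 matrix, like a tiny `million`
squish <- function(x) if (x > 0) 1 else -1

res <- apply(m, 1, squish)        # squish is called once per row

dim(m)           # the input has dimensions
dim(res)         # the result is a plain vector, so this is NULL
is.matrix(res)   # FALSE
```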

The sapply function also first uses as.vector(unlist(...)) to convert its input to a vector, and at the end tries to simplify the answer into a suitable form. This takes time too, so sapply might also be slower here. On my machine, however, it is not.

IF apply were a solution here (and it isn't), you could compare:

> system.time(loop_million <- mash(million))
   user  system elapsed 
   0.75    0.00    0.75    
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
   user  system elapsed 
   0.25    0.00    0.25 
> system.time(sapply2_million <- matrix(sapply(million,squish)))
   user  system elapsed 
   0.34    0.00    0.34 
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
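As a related sketch (not part of the timings above), `vapply` is a stricter sibling of `sapply`: the caller declares the type and length of each result, so no simplification step has to guess the output form. Whether it is faster on a given machine is something to measure rather than assume; the data here are a made-up stand-in for `million`.

```r
squish <- function(x) if (x > 0) 1 else -1
x <- as.matrix(rnorm(1000))       # small stand-in for `million`

# FUN.VALUE = numeric(1) declares that every call returns one number;
# vapply raises an error if any call returns something else.
v <- vapply(x, squish, FUN.VALUE = numeric(1))
s <- sapply(x, squish)

stopifnot(isTRUE(all.equal(v, s)))
```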
Joris Meys
  • you qualified your comparison with capital letters IF and that point is not lost on me. I need to report though that if I amp up the sample to 10 million, the loop is 2 seconds faster than both sapply tests. Clearly ifelse is best, but loop still appears to beat out built-in apply functions. If I have a different problem that ifelse() doesn't handle, I'm afraid I'm likely to favor the dreaded loop over apply. At least I won't take it on faith that apply will be better and I'll likely test for the best solution. – Milktrader Apr 04 '11 at 00:58
  • @Chase thanks to both for the system.time() and all.equal() tools. – Milktrader Apr 04 '11 at 01:01
  • @milktrader: With very long vectors, it becomes a matter of internal design. This can be seen in the difference in times between my tests and Chase's as well. Now keep in mind there are other reasons to choose apply. Chase gave you the link already in the comments. Also take a look at the difference between apply, sapply, lapply and friends, and the speed-up from using options like USE.NAMES=F and simplify=F in sapply. – Joris Meys Apr 04 '11 at 08:25
7

You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?

> ptm <- proc.time()
> ifelse_million <- ifelse(million > 0,1,-1)
> proc.time() - ptm
   user  system elapsed 
  0.077   0.007   0.093 

> all.equal(ifelse_million, loop_million)
[1] TRUE

And for comparison's sake, here are the two comparable runs using the for loop and sapply:

> ptm <- proc.time()
> apply_million <- sapply(million, squish)
> proc.time() - ptm
   user  system elapsed 
  0.469   0.004   0.474 
> ptm <- proc.time()
> loop_million <- mash(million)
> proc.time() - ptm
   user  system elapsed 
  0.408   0.001   0.417 
Chase
  • sapply usage is clearly superior in this example, but the loop is still faster. Of course, there is no competition when ifelse participates. I may not have my terms correct, but aren't the apply family of functions considered mapping functions, and am I imagining that I read that mapping functions are preferred to for loops in R? – Milktrader Apr 03 '11 at 23:59
  • @Joris can you point out where there is vectorization in @Chase's answer? It's a concept I haven't grasped and comes up a lot. – Milktrader Apr 04 '11 at 00:09
  • @Milktrader: The function ifelse works on a vector using the internal loops in R. This is not the same as the for loop or any of the apply functions. `ifelse()` takes a vector, so there's no need to use an explicit loop function. Hence, ifelse is a vectorized function. – Joris Meys Apr 04 '11 at 00:24
  • @Milktrader - good info on apply vs for loops in R: http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – Chase Apr 04 '11 at 03:28
5

It is far faster in this case to do index-based replacement than either the ifelse(), the *apply() family, or the loop:

> million  <- million2 <- as.matrix(rnorm(100000))
> system.time(million3 <- ifelse(million > 0, 1, -1))
   user  system elapsed 
  0.046   0.000   0.044 
> system.time({million2[(want <- million2 > 0)] <- 1; million2[!want] <- -1}) 
   user  system elapsed 
  0.006   0.000   0.007 
> all.equal(million2, million3)
[1] TRUE

It is well worth having all these tools at your fingertips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.

Gavin Simpson
  • Or more succinctly, and even faster, `(million > 0) * 2 - 1`. – hadley Apr 04 '11 at 12:13
  • thanks for the comparison. I'm understanding that ifelse() and indexing are vectorizations, or using C to run the loops. All vector operations use loops, but if you pass the work to C, you get it done faster. The explicit loop and apply family of functions are similar because they run the loop from within R. – Milktrader Apr 05 '11 at 00:21
3

A better example of the speed advantage of a for loop:

for_loop <- function(x){
    # preallocate the result, then fill it one row at a time
    out <- vector(mode="numeric",length=NROW(x))
    for(i in seq_along(out)) {
        out[i] <- max(x[i,])
    }
    return(out)
}

apply_loop <- function(x){
    apply(x,1,max)
}

> million <- matrix(rnorm(1000000),ncol=10)
> system.time(apply_loop(million))
  user  system elapsed 
  0.57    0.00    0.56 
> system.time(for_loop(million))
  user  system elapsed 
  0.32    0.00    0.33 

EDIT

Version suggested by Eduardo.

max_col <- function(x){
    x[cbind(seq(NROW(x)),max.col(x))]
}
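
The indexing trick inside `max_col` can be unpacked on a small made-up matrix. This is a sketch for illustration only; the names `cols`, `idx`, and `row_max` are not from the answer. Indexing a matrix with a two-column matrix of (row, column) pairs picks one element per pair.

```r
x <- matrix(c(1, 9, 3,
              7, 2, 5), nrow = 2, byrow = TRUE)

cols <- max.col(x)                      # column index of each row's maximum
idx  <- cbind(seq_len(nrow(x)), cols)   # one (row, col) pair per row
row_max <- x[idx]                       # picks x[1, 2] and x[2, 1]

row_max                                 # 9 7
stopifnot(isTRUE(all.equal(row_max, apply(x, 1, max))))
```

Note that `max.col` breaks ties at random by default; that changes which column is chosen but not the maximum value that is returned.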

By row

> system.time(for_loop(million))
   user  system elapsed 
   0.99    0.00    1.11 
> system.time(apply_loop(million))
  user  system elapsed 
   1.40    0.00    1.44 
> system.time(max_col(million))
  user  system elapsed 
  0.06    0.00    0.06 

By column

> system.time(for_loop(t(million)))
  user  system elapsed 
  0.05    0.00    0.05 
> system.time(apply_loop(t(million)))
  user  system elapsed 
  0.07    0.00    0.07 
> system.time(max_col(t(million)))
  user  system elapsed 
  0.04    0.00    0.06 
Wojciech Sobala