5

Good morning,

I have been developing in R for a few months, and I have to make sure that the execution time of my code is not too long because I analyze big datasets.

Hence, I have been trying to use vectorized functions as much as possible.

However, I am still wondering something.

What is costly in R is not the loop itself, right? I mean, the problem arises when you start modifying variables within the loop, for example; is that correct?

Hence I was thinking: what if you simply have to run a function on each element and you do not actually care about the result, for example to write data to a database? What should you do?

1) use mapply without storing the result anywhere?

2) write a for loop over the vector and just apply f(i) to each element?

3) is there a better function I might have missed?

(That's of course assuming your function is not optimally vectorized; a rough sketch of both options follows below.)
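To make options 1 and 2 concrete, here is roughly what I have in mind (write_row is just a made-up placeholder for the real database call):

# write_row stands in for the real "insert into the database" function
write_row <- function(x) invisible(x)

v <- 1:100

# option 1: mapply, without keeping the result
invisible(mapply(write_row, v))

# option 2: a plain for loop, calling the function only for its side effect
for (i in v) write_row(i)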

What about the foreach package? Have you experienced any performance improvement by using it?
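And this is the kind of foreach usage I am wondering about (sequential %do% shown here; %dopar% would additionally need a parallel backend such as doParallel registered):

library(foreach)

# same job as option 2 above, expressed with foreach;
# foreach returns a list of results, which we discard here
invisible(foreach(i = v) %do% write_row(i))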

SRKX
  • I'll leave the answer to someone who's more expert than me, but in my practical experience the *apply functions usually (but not always) speed things up quite a bit. – nico Jun 28 '10 at 06:37
  • I guess so, because the loop is done "in C" and not directly through R. – SRKX Jun 28 '10 at 08:16
  • See this SO post on the apply family - http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – csgillespie Jun 28 '10 at 09:28
  • @Colin: thanks for the link, very interesting indeed. – SRKX Jun 28 '10 at 09:57
  • That SO post is a terrible example, Colin, because it shows almost nothing about loop speed. All the time is taken by the recursive function. The only thing one should take from it is that if your function takes a really long time, it doesn't matter which family you use. The example here by nullglob is much better. – John Jun 28 '10 at 15:26
  • @John - The question is about vectorising operations and using the apply family in R. The link I posted concerns using the apply family and vectorising operations. Of course it doesn't answer the OP's question entirely - that's why I posted it as a comment and didn't use the phrase 'duplicate'. I do agree that nullglob's example is very good though ;) – csgillespie Jun 29 '10 at 08:55

2 Answers

6

Just a couple of comments. A for loop is roughly as fast as apply and its variants; the real speed-ups come when you vectorise your function as much as possible, i.e. when the looping happens in compiled low-level code rather than in R (apply and friends just hide an R-level for loop). I'm not sure if this is the best example, but consider the following:

> n <- 1e06
> sinI <- rep(NA,n)
> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
  3.316   0.000   3.358 
> system.time(sinI <- sapply(1:n,sin))
   user  system elapsed 
  5.217   0.016   5.311 
> system.time(sinI <- unlist(lapply(1:n,sin),
+       recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.284   0.012   1.303 
> system.time(sinI <- sin(1:n))
   user  system elapsed 
  0.056   0.000   0.057 

In one of the comments below, Marek points out that the time-consuming part of the for loop above is actually the `[<-` part, i.e. the repeated assignment into sinI, rather than the calls to sin themselves.

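A quick way to see this, following Marek's suggestion, is to time the same loop with and without the assignment (timings omitted; they vary from machine to machine):

# same loop as above, with and without the `[<-` assignment into sinI
system.time(for (i in 1:n) sinI[i] <- sin(i))   # loop plus assignment
system.time(for (i in 1:n) sin(i))              # loop only, results discarded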

The bottlenecks which can't immediately be vectorised can be rewritten in C or Fortran, compiled with R CMD SHLIB, and then plugged in with .Call, .C or .Fortran.
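For illustration, here is a minimal sketch of the .C route; the file name vec_sin.c and the routine name vec_sin are invented for this example, with the C source shown in the comments:

# Contents of a hypothetical vec_sin.c:
#   #include <R.h>
#   #include <math.h>
#   void vec_sin(double *x, int *n, double *out) {
#       for (int i = 0; i < *n; i++) out[i] = sin(x[i]);
#   }
# Compile it from the shell with:  R CMD SHLIB vec_sin.c
dyn.load("vec_sin.so")          # "vec_sin.dll" on Windows
x <- as.double(1:n)
sinI <- .C("vec_sin", x = x, n = as.integer(length(x)),
           out = double(length(x)))$out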

Also, see the article "How Can I Avoid This Loop or Make It Faster?" in R News for more about loop optimisation in R.

nullglob
  • Isn't apply still handling the loop better thanks to its C implementation? The question is in fact general: in your opinion, is it better to use Reduce than to implement a simple loop (for example)? – SRKX Jun 28 '10 at 09:55
  • In the `sapply` version most of the time is spent on post-processing the results. If you do `system.time(sinI <- unlist(lapply(1:n,sin),FALSE,FALSE))` you should get the fastest version (apart from `sin(1:n)`, of course). In the `for` loop the time-consuming part is `[<-`; check `system.time(for(i in 1:n) sin(i))` (in this case it is useless because it drops the results). – Marek Jun 28 '10 at 10:01
4

vapply avoids the post-processing by requiring that you specify what each return value looks like (numeric(1) here). It turns out to be about 3.4 times faster than the for loop:

> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
   2.41    0.00    2.39 

> system.time(sinI <- unlist(lapply(1:n,sin), recursive = FALSE, use.names = FALSE))
   user  system elapsed 
   1.46    0.00    1.45 

> system.time(sinI <- vapply(1:n,sin, numeric(1)))
   user  system elapsed 
   0.71    0.00    0.69 
Tommy