1

I just got a gig to help speed up a program in R by improving the efficiency of the algorithms it uses to calculate data. There are many loops that do different calculations, and I'm wondering which of them use the most resources. How can I measure the time it takes for a loop to finish completely? I can use that information to figure out which algorithms to optimize, or even to write a C extension that will handle the calculations.

GargantuChet
user1876508

3 Answers

2

You can use:

  • Sys.time() or system.time()
  • The rbenchmark package
  • The microbenchmark package
  • Or a profiler (e.g. ?Rprof)
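
As a rough illustration of the first option, here is a minimal base-R sketch that times a loop two ways. The `slow_loop()` function is a hypothetical stand-in for whichever loop you want to measure; it is not from the original question.

```r
# Hypothetical stand-in for one of the loops to be timed.
slow_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + sqrt(i)
  }
  total
}

# Option A: system.time() evaluates the expression and reports
# user, system and elapsed time in seconds.
timing <- system.time(result <- slow_loop(1e6))
print(timing[["elapsed"]])

# Option B: take wall-clock timestamps before and after the loop.
beg <- Sys.time()
result2 <- slow_loop(1e6)
elapsed <- Sys.time() - beg  # a difftime object
print(elapsed)
```

A single run can be noisy, so for short loops it may be worth repeating the measurement, which is what the rbenchmark and microbenchmark packages automate.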
GSee
Ari B. Friedman
  • Do you list them in order? – agstudy Dec 04 '12 at 18:46
  • For the first bullet point, you can use `beg <- Sys.time(); { MyCode() }; Sys.time() - beg` _or_ `system.time({ MyCode() })` which is probably preferable. – GSee Dec 04 '12 at 18:59
2

I use Rprof to tell where to look. It generates a file of stack samples, and I just look at a small number of those, like 10, chosen randomly. Or I just make the time between samples large enough so I don't get too many samples to begin with.

There are 2 reasons this works.

1) By actually examining individual stack samples, with your own eyes, you can see problems that simple statistics don't expose, because by looking at the stack, you can see the reasons why things are being done. That tells you if you could get rid of it, and that's the essential information.

2) If you see such a pattern of activity that you could improve, you only have to see it on more than one sample to know it's worth fixing. All the extra samples, if they mean you cannot do (1), are actually detrimental.
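
A minimal sketch of this workflow, assuming a hypothetical `slow_step()` function as the code under study and an arbitrarily chosen sampling interval of 0.05 seconds. It writes stack samples with `Rprof`, then reads the raw file back and picks up to 10 samples at random for manual inspection.

```r
# Hypothetical slow function: grows a vector one element at a time.
slow_step <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) {
    x <- c(x, sqrt(i))
  }
  x
}

prof_file <- tempfile(fileext = ".out")

# Start sampling the call stack; a larger interval gives fewer samples.
Rprof(prof_file, interval = 0.05)
for (k in 1:10) {
  slow_step(1e4)
}
Rprof(NULL)  # stop profiling

# The first line of the file is a header; each remaining line is one stack sample.
samples <- readLines(prof_file)[-1]

# Shuffle the samples and keep at most 10 to read by eye.
picked <- head(sample(samples), 10)
print(picked)
```

Each printed line lists the functions that were on the stack at that moment, innermost call first, so you can read why a given piece of work was being done.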

GSee
Mike Dunlavey
1

Here is an example of using `benchmark` (from the rbenchmark package), taken from another SO question that compared tapply vs. by vs. data.table. Edited as per the comments.

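library(rbenchmark)
library(data.table)  # provides the dt[...] syntax and key()

# x (a data frame with a grouping column "f") and dt (a keyed data.table
# with a column col1) are assumed to be defined already.
benchmark( # Different tests being compared
           using.tapply = tapply(x[, 1], x[, "f"], mean),
           using.by     = by(x[, 1], x[, "f"], mean),
           using.dtable = dt[, mean(col1), by = key(dt)],

           # Number of replications, and how the results are ordered
           replications = 250, order = "relative"
          )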

#------------------------#
#         RESULTS        # 
#------------------------#


#   COMPARING data.table VS tapply VS by   #
#------------------------------------------#
#             test elapsed relative
#   2  using.dtable   0.168    1.000
#   1  using.tapply   2.396   14.262
#   3      using.by   8.566   50.988
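
The benchmark above does not show how `x` and `dt` were built. The sketch below is one possible setup, purely an assumption for illustration: the column names `col1` and `f` follow the benchmark call, while the data, sizes and seed are made up. It also checks that the approaches agree before any timing is done.

```r
set.seed(42)
n <- 1e4

# Assumed input: one numeric column and one grouping factor.
x <- data.frame(
  col1 = rnorm(n),
  f    = factor(sample(letters[1:5], n, replace = TRUE))
)

by_tapply <- tapply(x[, 1], x[, "f"], mean)
by_by     <- by(x[, 1], x[, "f"], mean)

# Sanity check: both base-R approaches should give the same group means.
print(all.equal(as.vector(by_tapply), as.vector(by_by)))

# The data.table variant only runs if the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::data.table(x, key = "f")
  by_dt <- dt[, mean(col1), by = "f"]
  print(all.equal(as.vector(by_tapply), by_dt$V1))
}
```

Confirming that the candidates return the same answer first avoids benchmarking code that is fast only because it computes something different.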
Community
Ricardo Saporta
  • what's with all the `expression`s? – GSee Dec 04 '12 at 18:44
  • @GSee I like to save them as expressions so that I can easily change the input once across all my tests. (Also, I find it gives a nicer output in the `test` column.) – Ricardo Saporta Dec 04 '12 at 18:47
  • `benchmark(using.tapply=tapply(x[, 1], x[, "f"], mean), using.by=by(x[, 1], x[, "f"], mean), using.dtable=dt[, mean(col1),by=key(dt)], replications=10, order='relative')` ... whatever `x` and `dt` are – GSee Dec 04 '12 at 18:49
  • Just name the `...` arguments: `benchmark(using.tapply=tapply(x[, 1], x[, "f"], mean))`. – Joshua Ulrich Dec 04 '12 at 18:50
  • Yes, of course. But if I have 10~15 tests I am comparing, taking the actual code in and out is a lot more cumbersome than using a variable. I find that wrapping them in `expression` moves more quickly. Is there any downside to using `expression`? (i.e., any risk of getting the wrong results?) – Ricardo Saporta Dec 04 '12 at 18:52
  • 2
    Put each test on its own line. Then you can easily add/remove them. The biggest problem I see with using `expression` is that you're probably going to confuse others who look at your code. – Joshua Ulrich Dec 04 '12 at 18:59
  • Plus `expression` isn't even the right function to use here - `quote` would be more appropriate. – hadley Dec 04 '12 at 21:03
  • @hadley: can you elaborate on why that is? – Ricardo Saporta Dec 04 '12 at 21:05
  • @RicardoSaporta `expression` produces (basically) a list of calls. Expression lists are rarely needed outside of special cases, such as sourcing a file, and you're best off using the simplest quoted call, as produced by `quote`. – hadley Dec 04 '12 at 21:12
  • @hadley, please correct me if I'm wrong, but it appears that `expression` is indeed the way to go: http://stackoverflow.com/questions/13713116/benchmarking-using-expression-quote-or-neither – Ricardo Saporta Dec 04 '12 at 22:32
  • @RicardoSaporta that sounds like a bug with rbenchmark to me. – hadley Dec 05 '12 at 13:44
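
To make the `expression` versus `quote` distinction from the comments above concrete, here is a small base-R sketch. The variable names are hypothetical, and it only shows what each function returns; it says nothing about how rbenchmark treats either one.

```r
# expression() returns an expression vector (a list-like container of calls).
as_expression <- expression(sum(1:10))

# quote() returns the single unevaluated call itself.
as_call <- quote(sum(1:10))

print(class(as_expression))
print(class(as_call))

# The call stored inside the expression vector is reached with [[1]].
print(is.call(as_expression[[1]]))

# Both evaluate to the same value.
print(eval(as_expression[[1]]))
print(eval(as_call))
```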