
I have to do extensive data manipulation on a big data set (using data.table, mostly in RStudio). I would like to monitor the run time of each of my steps without explicitly calling system.time() on every step.

Is there a package or an easy way to show run time by default for each step?

Thank you.

Gavin Simpson
AdamNYC
  • You can see this: http://stackoverflow.com/questions/8439957/multi-function-tester-alternative-to-system-time – agstudy Dec 01 '12 at 17:54
  • Thanks agstudy. I am aware of rbenchmark, but that is not what I need. I would like a "side effect" that produces the run time for each R function call. I am not sure if it is possible, though. – AdamNYC Dec 01 '12 at 17:55
  • This is a great question, even more so because I've found that even if you use `system.time` on each step of a big project with many `source` calls and functions, the `system.time` requests within the sourced files and inner functions are ignored (perhaps there is some way to work around that, e.g. with `print`, but I haven't confirmed it yet; this is part of my pre-question research for a potential question on this exact topic). – Hack-R Aug 19 '14 at 14:22

2 Answers


It's not exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments that record how long each top-level statement took to run.

i.e. time_file() turns this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")

into this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#:    user  system elapsed
#:   0.451   0.003   0.453

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#:    user  system elapsed
#:   0.029   0.000   0.029

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#:    user  system elapsed
#:   0.008   0.000   0.008

It doesn't time code inside a top-level { block, so you can choose not to time stuff you're not interested in.

I don't think there's any way to automatically add timing as a top-level effect without somehow modifying the way that you run the code - i.e. using something like time_file instead of source. A minimal sketch of that idea follows.
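
If you just want a rough idea of the mechanics, here is a minimal sketch of the same approach (not the gist itself): parse a script, evaluate each top-level expression, and report how long each one took. It simply prints the timings instead of rewriting the file; the helper name time_script() and the file name "analysis.R" are made up for illustration.

time_script <- function(path, envir = globalenv()) {
  exprs <- parse(path)  # every top-level expression in the file
  for (e in exprs) {
    timing <- system.time(eval(e, envir = envir), gcFirst = FALSE)
    cat("##", deparse(e)[1], "\n")  # echo (the first line of) the expression
    print(timing)
  }
  invisible(NULL)
}

# time_script("analysis.R")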

You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)

library(microbenchmark)
microbenchmark(
  runif(1e4),                          # baseline: the expression on its own
  system.time(runif(1e4)),             # timed, running gc() first (the default)
  system.time(runif(1e4), gc = FALSE)  # timed, skipping the initial gc() (`gc` matches gcFirst)
)

Timing itself adds relatively little overhead (20µs on my computer), but the default gc adds about 27 ms per call, so unless you have thousands of top-level calls you're unlikely to see much impact.

hadley
  • I wondered if checking time with _every_ function call would slow the whole endeavor to a crawl. – IRTFM Dec 01 '12 at 18:07
  • @DWin microbenchmarking added - about 20 ms per top-level function call, which is unlikely to have an appreciable impact on most scripts. – hadley Dec 01 '12 at 18:14
  • @Hadley does R have an equivalent to Python decorators? I ask because this feature could improve the clarity of time profiling. – agstudy Dec 01 '12 at 18:47
  • Thanks a lot, hadley. Although I was looking for something that can run interactively (i.e., passing multiple statements directly to time_file() instead of an R script file), your solution works wonderfully. – AdamNYC Dec 01 '12 at 19:03
  • Seems designed to take a file argument, so could there be a "text=" option, as was added to the read.* and scan functions? (I'm getting partial success by wrapping that in `textConnection` to simulate file input, but get an error about a missing `load_all` function despite loading stringr and plyr.) – IRTFM Dec 01 '12 at 20:12
  • @agstudy decorators not needed since you can modify function objects directly – hadley Dec 02 '12 at 02:26
  • @AdamNYC you could probably build something on top of [evaluate](https://github.com/hadley/evaluate) if you wanted something interactive – hadley Dec 02 '12 at 02:27
  • @DWin `load_all` is from devtools, but the example won't work for you unless you happen to have the development versions of my packages installed in exactly the same place I do ;) – hadley Dec 02 '12 at 02:27
  • @hadley Pray yield me enough credit to realize that my library path is different than your'n. I didn't have the development versions, but I did get a sufficient level of success to think that there might be merit in adding a 'text=' argument. – IRTFM Dec 02 '12 at 05:36

I have to give full credit to @jbecker of Freenode's #R IRC channel for this extra answer, but for me the solution is here: http://adv-r.had.co.nz/Profiling.html

Here's just a little taste of it:

"To understand performance, you use a profiler. There are a number of different types of profilers. R uses a fairly simple type called a sampling or statistical profiler. A sampling profiler stops the execution of code every few milliseconds and records which function is currently executing (along with which function called that function, and so on). For example, consider f(), below:"

library(lineprof)
f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}
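
To actually collect a profile of f(), the chapter then runs it under lineprof() and prints the per-line timings. A hedged usage sketch follows: the file name "profiling-example.R" is hypothetical, and lineprof() only has line references to report when the profiled code is source()d from a file with source references kept.

library(lineprof)
source("profiling-example.R", keep.source = TRUE)  # a file containing f(), g(), h() above
l <- lineprof(f())
l          # time spent on each line of f() and the functions it calls
# shine(l) # optionally explore the profile interactively in the browser
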
Hack-R