2

I've inherited R some code and it runs incredibly slowly. Most of the time is spent evaluating the functions of the form (there are about 15 such functions with different integrands G):

TMin <- 0.5

F <- function (t, d) {
    result <- ifelse(((d > 0) & (t > TMin)),
                     mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d),
                     0)

    return(result)

}

For testing, I'm using the following dummy function, but in the real code the Gs are much more complicated involving exp(), log(), dlnorm(), plnorm() etc..

G <- function(x, t, d) {
    mean(rnorm(1e5))
    x + t - d
}   

F will be calculated around 2 million times in the worst case. The function gets called in 3 different ways, either:
t is a single number and d is a numeric vector or,
t is a numeric vector and d is a single number or,
t is a numeric vector and is a numeric vector

Is there a (simple) way to speed up this function?

For far I've tried variations along the lines of (to get rid of the ifelse loop):

F2 <- function (t,d) {
    TempRes <- mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d)
    TempRes[(d <= 0) | (t <= TMin)] <- 0
    result <- TempRes

    return(result)
}

and

F3 <- function (t,d) {
    result <- rep(0, max(length(t),length(d)))
    test <- ((d > 0) & (t > TMin))
    result[test] <- mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d)[test]

    return(result)
}

but they take almost exactly the same time.

mawir
  • 177
  • 2
  • 10
  • Are you sure that there are no closed-formed solutions for the integrals? Because there you have by far the best potential for improved performance. If your math skills are rusty, you could ask a CAS. – Roland Jun 20 '15 at 09:50
  • 1
    And, as always, if you are not happy with performance, profile your code. – Roland Jun 20 '15 at 09:51
  • I would be very surprised if there were a closed but, you're right, that's worth having a look at. – mawir Jun 20 '15 at 16:18
  • It may be a bad idea, but you could try merging your 15 functions into one vector-value function, and use `adaptIntegrate` from the cubature package. It's slower than R's integrate in one dimension, but has the advantage of dealing with vector-valued functions. And it [can be easily interfaced in C++ code](https://github.com/baptiste/cubature/blob/master/minimal.c), if you later want to write your integrands in a faster language. – baptiste Jun 20 '15 at 21:59
  • `result = …; return(result)` really doesn’t make sense in R: the last expression of a function is automatically the function’s result. No need to assign it to a variable, no need for `return`. – Konrad Rudolph Jun 20 '15 at 22:17

2 Answers2

2

You are performing a large number of independent integrations. You can speed things up by performing these integrations on separate cores simultaneously (if you have a multicore processor available). The problem is that R performs its calculations in a single threaded manner by default. However, there are a number packages available that allow multithreading support. I have recently answered a few similar questions here and here, with some additional info regarding the relevant packages and functions.

Additionally, As @Mike Dunlavey already mentioned, you should avoid performing the integrations for values of t and d that do not match your criteria. (You are currently performing unneeded function evaluations for these values and then you overwrite the result with 0 afterwards).

I have added a possible improvement below. Note that you will have to create a separate file with your function G included in order to evaluate it on the cluster nodes. In the code below it is assumed that this file is called functionG.R

The snippet:

library(doParallel)
F4 <- function(t,d) {
  results = vector(mode="numeric",max(length=length(t),length(d))) # Zero vector

  logicalVector <- ((d > 0) & (t > TMin))
  relevantT <- t[logicalVector]
  relevantD <- d[logicalVector] # when d is single element, NA values created

  if(length(relevantT) > 1 | length(relevantD) > 1)
  {
    if(length(d)==1) # d is only one element instead of vector --> replicate it
      relevantD <- rep(d,length(relevantT))
    if(length(t)==1) # t is only one element instead of vector --> replicate it
      relevantT <- rep(t,length(relevantD))

    cl <- makeCluster(detectCores()); 
    registerDoParallel(cl)
    clusterEvalQ(cl,eval(parse("functionG.R")))

    integrationResults <- foreach(i=1:length(relevantT),.combine="c") %dopar%
    {
      integrate(G,lower=0,upper=relevantT[i],relevantT[i],relevantD[i])$value;
    }
    stopCluster(cl)
    results[logicalVector] <- integrationResults
  }
  else if(length(relevantT==1)) # Cluster overhead not needd
  {
    results[logicalVector] = integrate(G,lower=0,upper=relevantT,relevantT,relevantD)$value;
  }

  return(results)
}

My CPU contains 6 physical cores with hyperthreading enabled (x2). These are the results:

> t = -5000:20000
> d = -5000:20000
> 
> start = Sys.time()
> testF3 = F3(t,d)
> timeNeededF3 = Sys.time()-start
> 
> start = Sys.time()
> testF4 = F4(t,d)
> timeNeededF4 = Sys.time()-start;

> timeNeededF3
Time difference of 3.452825 mins
> timeNeededF4
Time difference of 29.52558 secs
> identical(testF3,testF4)
[1] TRUE

It seems that the cores are constantly in use while running this code. However, you can potentially optimize this code further by presplitting the data more efficiently around the cores and then subsequently utilize an apply type functions on the separate cores.

If more optimization is required you could also take a deeper look at the integrate function. You can potentially play around with the settings and obtain a performance gain by allowing a less strict numerical approximation. As an alternative you could implement your own simple version of adaptive Simpson quadrature and play around with the discrete stepsizes. Most likely you could obtain massive performance increases like this (if you are able/willing to allow more error in the approximation).

EDIT: Updated code in order for it to work in all scenario's: d and/or t valid/invalid numbers or vectors

REPLY TO COMMENTS @mawir: you are correct. ifelse(test, yes, no) will return corresponding yes values for the rows of which test evaluates to TRUE, it will return the respective no values for the rows for which test evaluate to FALSE. However, it WILL first have to evaluate your yes expression in order to create the yes vector of length(test). This piece of code demonstrates this:

> t = -5000:5
> d = -5000:5
> 
> start = Sys.time()
> testF1 = F(t,d)
> timeNeededF1 = Sys.time()-start
> timeNeededF1
Time difference of 43.31346 secs
> 
> start = Sys.time()
> testF4 = F4(t,d)
> timeNeededF4 = Sys.time()-start
> timeNeededF4
Time difference of 2.284134 secs

Only the last 5 values of t and d are relevant in this scenario. However, inside the F1 function the ifelse evaluates the mapply on all d and t values first in order to create the yes vector. This is why the function execution takes so long. Next, it selects the elements for which the conditions are met, or 0 otherwise. The F4 function works around this issue.

Futhermore, you say that you obtain speedup in the scenario where t and d are non-vectors. However, in this scenario no parallelisation is used. You should normally obtain the maximum speedup in the senario's where one or both of t/d are vectors.

ANOTHER EDIT, in response to Roland's comment: You could potentially replace clusterEvalQ(cl,eval(parse("functionG.R"))) with clusterExport(cl,"G") if you prefer not to create a separate function file(s).

Community
  • 1
  • 1
Jellen Vermeir
  • 1,720
  • 10
  • 10
  • You could avoid `clusterEvalQ` by defining `G` inside the loop (with negligible performance cost). And I believe you could export it to the cores with the `.export` parameter (if that doesn't work automatically). – Roland Jun 21 '15 at 09:39
  • @Roland is their somewhere I can look up exactly what ifelse(test, yes, no) is doing (other than the source code)? I was under the impression it was effectively a encapsulated for loop, iterating through test and at each entry evaluating the relevant yes or no. But in elsewhere I've seen it stated that it fully evaluates both the yes and no expressions, and then iterates through selecting the relevant value, which is obviously much slower. Moreover F, F2 and F3 all take approximately the same time. Is this behaviour implementation dependant (I'm currently testing RGui on 64-bit windows)? – mawir Jun 21 '15 at 10:14
  • @Jellen Vermeir I've just tested your code and it is significantly faster in the cases where either t or d are single values (even without parallel calculation), if there's no objections from any one else, i'll accept you answer. – mawir Jun 21 '15 at 10:30
  • 1
    @mawir |Roland I have added a reply in regard to the ifelse issue in my answer below. – Jellen Vermeir Jun 21 '15 at 10:56
  • @Roland: You are correct about the evaluation of the function inside the for loop. However, I personally think it's cleaner and looks less cluttered to define the functions in a separate file, but this is a subjective opinion. Cluster export is used to export variables to the global environments of the subnodes. I just checked and you are correct. Apparantly it's possible to export functions as well. – Jellen Vermeir Jun 21 '15 at 11:06
0

As a generality, the place to look is in the innermost loop, and you can speed it up either by making it take less time or by calling it fewer times. You have an inner loop running mapply, but then you extract element [test] from it. Does that mean all the other elements are discarded? If so, why bother spending time to calculate the extra elements?

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • 1
    I don't understand this. It's at best a comment anyway. – Roland Jun 20 '15 at 15:46
  • I think Mike Dunlavey's point is that my attempted solutions must take at least as long as the original function, since they are in theory doing exactly the same calculation and then spending extra time filtering the result. I tried them because I don't know what ifelse() is doing 'under the hood' (and how much time it is taking to do it). – mawir Jun 20 '15 at 16:17
  • I know what `ifelse` is doing "under the hood" and it's not your bottleneck. You'd see this if you profiled your code. – Roland Jun 20 '15 at 16:39
  • @Roland: I was referring to the OP's function `F3`, which does a `mapply` followed by `[test]`. That looks like it's building up a list/array, taking one element, and throwing the rest away (along with the effort spent to create it). Or am I missing something? I do agree that this calls for some profiling - I use rprofile, and then just look at the stack traces it generates. – Mike Dunlavey Jun 20 '15 at 17:42