A minimal example of what I'm trying to do:

dX_i <- rnorm(100, 0, 0.0002540362)

p_vec <- seq(0, 1, 0.25)  
gamma_vec <- seq(1, 2, 0.25)     
a_vec <- seq(2, 6, 1)
sigma_hat_vec <- c(0.03201636, 0.05771143, 0.07932116, 0.12262327, 0.15074560)
delta_j_vec <- c(0.0000005850109, 0.0000011700217, 0.0000017550326, 0.0000035100651, 0.0000052650977)

parameters <- expand.grid("p" = p_vec, "gamma" = gamma_vec, "a" = a_vec, "sigma_hat" = sigma_hat_vec, "delta_j" = delta_j_vec)


result <- sapply(1:nrow(parameters), function(x) {
  tmp <- parameters[x,]
  p <- tmp$p
  a <- tmp$a
  gamma <- tmp$gamma
  sigma_hat <- tmp$sigma_hat
  delta_j <- tmp$delta_j

  B <- sum( (abs(dX_i)^p) * ( abs(dX_i) < gamma * a * sigma_hat * delta_j^(1/2) ))

  return(B)
})

Goal: I need to calculate B on the vector dX_i for every combination of p, a, gamma, sigma_hat and delta_j.

However, in reality the grid parameters has ~600k rows and dX_i has length ~80k. Moreover, I have a list of ~1000 such dX_i vectors, so I want to make this calculation as efficient as possible. Other approaches, e.g. converting parameters to a data.table and running sapply within that data.table, did not seem to give a speedup.
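
A sketch of what looping the same calculation over such a list could look like (dX_list is only a placeholder name for that list of series, not an object defined above):

# dX_list is a hypothetical placeholder for the list of ~1000 dX_i vectors
results_per_series <- lapply(dX_list, function(dX_i) {
  sapply(1:nrow(parameters), function(x) {
    tmp <- parameters[x, ]
    sum(abs(dX_i)^tmp$p *
          (abs(dX_i) < tmp$gamma * tmp$a * tmp$sigma_hat * sqrt(tmp$delta_j)))
  })
})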

I tried parallelizing the function (I am limited to running the script on a virtual Windows machine):

library(parallel)

numCores <- detectCores() - 1  # e.g. all but one available core
cl <- makePSOCKcluster(numCores)
num.iter <- 1:nrow(parameters)
result_par <- parSapply(cl, num.iter, function(x, parameters, dX_i) {
  tmp <- parameters[x,]
  p <- tmp$p
  a <- tmp$a
  gamma <- tmp$gamma
  sigma_hat <- tmp$sigma_hat
  delta_j <- tmp$delta_j
  sum( (abs(dX_i)^p) * ( abs(dX_i) < gamma * a * sigma_hat * delta_j^(1/2) ))
}, parameters, dX_i)
stopCluster(cl)

While this gave me a speedup, I still feel like I'm not really solving this problem in the most efficient way and would appreciate any suggestions.

zonfl
  • Maybe use Bayesian search? – Bruno Jan 06 '20 at 19:22
  • Just curious, how fast is it currently and how fast would be "fast enough"? – Jon Spring Jan 06 '20 at 19:28
  • Do you have good reasons to compute that many terms? –  Jan 06 '20 at 19:41
  • Do you really need *every* combination? What is your goal? If you're looking for a minimum or a maximum value, consider using an optimizer instead, which will be able to implement a smarter/more efficient search pattern than a grid search. e.g., `optim` or the `optimx` package. – Gregor Thomas Jan 06 '20 at 19:52
  • Also, don't iterate over rows of a data frame. It probably won't help much relative to everything else, but switching to a matrix will speed up the row extraction. – Gregor Thomas Jan 06 '20 at 19:56
  • "Fast enough" is just an arbitrary term to describe a situation where I want to be able to perform the calculations for my whole dataset at least overnight, but preferably faster. Right now, the whole script takes a full day to execute, but my dataset keeps growing. I am aware that there's a lot of redundancy, I am willing to tolerate it if I am able to understand and explain the function. I am completely rewriting a chunk of code that performs above calculation, but is quite slow on large datasets. Hence my goal was just to speed up the existing code – zonfl Jan 06 '20 at 22:54
  • @YalDan That's a good start at answering Jon's question, but I think a big part of the clarification he's asking for is how much faster you need it to be? Is a 2x speed-up good? 10x? 100x? – Gregor Thomas Jan 09 '20 at 19:49
  • @gregor-reinstate-monica I see. The speed-up through the edited solution was about 35x. I'd say that's good because it allows me to execute the whole loop in ~30-40 mins instead of a full day. If I could get that another 10-30x faster that would be great of course. I think the bottleneck lies in the amount of iterations and the length of `dX_i` as you already mentioned, but unfortunately I need every combination since this is a statistical test I'm performing on a huge set of time series. – zonfl Jan 09 '20 at 21:30

3 Answers


@josliber's answer is very good. Yet, it makes it look like R is bad... and you have to switch to C++ for performance.

There are three tricks that are implemented in their answer:

  • precompute the vector of thresholds
  • precompute the absolute values of dX_i
  • sort these values to stop the sum early

The first two tricks use a standard R technique called "vectorization": do your operations (e.g. gamma * a * sigma_hat * delta_j^(1/2) or abs()) on whole vectors at once instead of on a single element within a loop.

This is exactly what you do when using sum( dX_i^p * vec_boolean ); both * and sum are vectorized, so it should be very fast.
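
As a small illustration of the difference, here is a sketch that computes just the thresholds both ways (thresh_loop and thresh_vec are names used only for this example):

# Loop version: compute the threshold row by row
thresh_loop <- numeric(nrow(parameters))
for (i in seq_len(nrow(parameters))) {
  thresh_loop[i] <- parameters$gamma[i] * parameters$a[i] *
    parameters$sigma_hat[i] * sqrt(parameters$delta_j[i])
}

# Vectorized version: one expression over whole columns
thresh_vec <- with(parameters, gamma * a * sigma_hat * sqrt(delta_j))
all.equal(thresh_loop, thresh_vec)  # TRUE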

If we implement just these two tricks (we can't really do the third one the same way because it breaks vectorization), it gives:

abs_dX_i <- abs(dX_i)  # precompute |dX_i| once
thresh <- with(parameters, gamma * a * sigma_hat * sqrt(delta_j))  # one threshold per parameter row
p <- parameters$p
result3 <- sapply(1:nrow(parameters), function(i) {
  in_sum <- (abs_dX_i < thresh[i])
  sum(abs_dX_i[in_sum]^p[i])
})
all.equal(result, result3) # TRUE

If we benchmark all three solutions (OP: your original code, RCPP: the proc() function from @josliber's answer, R_VEC: the vectorized R code above):

microbenchmark::microbenchmark(
  OP = {
    result <- sapply(1:nrow(parameters), function(x) {
      tmp <- parameters[x,]
      p <- tmp$p
      a <- tmp$a
      gamma <- tmp$gamma
      sigma_hat <- tmp$sigma_hat
      delta_j <- tmp$delta_j

      B <- sum( (abs(dX_i)^p) * ( abs(dX_i) < gamma * a * sigma_hat * delta_j^(1/2) ))

      return(B)
    })
  },
  RCPP = {
    result2 <- proc(sort(abs(dX_i)), parameters$gamma * parameters$a *
                      parameters$sigma_hat * parameters$delta_j^(1/2), parameters$p)
  },
  R_VEC = {
    abs_dX_i <- abs(dX_i)
    thresh <- with(parameters, gamma * a * sigma_hat * sqrt(delta_j))
    p <- parameters$p
    result3 <- sapply(1:nrow(parameters), function(i) {
      in_sum <- (abs_dX_i < thresh[i])
      sum(abs_dX_i[in_sum]^p[i])
    })
  },
  times = 10
)

We get:

Unit: milliseconds
  expr      min       lq      mean   median       uq      max neval
    OP 224.8414 235.4075 289.90096 270.2767 347.1727 399.3262    10
  RCPP  14.8172  15.4691  18.83703  16.3979  20.3829  29.6624    10
 R_VEC  28.3136  29.5964  32.82456  31.4124  33.2542  45.8199    10

This gives a huge speedup from only slightly modifying your original R code. It is less than twice as slow as the Rcpp code and can easily be parallelized with parSapply(), as you did before; see the sketch below.
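
For instance, a minimal sketch of such a parallel version (reusing the abs_dX_i, thresh and p objects defined above; the worker count is arbitrary):

library(parallel)

cl <- makePSOCKcluster(4)  # arbitrary number of workers, adjust to your machine
clusterExport(cl, c("abs_dX_i", "thresh", "p"))
result3_par <- parSapply(cl, seq_len(nrow(parameters)), function(i) {
  in_sum <- (abs_dX_i < thresh[i])
  sum(abs_dX_i[in_sum]^p[i])
})
stopCluster(cl)
all.equal(result3, result3_par)  # should be TRUE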

F. Privé
  • Thank you for the effort, it's encouraging to see that R can be quite fast with proper design choices! I implemented @josliber's approach in the original script that I tried to optimize and am able to perform all computations in about 4.5 hours (there's a large overhead involved). I will try your solution as well and see how fast it is, especially in combination with parallelization (although I'm not sure if it's beneficial, will depend on the total duration of one iteration) – zonfl Jan 07 '20 at 13:50
  • Nice! I scaled up to the scale mentioned in the question (600k parameters and 80k values in `dX_i`) and the 2x ratio remained more or less unchanged (724 seconds for my code and 1518 seconds for yours). I would expect the Rcpp code to really shine in cases where thresholds are very small; then the ability to stop calculations once you reach the threshold is especially beneficial. For instance, when I multiply the threshold by 0.01, my code completes in 17 seconds while yours takes 221 seconds. – josliber Jan 07 '20 at 15:23
  • You could probably get most of the speedup from early stopping by sorting `abs(dX_i)` like I do and then using `findInterval` to (quickly) identify the number of elements to sum within the for loop. [[Edit: confirmed: sorting and using `findInterval` gets your approach to 32 seconds in the updated example where I multiplied the thresholds by 0.01]] – josliber Jan 07 '20 at 17:02
  • @josliber that's an interesting idea! I tried to see what happens if I do `dX_i <- sort(abs(dX_i))[1:findInterval(g * 2 * sigmahat * deltaj^(1/2), sort(abs(dX_i)) )]` before running `proc()`. This gives me another speedup. Together with parallelization, I'm able to perform all computations in ~80 minutes! I will now see what happens if I combine this with precomputing the threshold. It's amazing how much of a difference a few milliseconds can make. – zonfl Jan 08 '20 at 21:51
  • @YalDan that doesn't look quite like the expression I was expecting -- make sure to check it's giving you the correct results. I was thinking something like `R_VEC` in this answer but replacing `abs_dX_i <- abs(dX_i)` with `abs_dX_i <- sort(abs(dX_i))`, computing threshold positions with `pos <- findInterval(thresh, abs_dX_i)`, and then just having `sum(head(abs_dX_i, pos[i])^p[i])` in the `sapply` call. – josliber Jan 09 '20 at 14:44
  • @josliber there's something wrong with that expression, you're right. I'll try your approach now – zonfl Jan 09 '20 at 16:02

When I want to speed up hard-to-vectorize code, I often turn to Rcpp. At the end of the day, you are trying to sum up abs(dX_i)^p, limited to the values of abs(dX_i) smaller than the threshold gamma * a * sigma_hat * delta_j^(1/2), and you want to do this for a large number of (p, threshold) pairs. You could accomplish this with:

library(Rcpp)
cppFunction(
"NumericVector proc(NumericVector dX_i, NumericVector thresh, NumericVector p) {
  // dX_i is assumed to be sorted in increasing order (see the call below).
  const int n = thresh.size();
  const int m = dX_i.size();
  NumericVector B(n);
  for (int i=0; i < n; ++i) {
    B[i] = 0;
    for (int j=0; j < m; ++j) {
      if (dX_i[j] < thresh[i]) {
        B[i] += pow(dX_i[j], p[i]);
      } else {
        break;  // all remaining values are >= thresh[i], so stop early
      }
    }
  }
  return B;
}"
)
result2 <- proc(sort(abs(dX_i)), parameters$gamma * parameters$a * parameters$sigma_hat * parameters$delta_j^(1/2), parameters$p)
all.equal(result, result2)
# [1] TRUE

Note that my code sorts the absolute value of dX_i so it can stop the calculation once it encounters the first value exceeding the threshold.

On my machine, I see a 20x speedup from 0.158 seconds for your code to 0.007 seconds for the Rcpp code (measured using system.time).
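
For reference, a minimal sketch of how such a timing comparison can be run (the exact numbers will of course differ by machine):

# Time the original sapply() version
system.time(
  result <- sapply(1:nrow(parameters), function(x) {
    tmp <- parameters[x, ]
    sum(abs(dX_i)^tmp$p * (abs(dX_i) < tmp$gamma * tmp$a * tmp$sigma_hat * sqrt(tmp$delta_j)))
  })
)

# Time the Rcpp version
system.time(
  result2 <- proc(sort(abs(dX_i)),
                  parameters$gamma * parameters$a * parameters$sigma_hat * sqrt(parameters$delta_j),
                  parameters$p)
)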

josliber
  • Thank you so much! The speedup is remarkable, using my code a single iteration took several minutes, while I get my result almost immediately when using your function. I didn't expect such a quick and easy solution, guess it's time to learn some C++ – zonfl Jan 06 '20 at 20:34

One observation is that you actually have a huge number of repeats of each p value within your parameter set, so you can process each p value separately. For a given p, you only need to raise the sorted abs(dX_i) values to that power and take their cumulative sum once; each threshold then just reads off one entry of that cumulative sum.

result4 <- rep(NA, nrow(parameters))
sa_dX_i <- sort(abs(dX_i))
thresh <- parameters$gamma * parameters$a * parameters$sigma_hat * parameters$delta_j^(1/2)
loc <- findInterval(thresh, sa_dX_i)  # number of sorted |dX_i| values at or below each threshold
loc[loc == 0] <- NA  # Handle threshold smaller than everything in dX_i
for (pval in unique(parameters$p)) {
  this.p <- parameters$p == pval
  cs_dX_i_p <- cumsum(sa_dX_i^pval)  # cumulative sums of |dX_i|^p over the sorted values
  result4[this.p] <- cs_dX_i_p[loc[this.p]]
}
result4[is.na(result4)] <- 0  # Handle threshold smaller than everything in dX_i
all.equal(result, result4)
# [1] TRUE

To see this in action, let's scale up the original dataset to what is described in the question (~600k rows of parameters and ~80k values in dX_i):

set.seed(144)
dX_i <- rnorm(80000, 0, 0.0002540362)
p_vec <- seq(0, 1, 0.025)  
gamma_vec <- seq(1, 2, 0.025)     
a_vec <- seq(2, 6, 0.3)
sigma_hat_vec <- c(0.03201636, 0.05771143, 0.07932116, 0.12262327, 0.15074560)
delta_j_vec <- c(0.0000005850109, 0.0000011700217, 0.0000017550326, 0.0000035100651, 0.0000052650977)
parameters <- expand.grid("p" = p_vec, "gamma" = gamma_vec, "a" = a_vec, "sigma_hat" = sigma_hat_vec, "delta_j" = delta_j_vec)
dim(parameters)
# [1] 588350      5
length(unique(parameters$p))
# [1] 41

The speedup is rather dramatic -- this code takes 0.27 seconds on my computer, while the Rcpp code posted in my other answer to this question takes 655 seconds (a 2400x speedup, using pure R!). Clearly this speedup only works if there are relatively few p values (each repeated many times) in the parameters data frame. If each p value is unique, this would likely be much slower than the other approaches proposed.
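
Since the question also mentions a list of ~1000 dX_i vectors, one possible way to reuse this idea per series is to wrap it in a function and lapply over the list (a sketch; B_for_series and dX_list are names introduced only for this example):

# Hypothetical wrapper applying the approach above to a single series
B_for_series <- function(dX_i, parameters) {
  sa_dX_i <- sort(abs(dX_i))
  thresh <- parameters$gamma * parameters$a * parameters$sigma_hat * sqrt(parameters$delta_j)
  loc <- findInterval(thresh, sa_dX_i)
  loc[loc == 0] <- NA
  out <- rep(NA_real_, nrow(parameters))
  for (pval in unique(parameters$p)) {
    this.p <- parameters$p == pval
    out[this.p] <- cumsum(sa_dX_i^pval)[loc[this.p]]
  }
  out[is.na(out)] <- 0
  out
}

# dX_list is a placeholder for the list of series mentioned in the question
# all_results <- lapply(dX_list, B_for_series, parameters = parameters)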

josliber
  • This is incredible @josliber. I ran this code on my dataset of ~95 million rows and speed went from 1 day to 40 minutes to a stunning 12 seconds! Can't thank you enough for your help! Btw, `p` is usually not more than `seq(0, 6, 0.25)` so that shouldn't be an issue. – zonfl Jan 12 '20 at 04:03