
I am interested in calculating the following quantity

B(i) = \sum_{j < i} (x_i - x_j) e^{-\beta (x_i - x_j)}

which is part of computing the gradient with respect to one of the parameters of a Hawkes process likelihood (more information can be found here: http://www.ism.ac.jp/editsec/aism/pdf/031_1_0145.pdf).

For the sake of the problem, beta is just a constant, and x_i is my i-th data point.

I am trying to calculate the above quantity with Rcpp, using the following chunk of code:

for (int i = 1; i < x.size(); i++) {
  double temp = 0;
  for (int j = 0; j <= i - 1; j++) {
    temp += (x[i] - x[j]) * exp(-beta * (x[i] - x[j]));
  }
  B(i - 1) = temp;  // store B(i) in a pre-allocated NumericVector B
}

but it is highly inefficient and slow. Any suggestions on how this computation could be sped up?


3 Answers


Standard arithmetic operations (+, -, etc.) are very fast in C++. Yet, exp is much more expensive to compute, and therefore slower.

So, if we want a substantial performance improvement, the most promising option is to precompute the exp computations.

Here, B(i) = \sum_{j < i} (x_i - x_j) e^{-\beta(x_i - x_j)} is equivalent to B(i) = \sum_{j < i} (x_i - x_j) e^{\beta x_j} / e^{\beta x_i}, so you can precompute the exponential once per index (and hoist the factor that depends on i out of the inner loop). Refactoring further enables more precomputation.
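
To see why cumulative sums suffice, expand the sum (a short derivation in the same notation; the j = i term equals (x_i - x_i) e^{\beta x_i} = 0, so inclusive sums over j <= i give the same result):

B(i) = ( x_i \sum_{j <= i} e^{\beta x_j} - \sum_{j <= i} x_j e^{\beta x_j} ) / e^{\beta x_i}

Both sums are prefix sums (cumsum) of the precomputed vectors e^{\beta x} and x e^{\beta x}, which is what hawk_process_cache_4 and hawk_process_cache_5 below compute, turning the O(n^2) double loop into O(n). So, I put here the two previous solutions, then my incremental solutions: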

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_org(Rcpp::NumericVector x, double beta = 3) {

  int n = x.size();
  Rcpp::NumericVector B = Rcpp::no_init( n - 1);
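  // B has n - 1 entries: B(i) is stored at index i - 1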

  for (int i = 1; i < n; i++) {

    double temp = 0;

    for (int j = 0; j <= i - 1; j++) {
      temp += (x[i] - x[j]) * exp(-beta * (x[i] - x[j]));
    }

    B(i - 1) = temp;
  }

  return B;
}

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache(Rcpp::NumericVector x, double beta = 3) {

  int n = x.size();
  Rcpp::NumericVector B = Rcpp::no_init( n - 1);

  double x_i;
  for (int i = 1; i < n; ++i) {

    double temp = 0;
    x_i = x[i];

    for (int j = 0; j <= i - 1; ++j) {
      temp += (x_i - x[j]) * 1 / exp(beta * (x_i - x[j]));
    }

    B(i - 1) = temp;
  }

  return B;
}

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_2(Rcpp::NumericVector x, 
                                         double beta = 3) {

  int i, j, n = x.size();
  Rcpp::NumericVector B(n);
  Rcpp::NumericVector x_exp = exp(beta * x);

  double temp;
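  // exp() is now called only n times (precomputed in x_exp), but the
  // division by x_exp[i] still happens inside the inner loop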
  for (i = 1; i < n; i++) {

    temp = 0;
    for (j = 0; j < i; j++) {
      temp += (x[i] - x[j]) * x_exp[j] / x_exp[i];
    }

    B[i] = temp;
  }

  return B;
}

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_3(Rcpp::NumericVector x, 
                                         double beta = 3) {

  int i, j, n = x.size();
  Rcpp::NumericVector B(n);
  Rcpp::NumericVector x_exp = exp(beta * x);

  double temp;
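  // same as cache_2, but the division by x_exp[i] is hoisted out of the
  // inner loop: one division per i instead of one per (i, j) pair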
  for (i = 1; i < n; i++) {

    temp = 0;
    for (j = 0; j < i; j++) {
      temp += (x[i] - x[j]) * x_exp[j];
    }

    B[i] = temp / x_exp[i];
  }

  return B;
}

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_4(Rcpp::NumericVector x, 
                                         double beta = 3) {

  Rcpp::NumericVector exp_pre = exp(beta * x);
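  // inclusive prefix sums; the j = i term is (x_i - x_i) * e^(beta*x_i) = 0,
  // so summing over j <= i equals the strict j < i sum (and B[0] = 0)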
  Rcpp::NumericVector exp_pre_cumsum = cumsum(exp_pre);
  Rcpp::NumericVector x_exp_pre_cumsum = cumsum(x * exp_pre);
  return (x * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
}

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_5(Rcpp::NumericVector x, 
                                         double beta = 3) {

  int n = x.size();
  NumericVector B(n);
  double exp_pre, exp_pre_cumsum = 0, x_exp_pre_cumsum = 0;
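  // single pass with scalar running sums instead of materialized cumsum vectors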

  for (int i = 0; i < n; i++) {
    exp_pre = exp(beta * x[i]);
    exp_pre_cumsum += exp_pre;
    x_exp_pre_cumsum += x[i] * exp_pre;
    B[i] = (x[i] * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
  }

  return B;
}



/*** R
set.seed(111)

x = rnorm(1e3)

all.equal(
  hawk_process_org(x),
  hawk_process_cache(x)
)
all.equal(
  hawk_process_org(x),
  hawk_process_cache_2(x)[-1]
)
all.equal(
  hawk_process_org(x),
  hawk_process_cache_3(x)[-1]
)

all.equal(
  hawk_process_org(x),
  hawk_process_cache_4(x)[-1]
)

all.equal(
  hawk_process_org(x),
  hawk_process_cache_5(x)[-1]
)

microbenchmark::microbenchmark(
  hawk_process_org(x),
  hawk_process_cache(x),
  hawk_process_cache_2(x),
  hawk_process_cache_3(x),
  hawk_process_cache_4(x),
  hawk_process_cache_5(x)
) 
*/

Benchmark for x = rnorm(1e3):

Unit: microseconds
                    expr       min         lq        mean     median         uq       max neval   cld
     hawk_process_org(x) 19801.686 20610.0365 21017.89339 20816.1385 21157.4900 25548.042   100    d 
   hawk_process_cache(x) 20506.903 21062.1370 21534.47944 21297.8710 21775.2995 26030.106   100     e
 hawk_process_cache_2(x)  1895.809  2038.0105  2087.20696  2065.8220  2103.0695  3212.874   100   c  
 hawk_process_cache_3(x)   430.084   458.3915   494.09627   474.2840   503.0885  1580.282   100  b   
 hawk_process_cache_4(x)    50.657    55.2930    71.60536    57.6105    63.5700  1190.260   100 a    
 hawk_process_cache_5(x)    43.373    47.0155    60.43775    49.6640    55.6235   842.288   100 a    

This is much more effective than trying to gain nanoseconds from small optimizations that are likely to make your code more difficult to read.


But still, let's try the optimizations proposed by @coatless on my very last solution:

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_6(Rcpp::NumericVector x, 
                                         double beta = 3) {

  int n = x.size();
  NumericVector B = Rcpp::no_init(n);
  double x_i, exp_pre, exp_pre_cumsum = 0, x_exp_pre_cumsum = 0;
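  // as cache_5, plus no_init (skips zero-filling B) and x[i] cached in x_i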

  for (int i = 0; i < n; ++i) {
    x_i = x[i];
    exp_pre = exp(beta * x_i);
    exp_pre_cumsum += exp_pre;
    x_exp_pre_cumsum += x_i * exp_pre;
    B[i] = (x_i * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
  }

  return B;
}

Benchmark for x = rnorm(1e6):

Unit: milliseconds
                    expr      min       lq     mean   median       uq       max neval cld
 hawk_process_cache_5(x) 42.52886 43.53653 45.28427 44.46688 46.74129  57.38046   100   a
 hawk_process_cache_6(x) 42.14778 43.19054 45.93252 44.28445 46.51052 153.30447   100   a

Still not very convincing...

– F. Privé
  • These answers are not equivalent. You dropped the subsetting inside the C++ function and left it in R. So the benchmarks displayed are bogus. Though, I am sure there are gains over the original method. – coatless Jun 10 '18 at 12:36
  • Also, I think you could _increase_ the speed by using `no_init(size)` to initialize the `NumericVectors` without the default fill of 0's. – coatless Jun 10 '18 at 14:35
  • @coatless For the first comment, you are talking about the `[-1]`? For the second comment, I edited `hawk_process_cache_6` to use `no_init`. – F. Privé Jun 10 '18 at 17:33
  • Yes, you are not subsetting within the _C++_ function, e.g. `hawk_process_org(x) != hawk_process_cache_5(x)`. Your benchmark is biased as a result. I think the `cumsum` speed-up will still greatly improve the results, as it reduces the `exp()` calls from N^2 to N by effectively caching them. I would also suggest `no_init()` for your own variants; I'm sure that would speed up the process more. – coatless Jun 10 '18 at 17:37
  • I computed a vector of `n` elements because I think this is what the OP needs. Actually, it is more computation than just `n - 1`. Also, using `no_init` in the last solution didn't really change the benchmark. I think those are not the optimizations that a developer should focus on. – F. Privé Jun 10 '18 at 18:21
  • It's off by one with `n`: one place disappears, since this is a differencing op with `x[i] - x[j]`. http://stat.ethz.ch/R-manual/R-devel/library/base/html/diff.html – coatless Jun 10 '18 at 18:24
  • @F.Privé, it is true that `no_init` has little effect here, but consider a matrix with a hundred million rows and hundreds of columns... I have seen `no_init_matrix` decrease running time by 40% in such cases. Still though, nice answer. – Joseph Wood Jun 10 '18 at 23:40

Interesting question. In my tests combining the two answers does give a further performance boost (benchmarks further down):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector hawk_process_cache_combined(NumericVector x, 
                                         double beta = 3) {

  int n = x.size();
  NumericVector B = Rcpp::no_init(n-1);
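  // seed the running sums with the i = 0 term so the loop starts at i = 1
  // and fills exactly the n - 1 entries B(1), ..., B(n - 1)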
  double exp_pre(exp(beta * x[0]));
  double exp_pre_cumsum(exp_pre);
  double x_exp_pre_cumsum(x[0] * exp_pre);
  double x_i;

  for (int i = 1; i < n; ++i) {
    x_i = x[i];
    exp_pre = exp(beta * x_i);
    exp_pre_cumsum += exp_pre;
    x_exp_pre_cumsum += x_i * exp_pre;
    B[i-1] = (x_i * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
  }

  return B;
}

all.equal(
  hawk_process_org(x),
  hawk_process_cache_combined(x)
)
#> [1] TRUE

Now, while the original formulation is "embarrassingly parallel", this is no longer the case for this expression. However, prefix scan algorithms like cumsum can also be parallelized, and libraries like ArrayFire provide GPU interfaces to such algorithms. Using RcppArrayFire, one can write the following based on F. Privé's hawk_process_cache_4:

// [[Rcpp::depends(RcppArrayFire)]]
#include <RcppArrayFire.h>
// [[Rcpp::export]]
af::array hawk_process_af(RcppArrayFire::typed_array<f32> x, 
                          double beta = 3) {

  af::array exp_pre = exp(beta * x);
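  // af::accum is ArrayFire's inclusive prefix sum (cumsum), run on the device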
  af::array exp_pre_cumsum = af::accum(exp_pre);
  af::array x_exp_pre_cumsum = af::accum(x * exp_pre);
  af::array result = (x * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
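  // drop element 0 (always 0) to match hawk_process_org's n - 1 results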
  return result(af::seq(1, af::end));
}

Here the results are not exactly equal, since my driver/card only supports single precision floats:

all.equal(
  hawk_process_org(x),
  hawk_process_af(x)
)
#> [1] "Mean relative difference: 3.437819e-07"

With double precision one would write f64 above and obtain identical results. Now for the benchmarks:

set.seed(42)
x <- rnorm(1e3)
microbenchmark::microbenchmark(
  hawk_process_af(x),
  hawk_process_cache_combined(x),
  hawk_process_cache_5(x)[-1]
) 
#> Unit: microseconds
#>                            expr     min       lq      mean   median      uq      max neval
#>              hawk_process_af(x) 245.281 277.4625 338.92232 298.5410 346.576 1030.045   100
#>  hawk_process_cache_combined(x)  35.343  39.0120  43.69496  40.7770  45.264    84.242   100
#>     hawk_process_cache_5(x)[-1]  52.408  57.8580  65.55799  60.5265  67.965  125.864   100
x <- rnorm(1e6)
microbenchmark::microbenchmark(
  hawk_process_af(x),
  hawk_process_cache_combined(x),
  hawk_process_cache_5(x)[-1]
)
#> Unit: milliseconds
#>                            expr      min       lq     mean   median       uq       max neval
#>              hawk_process_af(x) 27.54936 28.42794 30.93452 29.20025 32.40667  49.41888   100
#>  hawk_process_cache_combined(x) 34.00380 36.84497 40.74862 39.03649 41.85902 111.51628   100
#>     hawk_process_cache_5(x)[-1] 47.02501 53.24702 57.94747 55.35018 58.42097 130.89737   100  

So for small vectors the combined approach is faster, while for longer ones offloading to the GPU pays off. And all this not with some high-powered GPU, but with simple on-board graphics:

RcppArrayFire::arrayfire_info()
#> ArrayFire v3.5.1 (OpenCL, 64-bit Linux, build 0a675e8)
#> [0] BEIGNET: Intel(R) HD Graphics Skylake ULT GT2, 4096 MB
– Ralf Stubner

This is an O(N^2) operation without factoring in the cost of exp. Any tweaks are likely to yield minimal improvements.

A few quick suggestions:

  • cache the value of x[i] in the outer loop, as you are repeatedly subsetting it in the inner loop.
  • switch from using exp(-beta * ..) to 1/exp(beta*(x ... ))
  • use ++i instead of i++ to avoid a slight performance hiccup, since the post-increment form makes a copy of i.

Original code:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_org(Rcpp::NumericVector x, double beta = 3) {

  int n = x.size();
  Rcpp::NumericVector B = Rcpp::no_init( n - 1);

  for (int i = 1; i < n; i++) {

    double temp = 0;

    for (int j = 0; j <= i - 1; j++) {
      temp += (x[i] - x[j]) * exp(-beta * (x[i] - x[j]));
    }

    B(i - 1) = temp;
  }

  return B;
}

Modified code:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache(Rcpp::NumericVector x, double beta = 3) {

  int n = x.size();
  Rcpp::NumericVector B = Rcpp::no_init( n - 1);

  double x_i;
  for (int i = 1; i < n; ++i) {

    double temp = 0;
    x_i = x[i];

    for (int j = 0; j <= i - 1; ++j) {
      temp += (x_i - x[j]) * 1 / exp(beta * (x_i - x[j]));
    }

    B(i - 1) = temp;
  }

  return B;
}

Test

set.seed(111)

x = rnorm(1e4)

all.equal(
  hawk_process_org(x),
  hawk_process_cache(x)
)
#> [1] TRUE

bench_func = microbenchmark::microbenchmark(
  hawk_process_org(x),
  hawk_process_cache(x)
)

bench_func
#> Unit: milliseconds
#>                 expr      min       lq     mean   median       uq      max   neval
#> hawk_process_org(x)   436.5349 465.9674 505.9606 481.4703 500.6652 894.7477   100
#> hawk_process_cache(x) 446.0499 454.9098 485.3830 468.6580 494.9457 799.0940   100

So, you get marginally better results from these recommendations.

– coatless