Why is Boost's implementation 5-10x slower than R's?

Question

I am building an app that frequently computes the regularized incomplete beta function. The app is written in C++ and calls R::pbeta(). When I tried to multithread the app, some warning messages from R::pbeta() smashed the stack.

So I turned to boost::math::ibeta(). Everything worked fine until I measured the speed. The following C++ file whyIsBoostSlower.cpp implements the regularized incomplete beta function using either R::pbeta() or boost::math::ibeta().

// [[Rcpp::plugins(cpp17)]]
#include <boost/math/special_functions/beta.hpp>
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
using namespace Rcpp;


// Compute the regularized incomplete Beta function.
// [[Rcpp::export]]
NumericVector RIBF(NumericVector q, NumericVector a, NumericVector b, 
                   bool useboost = false)
{
  NumericVector rst(q.size());
  for (int i = 0, iend = q.size(); i < iend; ++i)
  {
    if (useboost) rst[i] = boost::math::ibeta( a[i], b[i], q[i] );
    else          rst[i] = R::pbeta( q[i], a[i], b[i], 1, 0 );
  }
  return rst;
}

In R, we measure the speed of calling the function 300000 times on random numbers:

Rcpp::sourceCpp("whyIsBoostSlower.cpp")


set.seed(123)
N = 300000L
q = runif(N) # Generate quantiles.
a = runif(N, 0, 10) # Generate a in (0, 10)
b = runif(N, 0, 10) # Generate b in (0, 10)


# Use R's pbeta(). This function calls a C wrapper of toms708.c:
#   https://svn.r-project.org/R/trunk/src/nmath/toms708.c
system.time({ Rrst = RIBF(q, a, b, useboost = FALSE) })
# Windows 10 (seconds):
# user  system elapsed 
# 0.11    0.00    0.11 

# RedHat Linux:
# user  system elapsed 
# 0.097   0.000   0.097 


# Use Boost's implementation, which also depends on TOMS 708 by their claim:
#  https://www.boost.org/doc/libs/1_41_0/libs/math/doc/sf_and_dist/html/math_toolkit/special/sf_beta/ibeta_function.html
system.time({ boostRst = RIBF(q, a, b, useboost = TRUE) })
# Windows 10:
# user  system elapsed 
# 0.52    0.00    0.52 

# RedHat Linux:
# user  system elapsed 
# 0.988   0.001   0.993 


range(Rrst - boostRst)
# -1.221245e-15  1.165734e-15

To reproduce the example, one needs to install R and the packages Rcpp and BH (the latter supplies the Boost headers). On Windows, one also needs to install Rtools, which contains a GCC distribution. The optimization flag defaults to -O2.

Both R::pbeta() and boost::math::ibeta() are based on ACM TOMS 708, yet boost::math::ibeta() is 5x slower on Windows and 10x slower on Linux.

I think it might have something to do with setting the Policy argument in boost::math::ibeta(), but how?

Thank you!

FYI, R::pbeta() is defined in R-4.2.3/src/nmath/pbeta.c. It calls bratio(), which is defined in R-4.2.3/src/nmath/toms708.c (https://svn.r-project.org/R/trunk/src/nmath/toms708.c). That file is a C translation of the TOMS 708 Fortran code, done by R's core team.

In contrast, the Boost documentation page for boost::math::ibeta() states: "This implementation is closely based upon "Algorithm 708; Significant digit computation of the incomplete beta function ratios", DiDonato and Morris, ACM, 1992."

Please provide actually comparable measurements and data along with your question! — πάντα ῥεῖ, Jun 09 '23 at 17:20
Did you even check release builds of C++ with optimizations enabled? Is your test set big enough? In other words, measurements are meaningless without context. And have you even checked that the library you use is thread-safe? — Pepijn Kramer, Jun 09 '23 at 17:22
@πάνταῥεῖ I don't understand, the test data are exactly in the code — user2961927, Jun 09 '23 at 17:23
@PepijnKramer The test data are in the code. They are in the same program compiled using g++ -O2 — user2961927, Jun 09 '23 at 17:24
I don't know why this question was closed for not having reproducible code — user20650, Jun 09 '23 at 17:39
@user20650 I guess not many people are familiar with both C++ and R. I modified the question so it should be much easier to understand and reproduce — user2961927, Jun 09 '23 at 17:44
@πάνταῥεῖ I modified the question such that the result can be easily reproduced — user2961927, Jun 09 '23 at 17:44
@PepijnKramer I modified the question such that the result can be easily reproduced — user2961927, Jun 09 '23 at 17:45
I would try `-O3` with `-ffast-math`. See [What does gcc's ffast-math actually do?](https://stackoverflow.com/questions/7420665/what-does-gccs-ffast-math-actually-do) — 273K, Jun 09 '23 at 18:07
Have you used a profiler before? This is usually the goto tool to understand bottlenecks. — Pepijn Kramer, Jun 09 '23 at 18:26
@Eugene That's extremely unlikely. Those Rcpp objects are just references to arrays living in R's memory space — user2961927, Jun 09 '23 at 18:46
@273K That might blow up the error as the incomplete beta function has so many ill-conditioned regions.. But I'll give an attempt.. — user2961927, Jun 09 '23 at 18:48
@PepijnKramer Trying it now. seems there's no quick answer and have to dig deep into the source code.. — user2961927, Jun 09 '23 at 18:49
@Eugene There is some very small overhead as Rcpp ensures proper state saving and resetting of the RNG, wraps tryCatch etc., but it is a minuscule cost that generally never affects any real run-time cost (unless you measure pathological empty functions). Also note we have close to 3000 questions under the `Rcpp` tag here (even though I try hard to get nonsensical ones, or duplicates, deleted). Also a few in the intersection with Boost and R's BH package. — Dirk Eddelbuettel, Jun 09 '23 at 19:09
@273K I just tried - no significant difference (also added `-march=native` because it defaulted to some generic `x86-64`) — sehe, Jun 09 '23 at 23:14
@sehe Thanks for the info. By no significant difference you meant for both the speed and error? I had bad experience with -ffast-math over those heavy math functions.. — user2961927, Jun 10 '23 at 00:22

Dirk Eddelbuettel · Accepted Answer · 2023-06-09T19:20:21.073

As there are Beta, incomplete Beta as well as a Beta distribution (which you hit with R::pbeta()) I want to ensure we compare apples with apples.

So here is a modified version of your code, with distinct functions for simplicity, as well as a comparison to the GSL and a formal benchmark call:

Code

// [[Rcpp::depends(BH)]]
#include <boost/math/special_functions/beta.hpp>

// this also ensures linking with the GSL
// [[Rcpp::depends(RcppGSL)]]
#include <gsl/gsl_sf_gamma.h>

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector bfR(NumericVector a, NumericVector b) {
    int n = a.size();
    NumericVector rst(n);
    for (int i = 0; i<n; i++) {
        rst[i] = R::beta(a[i], b[i]);
    }
    return rst;
}

// [[Rcpp::export]]
NumericVector bfB(NumericVector a, NumericVector b) {
    int n = a.size();
    NumericVector rst(n);
    for (int i = 0; i<n; i++) {
        rst[i] = boost::math::beta( a[i], b[i] );
    }
    return rst;
}

// [[Rcpp::export]]
NumericVector bfG(NumericVector a, NumericVector b) {
    int n = a.size();
    NumericVector rst(n);
    for (int i = 0; i<n; i++) {
        rst[i] = gsl_sf_beta( a[i], b[i] );
    }
    return rst;
}


/*** R

set.seed(123)
N <- 300000L
a <- runif(N, 0, 10) # Generate a in (0, 10)
b <- runif(N, 0, 10) # Generate b in (0, 10)
summary(bfR(a,b) - bfB(a,b))
summary(bfR(a,b) - bfG(a,b))
microbenchmark::microbenchmark(R = bfR(a, b), Boost = bfB(a, b), GSL = bfG(a, b), times=10)

*/

Output

When we Rcpp::sourceCpp() the 'marked' R code section also gets executed:

> Rcpp::sourceCpp("answer.cpp")

> set.seed(123)

> N <- 300000L

> a <- runif(N, 0, 10) # Generate a in (0, 10)

> b <- runif(N, 0, 10) # Generate b in (0, 10)

> summary(bfR(a,b) - bfB(a,b))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.64e-12  0.00e+00  0.00e+00 -5.00e-17  0.00e+00  1.36e-12 

> summary(bfR(a,b) - bfG(a,b))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-1.18e-11 -4.00e-17  0.00e+00  2.10e-16  0.00e+00  1.09e-11 

> microbenchmark::microbenchmark(R = bfR(a, b), Boost = bfB(a, b), GSL = bfG(a, b), times=10)
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
     R  44.9314  45.2773  46.9782  46.1237  49.0056  50.6273    10 a  
 Boost 166.0146 167.2552 171.0441 169.5741 175.0108 180.6520    10  b 
   GSL  58.3259  58.5101  61.0364  59.6556  62.4862  67.5316    10   c
>

At this point I can only guess that Boost either jumps through some extra hoops or suffers some costs from abstraction, as it loses to both R and the GSL. (And I note that on its documentation page the results are compared (for accuracy) to the GNU GSL as well as to R: https://www.boost.org/doc/libs/1_82_0/libs/math/doc/html/math_toolkit/sf_beta/beta_function.html)

Thanks Dirk! The link is helpful. So I guess it's probably because Boost runs much lengthier approximation series to push the error rate lower than R's — user2961927, Jun 09 '23 at 19:37
My pleasure. Might be worthwhile to clarify what you were after: Beta distribution, or function, or ... — Dirk Eddelbuettel, Jun 09 '23 at 19:53
Formally I was after the Regularized Incomplete Beta function, which happens to be the Beta CDF in R. Your code measures the speeds of "pure" Beta functions, but I think the reason for Boost being slower is the same. For my work the extreme precision offered by Boost is not worth that 5-10x slowdown. I am still working on a thread-safe workaround, probably settling for that "textbook algorithm" implemented in "Numerical Recipes". — user2961927, Jun 10 '23 at 00:18
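
For reference, the "textbook algorithm" mentioned in the comment above is usually taken to mean the continued-fraction expansion of the regularized incomplete beta function, evaluated with the modified Lentz method. The following is a minimal sketch of that technique, not the code from any particular book or from R or Boost. The function names, the iteration cap and the stopping tolerance are assumptions chosen for illustration. It keeps no global state of its own; one caveat is that on POSIX systems `std::lgamma` may write to the global `signgam`, so a strictly thread-safe build may prefer `lgamma_r`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Continued-fraction factor of I_x(a, b), evaluated with the modified
// Lentz method. maxIter and the tolerance are illustrative assumptions.
static double betaContinuedFraction(double a, double b, double x)
{
    const int    maxIter = 300;
    const double eps  = std::numeric_limits<double>::epsilon();
    const double tiny = std::numeric_limits<double>::min() / eps;

    const double qab = a + b, qap = a + 1.0, qam = a - 1.0;
    double c = 1.0;
    double d = 1.0 - qab * x / qap;
    if (std::fabs(d) < tiny) d = tiny;
    d = 1.0 / d;
    double h = d;
    for (int m = 1; m <= maxIter; ++m) {
        const int m2 = 2 * m;
        // Even-numbered term of the continued fraction.
        double aa = m * (b - m) * x / ((qam + m2) * (a + m2));
        d = 1.0 + aa * d; if (std::fabs(d) < tiny) d = tiny;
        c = 1.0 + aa / c; if (std::fabs(c) < tiny) c = tiny;
        d = 1.0 / d;
        h *= d * c;
        // Odd-numbered term of the continued fraction.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
        d = 1.0 + aa * d; if (std::fabs(d) < tiny) d = tiny;
        c = 1.0 + aa / c; if (std::fabs(c) < tiny) c = tiny;
        d = 1.0 / d;
        const double delta = d * c;
        h *= delta;
        if (std::fabs(delta - 1.0) <= eps) break;  // converged
    }
    return h;
}

// Regularized incomplete beta I_x(a, b) for a, b > 0 and 0 <= x <= 1.
// Argument order follows R::pbeta(q, a, b, ...), not Boost's (a, b, x).
double regularizedIncompleteBeta(double x, double a, double b)
{
    assert(a > 0.0 && b > 0.0 && x >= 0.0 && x <= 1.0);
    if (x == 0.0) return 0.0;
    if (x == 1.0) return 1.0;
    // Prefactor x^a (1 - x)^b / B(a, b), computed in log space.
    const double front = std::exp(std::lgamma(a + b) - std::lgamma(a)
                                  - std::lgamma(b) + a * std::log(x)
                                  + b * std::log1p(-x));
    // Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where the
    // continued fraction converges faster.
    if (x < (a + 1.0) / (a + b + 2.0))
        return front * betaContinuedFraction(a, b, x) / a;
    return 1.0 - front * betaContinuedFraction(b, a, 1.0 - x) / b;
}
```

A sketch like this trades accuracy in ill-conditioned regions for simplicity, so it should be compared against `R::pbeta()` on the same random inputs as in the question before being relied on.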
