I'm writing a small library for statistical sampling which needs to run as fast as possible. In profiling I discovered that around 40% of the time taken in the function is spent computing Stirling's approximation for the logarithm of the factorial. I'm focusing my optimization efforts on that piece of it. Here's my code (which uses MPFR):
const double AL[8] =
{ 0.0, 0.0, 0.6931471806, 1.791759469, 3.178053830, 4.787491743,
6.579251212, 8.525161361 };
void HGD::mpfr_afc(mpfr_t &ret, const mpfr_t &val){
if(mpfr_cmp_ui(val, 7) <= 0){
mpfr_set_d(ret, AL[mpfr_get_ui(val, MPFR_RNDN)], MPFR_RNDN);
}else{
mpfr_set(ret, val, MPFR_RNDN);
mpfr_add_d(ret, ret, 0.5, MPFR_RNDN);
mpfr_log(LV, val, MPFR_RNDN);
mpfr_mul(ret, LV, ret, MPFR_RNDN);
mpfr_sub(ret, ret, val, MPFR_RNDN);
mpfr_add_d(ret, ret, 0.399089934, MPFR_RNDN);
}
}
I have a couple different ideas:
- Precompute more than the first 8 inputs to the function.
- Optimize the math (use a coarser approximation for smaller precision)
- Use multiple threads to compute on different inputs in parallel
- Switch to native arithmetic when numbers can fit in machine data types
Are there other approaches I could take?