
For a game that I'm working on, I need the ability to run millions of "exp" calls on vectors. Basically:

#include <math.h>

void vector_exp(const double *x, double *result, int n)
{
    for (int i = 0; i < n; i++) result[i] = exp(x[i]);
}

For my specific case, inputs are all in -50..+50. I need double precision, with 8 decimal digits matching the current ‘exp’, to pass test cases.

I have the same challenge with 'log'; its input range is 1e-7 to 1e7.

I would like to use AVX-512 instructions, which should (in theory) be able to process 8 double-precision values at a time. I've retrieved the glibc code (both the "C" version and the ".S" version built for AVX), but I'm not sure how to move forward from here.

https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu

dash-o
  • How accurate do you need the answer to be? Do you know anything about the range of possible inputs? – Nate Eldredge May 21 '23 at 03:18
  • Similar questions: https://stackoverflow.com/questions/48863719/fastest-implementation-of-exponential-function-using-avx, https://stackoverflow.com/questions/40619431/where-can-i-find-an-avx-exponential-double-precision-function, https://stackoverflow.com/questions/53280563/how-to-generate-simd-code-for-math-function-exp-using-openmp, https://stackoverflow.com/questions/53872265/how-do-you-process-exp-with-sse2. There are several links to libraries that do this, including for instance https://www.agner.org/optimize/vectorclass.pdf – Nate Eldredge May 21 '23 at 03:21
  • @NateEldredge: AVX-512 adds `_mm256_getmant_pd` to extract the mantissa, as an optimization over bithacks to separate exponent and mantissa manually, but yeah that's something you can generally drop in to existing exp / log implementations that do one thing with the exponent and a polynomial approximation over a mantissa range like [1.0, 2.0) or [0.5, 1.0) or whatever. – Peter Cordes May 21 '23 at 03:25
  • See also [Efficient implementation of log2(__m256d) in AVX2](https://stackoverflow.com/q/45770089). Nate's first link, [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/q/48863719), has some good (IIRC) code for single-precision `float`, including njuffa's which has an FMA version. – Peter Cordes May 21 '23 at 03:28
  • GCC should be able to auto-vectorize this if you let it use a vector math library, like glibc's own libmvec. – Peter Cordes May 21 '23 at 03:30
  • @NateEldredge. Must be accurate for all “normal” values. No need to handle nan, inf, …. Double precision is a must. Test cases check against pre-calculated values with exp; 7-8 digits should match – dash-o May 21 '23 at 10:44
  • @PeterCordes. I tried gcc7 - the best it did for me was to unroll, but it still made calls to exp with individual values one at a time (-ftree-vectorize). – dash-o May 21 '23 at 10:46
  • Apparently GCC and clang are only willing to auto-vectorize `exp` with `-ffast-math`. https://godbolt.org/z/4MGjcKcnh . (Also, `result` can't be `const`.) See also https://sourceware.org/glibc/wiki/libmvec – Peter Cordes May 21 '23 at 10:53
  • @PeterCordes. I need the implementation of the log2 function you pointed to, but for ‘exp’ and ‘log’. – dash-o May 21 '23 at 10:54
  • Since you need high precision, I'd recommend Agner Fog's VCL implementations; they're MIT or Apache licensed these days. They use tricks to maintain close to full precision of `double`. Sounds weird that you say you only need that to pass test-cases, though; if your *real* use-case could afford less precision, you can use lower degree polynomials and run faster. It's directly a speed vs. precision tradeoff. The version I implemented used the 6-degree polynomial from http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html (for `float`, but would work for `double`.) – Peter Cordes May 21 '23 at 11:03
  • @PeterCordes. I agree it’s weird (test cases). The test cases are very strict. The application is in production, so it’s easier to pass the tests (replicating old numbers) than to convince a risk-averse production team that it’s good enough. Part of life in a corporate environment. – dash-o May 21 '23 at 11:16
  • Well, good luck if you need to *exactly* replicate glibc's scalar results; IDK if the `libmvec` implementation does that or not. You might have to dig into the scalar implementation and use its polynomial constants exactly if you want the same results in SIMD; I wouldn't be surprised if VCL and most other existing vector `exp()` examples do at least something different from glibc's scalar implementation and get a result different in at least the low bit of the mantissa for some inputs. – Peter Cordes May 21 '23 at 11:48
  • AVX512 has the intrinsic _mm512_exp2a23_pd which calculates 2^x. If you're using e^x and log(x) to calculate x^y, then you can do it by using base 2 instead. I don't think there's a single instruction for log_2 though. There's _mm512_getexp_pd which is floor(log_2(x)), which is a start, but you'll need to do some work for the fractional part. – Simon Goater May 21 '23 at 12:42
  • If you’ll be implementing that function yourself, consider porting the version from OpenBSD; here’s the `exp` function: https://github.com/openbsd/src/blob/master/lib/libm/src/e_exp.c Unlike glibc, the BSD standard library is written in C, not assembly, and the source code is very readable. – Soonts May 21 '23 at 15:51
  • @Soonts I looked at the BSD code; it has many branches, which I do not know how to implement in SIMD. – dash-o May 21 '23 at 18:57
  • @SimonGoater Good deal; the code that I have is using exp based on external input, which I … – dash-o May 21 '23 at 18:59
  • @dash-o Rework branches into conditional moves. For your use case, the conditional move is the `_mm256_blendv_pd` instruction. For an example of that approach, see this FP32 cubic root function: https://github.com/Const-me/CbrtPs/blob/master/CbrtPs.hpp The branchy parts with these conditional moves were ported from OpenBSD’s `cbrtf()` function. – Soonts May 21 '23 at 19:06
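
An editorial note on that last suggestion: below is a minimal sketch (C++ with AVX-512 intrinsics) of what "rework branches into conditional moves" can look like using mask registers, the 512-bit analogue of `_mm256_blendv_pd`. The `clamp_exp_input` helper and the -50..+50 clamp are made up for illustration; they are not taken from OpenBSD or any of the linked code.

#include <immintrin.h>

// Scalar branchy code:  if (x < -50) x = -50; else if (x > 50) x = 50;
// SIMD version: build a per-lane mask, then select per lane instead of branching.
static inline __m512d clamp_exp_input(__m512d x) {
    const __m512d lo = _mm512_set1_pd(-50.0);
    const __m512d hi = _mm512_set1_pd(50.0);
    __mmask8 below = _mm512_cmp_pd_mask(x, lo, _CMP_LT_OQ); // lanes where x < -50
    __mmask8 above = _mm512_cmp_pd_mask(x, hi, _CMP_GT_OQ); // lanes where x > +50
    x = _mm512_mask_blend_pd(below, x, lo);                 // take lo where below is set
    x = _mm512_mask_blend_pd(above, x, hi);                 // take hi where above is set
    return x;
}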
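
On the libmvec comments above: a sketch of the auto-vectorization route, assuming glibc's libmvec and GCC (the exact flags are an assumption; check your toolchain). Note that `-ffast-math` relaxes FP rules, which may conflict with the bit-exactness requirement discussed in this thread.

// Build (assumption):  gcc -O3 -march=skylake-avx512 -ffast-math -c vector_exp_autovec.c
#include <math.h>

void vector_exp_autovec(const double *x, double *result, int n)
{
    // With -ffast-math, GCC can replace this loop body with calls to
    // libmvec's packed exp (the 8-wide AVX-512 double variant), per the
    // Godbolt link in the comments above.
    for (int i = 0; i < n; i++) result[i] = exp(x[i]);
}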

1 Answer


I'm sure the other answers are better than mine, but running a very quick and dirty polynomial approximation, I end up with these.

#include <immintrin.h>

inline __m512d exp2(const __m512d x) {
    const __m512d a = _mm512_set1_pd(0.000217549227054);
    const __m512d b = _mm512_set1_pd(0.00124218531444);
    const __m512d c = _mm512_set1_pd(0.00968102455999);
    const __m512d d = _mm512_set1_pd(0.0554821818101);
    const __m512d e = _mm512_set1_pd(0.240230073528);
    const __m512d f = _mm512_set1_pd(0.693146979806);
    const __m512d g = _mm512_set1_pd(1.0);
    const __m512d fx = _mm512_floor_pd(x);  // integer part
    const __m512d X = _mm512_sub_pd(x, fx); // fractional part
    __m512d y = _mm512_fmadd_pd(a, X, b);
    y = _mm512_fmadd_pd(y, X, c);
    y = _mm512_fmadd_pd(y, X, d);
    y = _mm512_fmadd_pd(y, X, e);
    y = _mm512_fmadd_pd(y, X, f);
    y = _mm512_fmadd_pd(y, X, g);      // polynomial approximation over [0,1)
    return _mm512_scalef_pd(y, fx);    // scale by 2^integer
}

inline __m512d exp(const __m512d x) {
    return exp2(_mm512_mul_pd(x, _mm512_set1_pd(1.442695040888963387)));
}
inline __m512d log2(const __m512d x) {
    const __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
    const __m512d a = _mm512_set1_pd(0.0146498917256);
    const __m512d b = _mm512_set1_pd(-0.178725976271);
    const __m512d c = _mm512_set1_pd(0.953841083567);
    const __m512d d = _mm512_set1_pd(-2.92298892586);
    const __m512d e = _mm512_set1_pd(5.68725545823);
    const __m512d f = _mm512_set1_pd(-7.4092580291);
    const __m512d g = _mm512_set1_pd(7.09194627711);
    const __m512d h = _mm512_set1_pd(-3.23671917705);
    __m512d y = _mm512_fmadd_pd(a, m, b);
    y = _mm512_fmadd_pd(y, m, c);
    y = _mm512_fmadd_pd(y, m, d);
    y = _mm512_fmadd_pd(y, m, e);
    y = _mm512_fmadd_pd(y, m, f);
    y = _mm512_fmadd_pd(y, m, g);
    y = _mm512_fmadd_pd(y, m, h);  // poly approximation over [1,2) mantissa
    return _mm512_add_pd(y, _mm512_getexp_pd(x));
}

inline __m512d log(const __m512d x) {
    return _mm512_mul_pd(log2(x), _mm512_set1_pd(0.693147180559945286));
}
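
For completeness, here is a minimal sketch (not part of the original answer) of wrapping the question's vector_exp loop around the exp() above, assuming C++ and AVX-512F; the tail is handled with a masked load/store so there is no scalar fallback that could produce different results:

void vector_exp(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512d v = _mm512_loadu_pd(x + i);             // 8 doubles per iteration
        _mm512_storeu_pd(result + i, exp(v));
    }
    if (i < n) {                                        // 1..7 leftover elements
        __mmask8 k = (__mmask8)((1u << (n - i)) - 1);   // mask of valid lanes
        __m512d v = _mm512_maskz_loadu_pd(k, x + i);    // masked-off lanes read as 0.0
        _mm512_mask_storeu_pd(result + i, k, exp(v));
    }
}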

Out-of-order execution across independent exp2() or log2() operations can hide the latency of the FMA dependency chains created by using Horner's rule for the 6th- and 7th-degree polynomials.
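
To lean on that overlap a little more explicitly, one option (again a sketch, not from the original answer) is to unroll by two vectors per iteration so that two independent Horner chains are in flight at once:

void vector_exp_unrolled(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        // The two exp() evaluations are independent, so out-of-order
        // execution can interleave their FMA chains.
        __m512d a = exp(_mm512_loadu_pd(x + i));
        __m512d b = exp(_mm512_loadu_pd(x + i + 8));
        _mm512_storeu_pd(result + i, a);
        _mm512_storeu_pd(result + i + 8, b);
    }
    vector_exp(x + i, result + i, n - i);   // remainder: the wrapper sketched above
}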


See also Agner Fog's VCL implementations which aim for high precision, close to full precision of double:

  • double-precision exp_d template, supporting a few bases including 2.0, and exp(x)-1 vs. exp(x). (See the exp2 caller for the right template params; a usage sketch follows after this list.)

    Uses a 13th-order Taylor series, which comments in the code indicate is faster than the alternate version that uses a Pade expansion (a ratio of two polynomials). One division per many FMAs isn't a disaster for throughput, especially if you have lots of surrounding code that also does some non-division work with each result, but doing just this might be too much division per FMA.

  • double-precision log_d template. This does use a ratio of 5th-order polynomials for the mantissa. Template params support log(x) vs. log(x+1) to avoid losing precision for inputs near zero. It only does natural log (base e), so you'll need to scale the result by 1/ln(2) if you want log2.
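
A hedged usage sketch of VCL, assuming version 2 of the library with vectorclass.h and vectormath_exp.h on the include path (exp(Vec8d) and log(Vec8d) are the documented entry points, rather than instantiating the exp_d / log_d templates directly):

#include <cmath>
#include "vectorclass.h"       // Agner Fog's VCL
#include "vectormath_exp.h"    // exp(), log() and related functions for Vec8d

void vector_exp_vcl(const double *x, double *result, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        Vec8d v;
        v.load(x + i);                    // 8 doubles, AVX-512 when compiled for it
        exp(v).store(result + i);         // close-to-full-precision double exp
    }
    for (; i < n; i++) result[i] = std::exp(x[i]);  // scalar remainder
}

Whether this reproduces glibc's scalar exp bit-for-bit is a separate question, as the comments above point out.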

robthebloke