
I am looking for a fast, low-precision (~1e-3) SSE exponential function.

I came across this great answer:

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

Based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
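For readers unfamiliar with the trick: the SSE code is a vectorized form of Schraudolph's scalar approximation, which builds the IEEE-754 bit pattern of the result directly in an integer. A scalar sketch (my reconstruction; note that `static_cast` truncates where `_mm_cvtps_epi32` rounds to nearest, which is close enough for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar form of Schraudolph's trick: exp(x) = 2^(x/ln 2), so scale x by
// (1 << 23) / ln(2) so the integer part lands in the float's exponent field,
// add the exponent bias 127 << 23, and subtract an empirical correction
// (298765) that balances the relative error. Reinterpreting the integer's
// bits as a float yields the approximation.
inline float fast_exp_scalar(float x)
{
    int32_t i = static_cast<int32_t>(12102203.0f * x)  // 12102203 ~= (1 << 23) / ln(2)
              + (127 * (1 << 23) - 298765);
    float f;
    std::memcpy(&f, &i, sizeof f);  // bit-cast int32 -> float
    return f;
}
```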

Now I would need a "double precision" version: __m128d FastExpSseDouble (__m128d x). This is because I don't control the input and output precision, which happen to be double precision, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU resources.

What changes would be needed?

I naively tried this:

__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

__m128d FastExpSseDouble(__m128d x) {

    #define S 52
    #define C (1llu << S) / log(2)

    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);

    auto y = double_to_uint64(_mm_mul_pd(a, x));

    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}
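For context, the `double_to_uint64` helper above is the standard magic-number conversion. In scalar form (names here are mine) it works like this, and it is only valid for inputs in [0, 2^52) -- which is one reason the naive FastExpSseDouble misbehaves for negative scaled values:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar form of the magic-number trick used above: adding 2^52 to a double
// in [0, 2^52) forces its exponent field to a known constant, so the numeric
// value ends up verbatim in the mantissa bits (fractional inputs are rounded
// to nearest by the FP addition); XORing away the constant exponent bits
// leaves the plain integer. Outside [0, 2^52) -- including all negative
// inputs -- the result is wrong.
inline uint64_t double_to_uint64_scalar(double x)
{
    double shifted = x + 4503599627370496.0;  // x + 2^52
    uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    uint64_t magic;
    double m = 4503599627370496.0;            // bit pattern 0x4330000000000000
    std::memcpy(&magic, &m, sizeof magic);
    return bits ^ magic;
}
```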

Of course this returns garbage as I don't know what I'm doing...

edit:

About the 50% factor: it is a very rough estimate, comparing the speedup (with respect to std::exp) obtained on a vector of single-precision numbers (great) to the speedup obtained on a vector of double-precision numbers (not so great).

Here is the code I used:

// gives the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements

    const auto I = v.size();

    const auto N = (I / 4) * 4;

    for (std::size_t n = 0; n < N; n += 4) {

        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) }; // aligned for _mm_load_ps

        __m128 x;
        x = _mm_load_ps(a);

        auto r = FastExpSse(x);

        _mm_store_ps(a, r);

        v[n]     = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }

    for (std::size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }

}
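A variant that keeps the double <-> float conversions in SIMD registers instead of going through scalar code (a sketch of the `_mm_cvtpd_ps` route; `FastExpSse` is the single-precision kernel quoted above, reproduced so the snippet stands alone):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

/* Single-precision kernel from above (Schraudolph). */
static inline __m128 FastExpSse(__m128 x)
{
    __m128 a = _mm_set1_ps(12102203.0f);                  /* (1 << 23) / ln(2) */
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_castsi128_ps(t);
}

// Process four doubles per call: narrow two __m128d to floats, pack the two
// half-filled results into one __m128, run the float kernel once, then
// split and widen back. All conversions stay in registers.
inline void FastExpSse4Doubles(double *p)
{
    __m128d lo = _mm_loadu_pd(p);
    __m128d hi = _mm_loadu_pd(p + 2);
    // _mm_cvtpd_ps leaves its two floats in the low half of the register.
    __m128 f = _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));
    f = FastExpSse(f);
    _mm_storeu_pd(p,     _mm_cvtps_pd(f));                    // low two lanes
    _mm_storeu_pd(p + 2, _mm_cvtps_pd(_mm_movehl_ps(f, f)));  // high two lanes
}
```

Whether this beats a native double-precision kernel depends on the shuffle/convert throughput of the target CPU.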

And here is what I would do if I had this "double precision" version:

void FastExpSseVectorDouble(std::vector<double> & v) {

    const auto I = v.size();

    const auto N = (I / 2) * 2;

    for (std::size_t n = 0; n < N; n += 2) {
        __m128d x;
        x = _mm_load_pd(&v[n]);
        auto r = FastExpSseDouble(x);

        _mm_store_pd(&v[n], r);
    }

    for (std::size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
ThreeStarProgrammer57
  • How did you measure that double->float and float->double is taking 50% of your CPU time? You aren't doing those with separate load/store/convert loops are you?? So did you use a profiler and find that `cvtpd2ps` + `cvtps2pd` instructions had 50% of the clock cycle event samples for `FastExpSse(__m128d)` with on-the-fly conversion? (Not that that would be efficient, though! Unlike `pack` / `unpack`, you'd get a vector of 2 `float`s so it would be pure overhead.) – Peter Cordes Jun 05 '18 at 10:59
  • Anyway, `127` is the exponent bias in https://en.wikipedia.org/wiki/Single-precision_floating-point_format, which uses an 8-bit exponent. I'm not sure what the `298765` is from. You should leave a comment on Nic's answer on the single-precision question with a link to this question, if you haven't already. (The guy who wrote the paper is an SO user :) – Peter Cordes Jun 05 '18 at 11:06
  • @PeterCordes thanks! About the 50% factor, please see my edit. – ThreeStarProgrammer57 Jun 05 '18 at 12:22
  • No wonder it's slow if you convert like *that*! gcc7.3 `-O3` https://godbolt.org/g/G1MGs9 does manage to vectorize the double->float and avoid actually going through `a[]` in memory for that step. But the float -> double step uses 4 separate scalar float->double conversion instructions. (It does manage to avoid bouncing through memory, though.) The obvious way would be to use `_mm_cvtpd_ps` (http://felixcloutier.com/x86/CVTPD2PS.html) twice and feed `FastExpSse` a vector with garbage in the upper 2 elements. Or convert x2 / `unpcklpd` / FastExpSse / convert / `unpckhpd` / convert. – Peter Cordes Jun 05 '18 at 12:53
  • Probably a working `FastExpSseDouble` would be even faster though, if `double_to_int64` isn't too slow without AVX512 packed double <-> 64-bit integer instructions. (BTW, the float version is using float-> `signed int`, epi32 not epu32. I think you need to handle both positive and negative numbers.) – Peter Cordes Jun 05 '18 at 12:57
  • @PeterCordes Oh, I see. And then, would the double precision version bring any speedup ? edit: looks like you anticipated my question :) – ThreeStarProgrammer57 Jun 05 '18 at 12:58
  • BTW, if you `_mm_load_pd(&v[n])`, gcc keeps reloading `v.data` because it doesn't know that `v[n]` doesn't alias the control block; unlike with your loop where type-based aliasing works for vectors of non-pointers. Getting `&v[0]` into a local solves the problem. https://godbolt.org/g/d3G9P5 does 2 values per iteration the simple way, but `cvtpd2ps` and the inverse each cost a shuffle + an FMA uop, and of course you only get half the work done per vector, so it's a serious bottleneck. But might be 2x as fast as your loop. – Peter Cordes Jun 05 '18 at 13:12
  • Just tested it (with fastexp from chtz's answer); unfortunately this changed almost nothing (maybe just slightly slower). I'm testing with MSVC though, I'll try it with icc 17. – ThreeStarProgrammer57 Jun 05 '18 at 13:22
  • Are you building for AVX2+FMA? If so, IDK if MSVC will automatically contract addpd(mulpd(x,y), z) into an FMA, so you might need to do that manually. (And you could do 256 bits at a time.) You are enabling optimization, right? – Peter Cordes Jun 05 '18 at 13:31
  • My bad, I actually was using icc 17 with full optimization. I am building for SSE2. – ThreeStarProgrammer57 Jun 05 '18 at 13:35

1 Answer


Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error -- I'm too lazy to do that:

__m128d fastexp(const __m128d &x)
{
    /* x/ln(2) plus a bias: places the exponent of 2^(x/ln 2) just below the exponent field */
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0)) ), _mm_set1_pd(3*1024.0-1.05));

    /* shift left by 11 to move those bits up into the exponent (and top mantissa) field */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
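To tune the 1.05 constant, a small sweep against `std::exp` does the job; this harness is my addition, not part of the answer:

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

/* fastexp as given in the answer. */
static inline __m128d fastexp(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}

// Sweep [lo, hi] and report the worst relative error vs. std::exp.
// Re-run while adjusting the correction constant to minimize the result.
inline double max_rel_error(double lo, double hi, int steps)
{
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        double x = lo + (hi - lo) * i / (steps - 1);
        double approx[2];
        _mm_storeu_pd(approx, fastexp(_mm_set1_pd(x)));
        double err = std::fabs(approx[0] - std::exp(x)) / std::exp(x);
        if (err > worst) worst = err;
    }
    return worst;
}
```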

This just gets about 2.5% relative precision -- for better precision you may need to add a second term.

Also, for values which overflow or underflow this will produce unspecified values; you can avoid that by clamping the scaled value to a safe range.
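A clamped variant could look like the following (my sketch; the bounds come from the observation that `scaled` must stay within [2048, 4095) for the shifted bits to form a finite, non-negative double -- below that range the result saturates to +0.0, above it to a huge finite value):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

// fastexp from above, with the scaled value clamped before the shift so that
// out-of-range inputs saturate instead of yielding garbage bit patterns.
// scaled == 2048.0 shifts to exactly +0.0; just below 4095.0 the shifted
// exponent field is 2046, the largest finite binade (~1.8e308).
static inline __m128d fastexp_clamped(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));   // underflow -> +0.0
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4094.99));  // overflow  -> huge finite
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
```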

chtz
  • Wow, this is great, thanks! Benchmarking this vs. the original FastExp function using a union, it's "only" 20% faster. Am I doing something wrong? – ThreeStarProgrammer57 Jun 05 '18 at 13:15
  • (I'm testing with MSVC though, I'll try it with icc 17) – ThreeStarProgrammer57 Jun 05 '18 at 13:23
  • You may get significantly faster code by unrolling your loop. `_mm_add_pd` and `_mm_mul_pd` have relatively high latency. But perhaps your CPU already hides this (via out-of-order execution). – chtz Jun 05 '18 at 13:32
  • I tried the version above with MSVC on godbolt, and it failed to compile-time compute `_mm_set1_pd(1.0/std::log(2.0))`. Replacing this by the hand-computed `_mm_set1_pd(1.4426950408889634)` gave much better code (one more magical number of course). And I guess I just don't know the correct options to give to MSVC. – chtz Jun 05 '18 at 13:54
  • I actually was using icc 17 with full optimization, my bad :( – ThreeStarProgrammer57 Jun 05 '18 at 13:56
  • For the record, I am feeding the functions with a vector of 4e6 double-precision elements, and I get these results: std::exp: 103 ms, FastExp: 67 ms, SSE FastExp (single prec): 127 ms, SSE FastExp (double prec): 59 ms – ThreeStarProgrammer57 Jun 05 '18 at 13:58
  • Generally, I don't think a low-precision `fastExpDouble` will have many use cases. If you don't care about precision, you should work with single precision as early and for as long as possible. And doing just a single operation on a `std::vector` may make memory bandwidth the limiting factor (try doing more operations without storing intermediate results to memory). – chtz Jun 05 '18 at 14:04
  • yeah, but I'm just doing a gaussian blur on a structure which stores doubles, so I can't switch to floats. Anyway, thanks a lot! – ThreeStarProgrammer57 Jun 05 '18 at 14:07
  • Doing a Gaussian Blur is much more computation than computing `exp`, which can be joined into this computation. But that would be a different question. – chtz Jun 05 '18 at 14:33
  • Yes, you're right; for each discrete point that I want to "blur", in each box, I'm computing a squared distance (dx²+dy²+dz²), two multiplications, and an addition (and the exponential, obviously). Maybe I could convert a few doubles to floats, do everything else in floats, and convert the sum back to double at the end. But that's maybe too much work and obfuscation for the benefit... – ThreeStarProgrammer57 Jun 05 '18 at 14:41
  • @ThreeStarProgrammer57: That would probably be worth converting two vectors of `double` and packing into one vector of `float`, unlike for just `exp`. The conversion results are packed into the low 2 `float` elements of a `__m128` (http://felixcloutier.com/x86/CVTPD2PS.html), so like I said you can shuffle them together with `shufpd`, `unpcklpd`, or `movlhps`. That would add 2x convert + 1x shuffle + 2x convert per 4 elements, but cut the rest of the work in half by having twice as many elements per vector. (Or better, if FastExpSSE can be done more efficiently with `float`) – Peter Cordes Jun 05 '18 at 18:55
  • @PeterCordes True, but the thing is, for doing a Gaussian blur you hardly need many `exp` calculations. You could just precalculate a 1D look-up table and use the `exp(a+b)==exp(a)*exp(b)` identity -- assuming `dx, dy, dz` are discrete. Or if the convolution kernel is relatively big, consider doing a 3D FastFourierTransform (but this is getting away from the original question). – chtz Jun 06 '18 at 07:14