
I am looking for a fast, low-precision (~1e-3) SSE exponential function.

I came across this great answer:

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

Based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
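For readers unfamiliar with the trick: the SSE code is a vectorized form of Schraudolph's scalar approximation, which builds the IEEE-754 bit pattern of the result directly in an integer. A scalar sketch (my reconstruction; note that `static_cast` truncates where `_mm_cvtps_epi32` rounds to nearest, which is close enough for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar form of Schraudolph's trick: exp(x) = 2^(x/ln 2), so scale x by
// (1 << 23) / ln(2) so the integer part lands in the float's exponent field,
// add the exponent bias 127 << 23, and subtract an empirical correction
// (298765) that balances the relative error. Reinterpreting the integer's
// bits as a float yields the approximation.
inline float fast_exp_scalar(float x)
{
    int32_t i = static_cast<int32_t>(12102203.0f * x)  // 12102203 ~= (1 << 23) / ln(2)
              + (127 * (1 << 23) - 298765);
    float f;
    std::memcpy(&f, &i, sizeof f);  // bit-cast int32 -> float
    return f;
}
```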

Now I would need a "double precision" version: __m128d FastExpSseDouble (__m128d x). This is because I don't control the input and output precision, which happen to be double precision, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU resources.

What changes would be needed?

I naively tried this:

__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

__m128d FastExpSseDouble(__m128d x) {

    #define S 52
    #define C (1llu << S) / log(2)

    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);

    auto y = double_to_uint64(_mm_mul_pd(a, x));

    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}
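For context, the `double_to_uint64` helper above is the standard magic-number conversion. In scalar form (names here are mine) it works like this, and it is only valid for inputs in [0, 2^52) -- which is one reason the naive FastExpSseDouble misbehaves for negative scaled values:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar form of the magic-number trick used above: adding 2^52 to a double
// in [0, 2^52) forces its exponent field to a known constant, so the numeric
// value ends up verbatim in the mantissa bits (fractional inputs are rounded
// to nearest by the FP addition); XORing away the constant exponent bits
// leaves the plain integer. Outside [0, 2^52) -- including all negative
// inputs -- the result is wrong.
inline uint64_t double_to_uint64_scalar(double x)
{
    double shifted = x + 4503599627370496.0;  // x + 2^52
    uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    uint64_t magic;
    double m = 4503599627370496.0;            // bit pattern 0x4330000000000000
    std::memcpy(&magic, &m, sizeof magic);
    return bits ^ magic;
}
```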

Of course this returns garbage as I don't know what I'm doing...

edit:

About the 50% factor: it is a very rough estimate, comparing the speedup (with respect to std::exp) obtained on a vector of single-precision numbers (great) to the speedup obtained on a vector of double-precision numbers (not so great).

Here is the code I used:

// gives the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements

    const auto I = v.size();

    const auto N = (I / 4) * 4;

    for (std::size_t n = 0; n < N; n += 4) {

        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) }; // aligned for _mm_load_ps

        __m128 x;
        x = _mm_load_ps(a);

        auto r = FastExpSse(x);

        _mm_store_ps(a, r);

        v[n]     = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }

    for (std::size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }

}
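A variant that keeps the double <-> float conversions in SIMD registers instead of going through scalar code (a sketch of the `_mm_cvtpd_ps` route; `FastExpSse` is the single-precision kernel quoted above, reproduced so the snippet stands alone):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

/* Single-precision kernel from above (Schraudolph). */
static inline __m128 FastExpSse(__m128 x)
{
    __m128 a = _mm_set1_ps(12102203.0f);                  /* (1 << 23) / ln(2) */
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_castsi128_ps(t);
}

// Process four doubles per call: narrow two __m128d to floats, pack the two
// half-filled results into one __m128, run the float kernel once, then
// split and widen back. All conversions stay in registers.
inline void FastExpSse4Doubles(double *p)
{
    __m128d lo = _mm_loadu_pd(p);
    __m128d hi = _mm_loadu_pd(p + 2);
    // _mm_cvtpd_ps leaves its two floats in the low half of the register.
    __m128 f = _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));
    f = FastExpSse(f);
    _mm_storeu_pd(p,     _mm_cvtps_pd(f));                    // low two lanes
    _mm_storeu_pd(p + 2, _mm_cvtps_pd(_mm_movehl_ps(f, f)));  // high two lanes
}
```

Whether this beats a native double-precision kernel depends on the shuffle/convert throughput of the target CPU.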

And here is what I would do if I had this "double precision" version:

void FastExpSseVectorDouble(std::vector<double> & v) {

    const auto I = v.size();

    const auto N = (I / 2) * 2;

    for (std::size_t n = 0; n < N; n += 2) {
        __m128d x;
        x = _mm_load_pd(&v[n]);
        auto r = FastExpSseDouble(x);

        _mm_store_pd(&v[n], r);
    }

    for (std::size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
ThreeStarProgrammer57
  • How did you measure that double->float and float->double is taking 50% of your CPU time? You aren't doing those with separate load/store/convert loops are you?? So did you use a profiler and find that `cvtpd2ps` + `cvtps2pd` instructions had 50% of the clock cycle event samples for `FastExpSse(__m128d)` with on-the-fly conversion? (Not that that would be efficient, though! Unlike `pack` / `unpack`, you'd get a vector of 2 `float`s so it would be pure overhead.) – Peter Cordes Jun 05 '18 at 10:59
  • Anyway, `127` is the exponent bias in https://en.wikipedia.org/wiki/Single-precision_floating-point_format, which uses an 8-bit exponent. I'm not sure what the `298765` is from. You should leave a comment on Nic's answer on the single-precision question with a link to this question, if you haven't already. (The guy who wrote the paper is an SO user :) – Peter Cordes Jun 05 '18 at 11:06
  • @PeterCordes thanks! About the 50% factor, please see my edit. – ThreeStarProgrammer57 Jun 05 '18 at 12:22
  • No wonder it's slow if you convert like *that*! gcc7.3 `-O3` https://godbolt.org/g/G1MGs9 does manage to vectorize the double->float and avoid actually going through `a[]` in memory for that step. But the float -> double step uses 4 separate scalar float->double conversion instructions. (It does manage to avoid bouncing through memory, though.) The obvious way would be to use `_mm_cvtpd_ps` (http://felixcloutier.com/x86/CVTPD2PS.html) twice and feed `FastExpSse` a vector with garbage in the upper 2 elements. Or convert x2 / `unpcklpd` / FastExpSse / convert / `unpckhpd` / convert. – Peter Cordes Jun 05 '18 at 12:53
  • Probably a working `FastExpSseDouble` would be even faster though, if `double_to_int64` isn't too slow without AVX512 packed double <-> 64-bit integer instructions. (BTW, the float version is using float-> `signed int`, epi32 not epu32. I think you need to handle both positive and negative numbers.) – Peter Cordes Jun 05 '18 at 12:57
  • @PeterCordes Oh, I see. And then, would the double precision version bring any speedup ? edit: looks like you anticipated my question :) – ThreeStarProgrammer57 Jun 05 '18 at 12:58
  • BTW, if you `_mm_load_pd(&v[n])`, gcc keeps reloading `v.data` because it doesn't know that `v[n]` doesn't alias the control block; unlike with your loop where type-based aliasing works for vectors of non-pointers. Getting `&v[0]` into a local solves the problem. https://godbolt.org/g/d3G9P5 does 2 values per iteration the simple way, but `cvtpd2ps` and the inverse each cost a shuffle + an FMA uop, and of course you only get half the work done per vector, so it's a serious bottleneck. But might be 2x as fast as your loop. – Peter Cordes Jun 05 '18 at 13:12
  • Just tested it (with fastexp from chtz's answer); unfortunately this changed almost nothing (maybe just slightly slower). I'm testing with MSVC though, I'll try it with icc 17. – ThreeStarProgrammer57 Jun 05 '18 at 13:22
  • Are you building for AVX2+FMA? If so, IDK if MSVC will automatically contract addpd(mulpd(x,y), z) into an FMA, so you might need to do that manually. (And you could do 256 bits at a time.) You are enabling optimization, right? – Peter Cordes Jun 05 '18 at 13:31
  • My bad, I actually was using icc 17 with full optimization. I am building for SSE2. – ThreeStarProgrammer57 Jun 05 '18 at 13:35

1 Answer


Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error -- I'm too lazy to do that:

__m128d fastexp(const __m128d &x)
{
    /* x/ln(2) plus a bias: places the exponent of 2^(x/ln 2) just below the exponent field */
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0)) ), _mm_set1_pd(3*1024.0-1.05));

    /* shift left by 11 to move those bits up into the exponent (and top mantissa) field */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
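To tune the 1.05 constant, a small sweep against `std::exp` does the job; this harness is my addition, not part of the answer:

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

/* fastexp as given in the answer. */
static inline __m128d fastexp(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}

// Sweep [lo, hi] and report the worst relative error vs. std::exp.
// Re-run while adjusting the correction constant to minimize the result.
inline double max_rel_error(double lo, double hi, int steps)
{
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        double x = lo + (hi - lo) * i / (steps - 1);
        double approx[2];
        _mm_storeu_pd(approx, fastexp(_mm_set1_pd(x)));
        double err = std::fabs(approx[0] - std::exp(x)) / std::exp(x);
        if (err > worst) worst = err;
    }
    return worst;
}
```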

This just gets about 2.5% relative precision -- for better precision you may need to add a second term.

Also, for values which overflow or underflow this will produce unspecified values; you can avoid that by clamping the scaled value to a safe range.
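A clamped variant could look like the following (my sketch; the bounds come from the observation that `scaled` must stay within [2048, 4095) for the shifted bits to form a finite, non-negative double -- below that range the result saturates to +0.0, above it to a huge finite value):

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cmath>

// fastexp from above, with the scaled value clamped before the shift so that
// out-of-range inputs saturate instead of yielding garbage bit patterns.
// scaled == 2048.0 shifts to exactly +0.0; just below 4095.0 the shifted
// exponent field is 2046, the largest finite binade (~1.8e308).
static inline __m128d fastexp_clamped(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
                                _mm_set1_pd(3 * 1024.0 - 1.05));
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));   // underflow -> +0.0
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4094.99));  // overflow  -> huge finite
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
```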

chtz
  • Wow, this is great, thanks! Benchmarking this vs. the original FastExp function using a union, it's "only" 20% faster. Am I doing something wrong? – ThreeStarProgrammer57 Jun 05 '18 at 13:15
  • (I'm testing with MSVC though, I'll try it with icc 17) – ThreeStarProgrammer57 Jun 05 '18 at 13:23
  • You may get significantly faster code by unrolling your loop. `_mm_add_pd` and `_mm_mul_pd` have relatively high latency. But perhaps your CPU already hides this (via out-of-order execution). – chtz Jun 05 '18 at 13:32
  • I tried the version above with MSVC on godbolt, and it failed to compile-time compute `_mm_set1_pd(1.0/std::log(2.0))`. Replacing this by the hand-computed `_mm_set1_pd(1.4426950408889634)` gave much better code (one more magical number of course). And I guess I just don't know the correct options to give to MSVC. – chtz Jun 05 '18 at 13:54
  • I actually was using icc 17 with full optimization, my bad :( – ThreeStarProgrammer57 Jun 05 '18 at 13:56
  • For the record, I am feeding the functions with a vector of 4e6 double-precision elements, and I get these results: std::exp: 103 ms, FastExp: 67 ms, SSE FastExp (single prec): 127 ms, SSE FastExp (double prec): 59 ms – ThreeStarProgrammer57 Jun 05 '18 at 13:58
  • Generally, I don't think a low-precision `fastExpDouble` will have many use cases. If you don't care about precision, you should work with single precision as early and for as long as possible. And doing just a single operation on a `std::vector` may make memory bandwidth the limiting factor (try doing more operations without storing intermediate results to memory). – chtz Jun 05 '18 at 14:04
  • yeah, but I'm just doing a gaussian blur on a structure which stores doubles, so I can't switch to floats. Anyway, thanks a lot! – ThreeStarProgrammer57 Jun 05 '18 at 14:07
  • Doing a Gaussian Blur is much more computation than computing `exp`, which can be joined into this computation. But that would be a different question. – chtz Jun 05 '18 at 14:33
  • Yes, you're right; for each discrete point that I want to "blur", in each box, I'm computing a squared distance (dx²+dy²+dz²), two multiplications, and an addition (and the exponential, obviously). Maybe I could convert a few doubles to floats, do everything else in floats, and convert the sum back to double at the end. But that's maybe too much work and obfuscation for the benefit... – ThreeStarProgrammer57 Jun 05 '18 at 14:41
  • @ThreeStarProgrammer57: That would probably be worth converting two vectors of `double` and packing into one vector of `float`, unlike for just `exp`. The conversion results are packed into the low 2 `float` elements of a `__m128` (http://felixcloutier.com/x86/CVTPD2PS.html), so like I said you can shuffle them together with `shufpd`, `unpcklpd`, or `movlhps`. That would add 2x convert + 1x shuffle + 2x convert per 4 elements, but cut the rest of the work in half by having twice as many elements per vector. (Or better, if FastExpSSE can be done more efficiently with `float`) – Peter Cordes Jun 05 '18 at 18:55
  • @PeterCordes True, but the thing is, for doing a Gaussian blur you hardly need many `exp` calculations. You could just precalculate a 1D look-up table and use the `exp(a+b)==exp(a)*exp(b)` identity -- assuming `dx, dy, dz` are discrete. Or if the convolution kernel is relatively big, consider doing a 3D FastFourierTransform (but this is getting away from the original question). – chtz Jun 06 '18 at 07:14