
Here's the code I'm trying to convert: the double version of VDT's Padé exp fast_exp() approximation (here's the old repo resource):

inline double fast_exp(double initial_x){
    double x = initial_x;
    double px=details::fpfloor(details::LOG2E * x +0.5);

    const int32_t n = int32_t(px);

    x -= px * 6.93145751953125E-1;
    x -= px * 1.42860682030941723212E-6;

    const double xx = x * x;

    // px = x * P(x**2).
    px = details::PX1exp;
    px *= xx;
    px += details::PX2exp;
    px *= xx;
    px += details::PX3exp;
    px *= x;

    // Evaluate Q(x**2).
    double qx = details::QX1exp;
    qx *= xx;
    qx += details::QX2exp;
    qx *= xx;
    qx += details::QX3exp;
    qx *= xx;
    qx += details::QX4exp;

    // e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
    x = px / (qx - px);
    x = 1.0 + 2.0 * x;

    // Build 2^n in double.
    x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);

    if (initial_x > details::EXP_LIMIT)
      x = std::numeric_limits<double>::infinity();
    if (initial_x < -details::EXP_LIMIT)
      x = 0.;

    return x; 
}

I got this:

__m128d PExpSSE_dbl(__m128d x) {
    __m128d initial_x = x;

    __m128d half = _mm_set1_pd(0.5);
    __m128d one = _mm_set1_pd(1.0);
    __m128d log2e = _mm_set1_pd(1.4426950408889634073599);
    __m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
    __m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
    __m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
    __m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
    __m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
    __m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
    __m128d q4 = _mm_set1_pd(2.00000000000000000009E0);

    __m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
    __m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));  
    px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));

    __m128i n = _mm_cvtpd_epi64(px);

    x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
    x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
    __m128d xx = _mm_mul_pd(x, x);

    px = _mm_mul_pd(xx, p1);
    px = _mm_add_pd(px, p2);
    px = _mm_mul_pd(px, xx);
    px = _mm_add_pd(px, p3);
    px = _mm_mul_pd(px, x);

    __m128d qx = _mm_mul_pd(xx, q1);
    qx = _mm_add_pd(qx, q2);
    qx = _mm_mul_pd(xx, qx);
    qx = _mm_add_pd(qx, q3);
    qx = _mm_mul_pd(xx, qx);
    qx = _mm_add_pd(qx, q4);

    x = _mm_div_pd(px, _mm_sub_pd(qx, px));
    x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));

    n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
    n = _mm_slli_epi64(n, 52);

    // return?
}

But I'm not able to finish the last lines - i.e. this code:

    if (initial_x > details::EXP_LIMIT)
      x = std::numeric_limits<double>::infinity();
    if (initial_x < -details::EXP_LIMIT)
      x = 0.;

    return x; 

How would you convert this in SSE2?

Then of course I need to check the whole thing, since I'm not quite sure I've converted it correctly.

EDIT: I found the SSE conversion of float exp - i.e. from this:

/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);

if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;

return z;

to this:

n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);

return _mm_mul_ps(z, _mm_castsi128_ps(n));
  • The link points to the documentation of a moderately old release; the code is maintained in the vdt project: [here](https://github.com/dpiparo/vdt/blob/master/include/exp.h#L70) – pseyfert Jan 25 '19 at 13:03
  • @pseyfert: thanks! This specific code seems to be the same though – markzzz Jan 25 '19 at 13:12
  • Updated my answer; the first version was a quick post that I didn't take time to go into detail with. – Peter Cordes Jan 26 '19 at 19:38
  • 32-bit left-shift by 23 is obviously stuffing bits into the exponent field of a single-precision `float`, not `double`, i.e. multiplying a float by 2^n. – Peter Cordes Jan 28 '19 at 08:11
  • @PeterCordes yeah, basically I just showed how someone translated the float version to SIMD. The code is pretty similar between float and double, isn't it? – markzzz Jan 28 '19 at 08:19
  • Yes, of course it's similar. IEEE binary32 and binary64 are identical except for the field widths. https://en.wikipedia.org/wiki/Single-precision_floating-point_format vs. https://en.wikipedia.org/wiki/Double-precision_floating-point_format. I'm not sure if there's a new part of the question, or why you're quoting that. Is it just for the `_mm_mul_ps` / `_pd` line that you're missing, with the type-pun `_mm_cast` intrinsic? Or are you saying that someone's float version left out the range-check? That's very possible if they decided they didn't want the overhead of range-checking. – Peter Cordes Jan 28 '19 at 08:22
  • @PeterCordes: oh no :) I mean that the float version seems to translate the whole last part of the code in 1 line, while it seems you are suggesting lots of lines. Or am I getting it wrong? – markzzz Jan 28 '19 at 08:26
  • @PeterCordes: yes, it seems it ignores out-of-range inputs :) I would say I can ignore them as well... – markzzz Jan 28 '19 at 08:29
  • `_mm_mul_ps(z, _mm_castsi128_ps(n))` just implements the `*=` part of `z *= details::uint322sp((n + 0x7f) << 23);`. I thought that was obvious. Of course it would take several intrinsics to faithfully implement the range checking, especially if you don't have SSE4 for `blendv`. It would be easy and cheaper to implement it by forcing the result to NaN, though. But still not as cheap as nothing at all. – Peter Cordes Jan 28 '19 at 08:34
  • @PeterCordes in my case the values come from a lookup table, where I already ensure they won't be out of range (hopefully :P). – markzzz Jan 28 '19 at 08:47
  • I was trying to see whether you need double->double rounding to nearest integer, or if you only ever need it during the process of converting to integer (like [How to floor/int in double using only SSE2?](//stackoverflow.com/q/54406161)). I notice the result of `x = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));` is never used; you later do `x = _mmstuff(px, ... px)`. So that's weird. – Peter Cordes Jan 28 '19 at 23:20
  • @PeterCordes: that's the price of my inexperience in translating scalar code into vectorized code, sorry. There was an error: it's `x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1))); ..`. I've fixed the code. – markzzz Jan 29 '19 at 08:07

1 Answer

Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial, as long as there's enough surrounding work to hide the divpd throughput. (Recent x86 CPUs have pretty decent FP divide throughput. It's still bad vs. multiply, but it's only 1 uop, so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies, including in the surrounding code that uses exp.)


However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.

To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) (or truncation towards zero, or floor/ceil towards -/+Inf).

But we don't actually need that.

The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)

With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.

__m128i n  = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) );   // round to nearest
__m128d px = _mm_cvtepi32_pd( n );

Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)

The cvt[t]pd2dq and cvtdq2pd instructions behind those intrinsics are 2 uops each, and shuffle the 32-bit integers into the low 64 bits of the vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of the vector holding n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:

n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0));   // 64-bit integers
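
From there, the rest mirrors the scalar x *= details::uint642dp((n + 1023) << 52) step. A minimal sketch (assuming n and x are the vectors built above, and n stays in the normal exponent range):

n = _mm_add_epi64(n, _mm_set1_epi64x(1023));    // biased exponent of 2^n
n = _mm_slli_epi64(n, 52);                      // move it into the double's exponent field
x = _mm_mul_pd(x, _mm_castsi128_pd(n));         // x *= 2^n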

With just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. Packed conversion between double and int costs at least 2 uops each way on Intel CPUs, so a total of 4, but only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.

Or add a very large constant and subtract it again: large enough that the sum lands where the spacing between representable doubles is exactly 1.0, so normal FP rounding to nearest does the integer rounding for you. (That handles inputs that won't fit in 32 bits, but not doubles > 2^52; either limit is fine for exp's input range.) Also see How to efficiently perform double/int64 conversions with SSE/AVX?, which uses that trick; I couldn't find a ready-made SSE2-only example of it on SO, though.
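
A minimal sketch of that trick (untested; val is whatever you want rounded, e.g. log2e * x here, and it assumes the default round-to-nearest mode and inputs well inside int32 range):

const __m128d round_magic = _mm_set1_pd(6755399441055744.0);  // 2^52 + 2^51
__m128d biased  = _mm_add_pd(val, round_magic);      // FP rounding snaps the value to an integer
__m128d rounded = _mm_sub_pd(biased, round_magic);   // the rounded value, still as a double
// the low 32 bits of each lane of 'biased' also hold that integer as an int32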


> Then of course I need to check the whole thing, since I'm not quite sure I've converted it correctly.

Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. checking in a loop with

bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);

https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
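
A rough sketch of such a test loop (my own harness, not from the question; it assumes you've finished PExpSSE_dbl so it returns its __m128d result, that the scalar fast_exp is also visible, and it only checks finite inputs inside exp's double range):

#include <emmintrin.h>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

void scan_coarse_doubles() {
    for (uint64_t hi = 0; hi < (1ULL << 32); ++hi) {
        uint64_t bits = hi << 32;                  // low 32 mantissa bits stay zero
        double d;
        std::memcpy(&d, &bits, sizeof d);
        if (!std::isfinite(d) || std::fabs(d) > 700.0)
            continue;                              // skip NaN/Inf and out-of-range inputs
        double simd = _mm_cvtsd_f64(PExpSSE_dbl(_mm_set1_pd(d)));
        double ref  = fast_exp(d);                 // or std::exp(d) to measure absolute error
        if (std::fabs(simd - ref) > 1e-9 * std::fabs(ref))   // tolerance is a guess
            std::printf("mismatch at x=%a: simd=%a ref=%a\n", d, simd, ref);
    }
}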


For those last few lines, correcting the result for out-of-range inputs:

The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.

An alternate, cheaper range-check (maybe only useful for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern is a NaN.)

__m128d out_of_bounds = _mm_cmplt_pd( limit, abs_initial_x );  // abs_initial_x = initial_x with the sign bit masked off (see below)
result = _mm_or_pd(result, out_of_bounds);

In general, you can vectorize simple conditional setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements, so you can use it to blend.

e.g. in this case, cmppd the input and blendvpd the output if you have SSE4.1, or with just SSE2 use AND/ANDNOT/OR to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both; _pd is identical.

In asm it will look like this:

; result in xmm0  (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
;     xmm5 = limit
;     xmm6 = +Inf
cmpltpd  xmm2, xmm5    ; xmm2 = input_x < limit ? 0xffff... : 0
andpd    xmm0, xmm2    ; keep result where in range, else 0
andnpd   xmm2, xmm6    ; xmm2 =  0 or +Inf   (In that order because we used ANDN)
orpd     xmm0, xmm2    ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf

(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
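
With intrinsics, the same blend might look like this (a sketch with my own variable names; result, initial_x and limit are assumed to be __m128d values from the surrounding code, and INFINITY comes from <cmath>):

__m128d in_range = _mm_cmplt_pd(initial_x, limit);                   // all-ones where initial_x < limit
__m128d kept     = _mm_and_pd(result, in_range);                     // result where in range, else 0
__m128d fixed    = _mm_andnot_pd(in_range, _mm_set1_pd(INFINITY));   // +Inf where out of range
result = _mm_or_pd(kept, fixed);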


Optimizations for this special condition

Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.

+-Inf is represented by a bit-pattern with significand = 0 and exponent = all-ones. (A non-zero significand with that exponent represents a NaN.) Since the result for too-large inputs will presumably still have non-zero significand bits, we can't just AND the compare result with +Inf and OR that into the final result: the leftover significand bits would make it a NaN rather than Inf. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).

It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.

Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but otherwise identical: they're both just bitwise ops. I wrote ANDPD only so I didn't have to explain this in comments.


You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.

Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.

If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)

__m128d  fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x );  // fixup = (initial_x signbit) ? 0. : +Inf
 // see below for generating fixup with an SSE2 integer arithmetic-shift

const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff));  // ~ set1(-0.0)
__m128d  abs_init_x = _mm_and_pd( initial_x, signbit_mask );

__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));

// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);

Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
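
For example (a sketch of that last suggestion):

__m128d input_nan = _mm_cmpunord_pd(initial_x, initial_x);   // all-ones where initial_x is NaN
fixup = _mm_or_pd(fixup, input_nan);                         // an all-ones pattern is a NaN, so NaN in gives NaN out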

Especially on Skylake and AMD Bulldozer/Ryzen, where the non-VEX encoding of blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd, is 2 uops, having 3 inputs and a separate output.)

You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.


Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.

But you could create fixup with just one integer shift plus an ANDN, which folds the mask and the bitwise invert into one instruction:

__m128i ix     = _mm_castpd_si128(initial_x);
__m128i ifixup = _mm_srai_epi32(ix, 11);   // broadcast the sign bit over the 11 exponent bits (high dword of each lane)
// ANDN clears the other bits and inverts the selection in one go:
// fixup = +Inf bit-pattern for non-negative x, +0.0 for negative x
__m128d fixup = _mm_castsi128_pd(
                    _mm_andnot_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000ULL)));

Then blend fixup into result for out-of-range inputs as normal.
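
With only SSE2, that final blend can again be the AND/ANDNOT/OR pattern (a sketch, reusing the out_of_range mask computed above; the cmpgt there is plain SSE2, so it carries over):

result = _mm_or_pd(_mm_and_pd(out_of_range, fixup),
                   _mm_andnot_pd(out_of_range, result));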


Cheaply checking abs(initial_x) > details::EXP_LIMIT

If the exp algorithm were already squaring initial_x, you could compare against EXP_LIMIT squared. But it doesn't: xx = x*x only happens after some calculation to create x.


If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)

But both of your special cases need compares. Or do they?

If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
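
A sketch of that check (it needs FMA, which any AVX512-capable CPU has):

// negative where |initial_x| < EXP_LIMIT, non-negative (or NaN) otherwise
__m128d range_check = _mm_fmsub_pd(initial_x, initial_x,
                                   _mm_set1_pd(details::EXP_LIMIT * details::EXP_LIMIT));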

Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.

Peter Cordes
  • I feel like there should be something you can do with an integer compare on a bit pattern you already have at some point, to check the magnitude of `initial_x`. (The integer value of the bit-pattern increases monotonically as the `double` increases, except for sign/magnitude vs. two's complement, so `nextafter` can be done with an integer increment.) This is why the exponent field is biased. – Peter Cordes Jan 26 '19 at 21:45
  • Like maybe compare a right-shifted `ifixup` with a right-shifted `details::EXP_LIMIT`. But with `srai`/`and`, you're *not* stripping the sign bit until you also throw away all the bits other than the copies of the sign bit in the exponent field. – Peter Cordes Jan 26 '19 at 21:46
  • Please check the last edit I've made to this question: could it help? I'm not quite sure that's what you're suggesting, but before digesting the awesome reply you gave me, I'd like to be sure I can "quickly" do something like that :) – markzzz Jan 28 '19 at 08:03
  • YES! You were right: `_mm_cvtepi64_pd(_mm_cvttpd_epi64(px));` crashes. How can I fix it? Should I open a new question? – markzzz Jan 28 '19 at 11:27
  • @markzzz: Oh right, you're using MSVC which lets you use intrinsics for instruction sets you haven't enabled. (GCC wouldn't let it compile without `-march=skylake-avx512`). Anyway, yes, open a new question. I searched but didn't find an exact duplicate for how to truncate a `__m128d` to the next smallest-magnitude integer without SSE4.1 for `roundpd`. In your case only small-magnitude inputs are relevant, so adding a huge number and subtracting it again will work, like my answer says, so you can probably post an answer for the new question if you can get the right constant. – Peter Cordes Jan 28 '19 at 11:36
  • Considering the original double function will work in the range of +708/-708, probably _mm_cvtpd_epi32 is enough, without any kind of precision loss... – markzzz Jan 28 '19 at 18:13
  • I meant that the original code https://github.com/dpiparo/vdt/blob/master/include/exp.h#L75 also casts double to int32, so probably `_mm_cvtpd_epi32` is enough (since `_mm_cvtpd_epi64` doesn't exist in `SSE2`). Likewise, `_mm_cvtepi32_pd(_mm_cvttpd_epi32(px))` should be enough, right? – markzzz Jan 29 '19 at 09:13
  • @markzzz: yes, but you should probably be using the same conversion for both cases, not truncation for one and nearest for the other. Isn't the algorithm trying to split the integer and fractional parts? In the scalar version, it's casting to int (i.e. with truncation) on a value that's already an integer from `floor`, so it doesn't matter what rounding mode is used. So you should probably do `n = _mm_cvtpd_epi32(something);` (round to nearest) / `px = _mm_cvtepi32_pd(n)` (convert back). Thus a total of 2 conversions. – Peter Cordes Jan 29 '19 at 09:34
  • It would be nice to avoid the extra shuffling that converting to 32-bit int does, but I don't think we can. Unless maybe Mysticial's IEEE tricks for double->int are more efficient overall. But you do still need the fractional part as a double, too, so you either have to convert back or expand the tricky stuff. – Peter Cordes Jan 29 '19 at 09:35
  • Not sure what you mean. I think it uses the `floor(d + 0.5)` approach instead of `round()` because the latter depends on the system's rounding mode. Instead, with `floor(d + 0.5)` you ensure that the rounding will round `to-nearest` every time. Thus, `_mm_cvtepi32_pd(_mm_cvttpd_epi32(px))` is needed (and not only `_mm_cvtepi32_pd`). If you only do `_mm_cvtepi32_pd` you are not sure the rounding is `to-nearest` (because, as said, it depends on `fegetround()`). Am I wrong? – markzzz Jan 29 '19 at 10:24
  • @markzzz: you can assume that the rounding mode is nearest unless *you* changed it. Normal FP code always assumes the rounding mode is the default. Compiler optimizations and compile-time evaluation / constant-propagation assume this unless you use `#pragma STDC FENV_ACCESS ON` and/or compile with special options. To put it another way, nobody expects you can set the rounding mode and then call other functions and expect them to work. Equally importantly, truncation towards 0 is not the same as floor toward -Infinity, for negative numbers. `-1.3` floors to `-2`, but truncates to `-1`. – Peter Cordes Jan 29 '19 at 10:31
  • @markzzz: and most importantly of all, `px = floor(...);` `n = int32_t(px);` guarantees that `px` and `n` represent the same integer value, not off-by-1 from each other if they rounded differently. (Like if you truncated the `LOG2E * x +0.5` to get `n`, but floored it to get `px`.) **Therefore, if your replacement for `floor` involves converting to integer and back, you should keep that integer value and use it as `n`.** The only questions are what convert-to-int method you use, and what input do you feed it. `floor(foo + 0.5)` is a sloppy round-to-nearest, so use `_mm_cvtpd_epi32`, not tt. – Peter Cordes Jan 29 '19 at 10:35
  • So you suggest to simply convert `double px = details::fpfloor(details::LOG2E * x + 0.5);` to `__m128d px = _mm_cvtepi32_pd(_mm_cvttpd_epi32(_mm_mul_pd(log2e, x)));`? But this will give different results from the library, which uses `floor(x + 0.5)` at https://github.com/dpiparo/vdt/blob/master/include/exp.h#L73 . I get that your suggestion will round "better" using native rounding to nearest, but what I'm looking for is to translate the library to SIMD :) – markzzz Jan 29 '19 at 11:15
  • No, read carefully. `n = _mm_cvtpd_epi32(_mm_mul_pd(log2e, x)));` using nearest, not `_mm_cvttpd_epi32` truncation, so it *is* the same as what `floor(stuff + 0.5)` is doing, except without being wrong for a couple corner cases like `0.49999999999999994`. See [round() for float in C++](//stackoverflow.com/a/24348037). Anyway, then `px = _mm_cvtepi32_pd(n);` gives you that as packed-double again. **`(int)floor(val + 0.5)` is a common but bad and slightly buggy idiom for round-to-nearest.** I'm suggesting you emulate that without the bugs. – Peter Cordes Jan 29 '19 at 11:18
  • But I also need to calculate `px` as `floor(stuff + 0.5)` :) Note that this https://github.com/dpiparo/vdt/blob/master/include/exp.h#L77 uses `px`, which is calculated with `floor(x + 0.5)`. That's why I also need to calculate it: I can't ignore it!! And yeah, I know that `(int)floor(val + 0.5)` is a common but bad and slightly buggy idiom for round-to-nearest. – markzzz Jan 29 '19 at 11:35
  • @markzzz: Right, so you do `px = _mm_cvtepi32_pd(n)`, like I've said at least 3 times now. Rounding to int with the correct mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an `int`.) – Peter Cordes Jan 29 '19 at 11:59
  • :) But `n = _mm_cvtpd_epi32(_mm_mul_pd(log2e, x));` doesn't seem to me to be "rounding to int with the correct mode". Try with `value = 0.499999999999999944488848768742172978818416595458984375`: `floor(value + 0.5)` != `round(value)`. I'm going to have comparisons of scalar vs vector. I know that's a corner case, but as I said, I'd like to obtain exactly the same results :) Sorry for being so pedantic... – markzzz Jan 29 '19 at 12:51
  • @markzzz: `0.49999999999999994` is less than 0.5, so the correct round-to-nearest result is 0, which you get from `_mm_cvtpd_epi32`, right? I already linked [round() for float in C++](//stackoverflow.com/a/24348037) which explains that `floor(value + 0.5)` *is not the correctly rounded result* for that input. They're different because it's the `floor` emulation of `round()` that sucks, not the actual round-to-nearest version. (And BTW, `floor(val+0.5)` doesn't really emulate the away-from-zero tiebreak behaviour of `round()`, it tie-breaks towards +Inf I think.) – Peter Cordes Jan 29 '19 at 20:14
  • Of course ;) But the original Padé double function doesn't use round to nearest, so my conversion in SIMD code would output different results. I know it will be better, but I'm not after improving the results; I want to get the same values in SIMD code, given the same inputs... – markzzz Jan 29 '19 at 20:25
  • @markzzz: you really want to insist on being bug-compatible / bit-exact with the scalar version? Why? If you want to have both in your code, why not fix the scalar version and use `rint( val )` in it? – Peter Cordes Jan 29 '19 at 20:26
  • I don't want to "touch" the original code; I'm doing some benchmarks based on different libraries, and I need to calibrate performance against accuracy. I won't "fake" the benchmark by replacing code, that's all (also because I would use it as a library, etc.). So, for the range I can have in input, is doing `__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half); __m128d t = _mm_cvtepi32_pd(_mm_cvttpd_epi32(px)); px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));` equivalent, in your opinion? – markzzz Jan 30 '19 at 10:28
  • @markzzz: yeah, if you insist on emulating the stupid `floor` exactly without SSE4 `_mm_floor_pd`, then I think that emulation based on truncation looks right. And that's probably better than actually changing the MXCSR rounding mode. But for benchmarking, why not use SSE4.1 `_mm_floor_pd`? Or are you testing on ancient CPUs like first-gen Core 2? – Peter Cordes Jan 30 '19 at 10:37
  • @markzzz: Is that `cmplt_pd` supposed to correct for floor (-Inf) vs. trunc (0) for negative inputs? Why is it comparing `px` and `t`, instead of `px` against 0? I guess it needs to avoid changing the value if it was already a negative exact integer? Otherwise you could do the -1 / -0 fixup in integer, with `_mm_cmpgt_epi32`, before converting back to `_pd`. Then you already have your `__m128i n` instead of having to convert back to integer yet again. Plus, `_mm_cmpgt_epi32`'s all-ones / all-zeroes result *is* 0 or -1 so you can just add it. – Peter Cordes Jan 30 '19 at 10:42
  • Because I'll release an "audio plugin" which needs to work on "old" PCs too. In this field, SSE2 seems to be the right requirement. Anyway, the code doesn't seem to work. Or at least: it returns the same value for both positions within the m128d (check my updated answer with the last version I have). Maybe those `_mm_add_epi64 _mm_set1_epi64x` at the end? I've changed them to the epi32 versions, but it still doesn't work. Hmm... – markzzz Jan 30 '19 at 10:44
  • Yeah... debugging, it seems that `n` becomes 0 with `n = _mm_slli_epi32(n, 52);` – markzzz Jan 30 '19 at 10:50
  • @markzzz: right but for now you're just benchmarking vs. the library. When you actually release a plugin, surely you want to use round to nearest because it's faster and (presumably) more accurate. Although I guess you'd maybe want to see if the min/max error shifted any because of sometimes having different fraction bits for some inputs. – Peter Cordes Jan 30 '19 at 10:56
  • @markzzz: I updated this answer a while ago with the shuffle you need to turn a `_mm_cvt[t]pd_epi32` result into 64-bit integers lined up with your `double`s. And yes, SSE shifts saturate the shift count, so a 52-bit shift on 32-bit elements zeroes them. This shuffling to the bottom is why `cvtpd_epi32` is slower than `cvtps_epi32`: the convert instruction decodes to an extra shuffle uop. – Peter Cordes Jan 30 '19 at 10:59
  • Oh I see. I missed that, sorry. I need to learn what it actually does; I tried on the fly `n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0)); n = _mm_add_epi32(n, _mm_set1_epi32(1023)); n = _mm_slli_epi32(n, 52);` but it fails :) – markzzz Jan 30 '19 at 11:08
  • @markzzz: `epi32` is 32-bit element size. You want `epi64`. Like I already said, a 52-bit shift on 32-bit elements zeroes them. – Peter Cordes Jan 30 '19 at 11:11
  • Oh wait, keeping epi64 seems to work :) `n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0)); n = _mm_add_epi64(n, _mm_set1_epi64x(1023)); n = _mm_slli_epi64(n, 52);` – markzzz Jan 30 '19 at 11:11
  • @markzzz: yes, setting up for `_epi64` is the entire point of the shuffle, obviously. – Peter Cordes Jan 30 '19 at 11:12