27

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.

  • _mm_cvtps_epi32()
  • _mm_cvtepi32_ps()

But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:

  • _mm_cvtpd_epi64()
  • _mm_cvtepi64_pd()

It seems that AVX doesn't have them either.

What is the most efficient way to simulate these intrinsics?

plasmacel
  • Which compiler are you asking about? Because Clang seems to have them: http://clang.llvm.org/doxygen/avx512vldqintrin_8h.html#a5c8e4d8e91c4ff640a1908e3ee7b89a0 – John Zwinck Dec 14 '16 at 14:13
  • 4
    @JohnZwinck assuming AVX512 support is perhaps a bit premature at this point – harold Dec 14 '16 at 14:14
  • @JohnZwinck No, it's not a Clang-specific extension, but `AVX512`. The question is about implementing such functions with `SSE2`. – plasmacel Dec 14 '16 at 14:15
  • 1
    It's probably just cost/benefit - since most modern CPUs have two FP ALUs there isn't much to be gained from using 128 bit SIMD for operations on 2 x doubles, and the cost of adding 64 bit operations is likely to be relatively high (cost = silicon area). It probably makes more sense for 512 bit SIMD, where you have 8 x 64 bit elements, which may be why you see such instructions in AVX512. – Paul R Dec 14 '16 at 14:39
  • @PaulR That's true, but with AVX, operating on 4 doubles, it would be worthwhile. I asked for an SSE2 implementation because the AVX variant would be straightforward from that. – plasmacel Dec 14 '16 at 14:44
  • 2
    @plasmacel: yes, unfortunately though AVX/AVX2 is really little more than two SSE units bolted together with a little additional glue and some elastic bands. AVX512 is a re-design, so it doesn't inherit a lot of the limitations of SSE/AVX. – Paul R Dec 14 '16 at 14:46
  • Does it have to work in 32bit code too? – harold Dec 14 '16 at 15:33
  • @harold That would be an advantage, but not necessarily. – plasmacel Dec 14 '16 at 15:37
  • 3
    AFAIK the most efficient implementation would be using scalar [CVTSD2SI r64, xmm](http://www.felixcloutier.com/x86/CVTSD2SI.html), with shuffles to get each element into the low 64. There is no hardware support for packed int64_t to/from float or double. Interestingly, x87 has always supported 64-bit integers with FIST, and that's what gcc uses with `-m32` even with `-mfpmath=sse` when it means copying a value from an XMM register into ST0 (via memory). – Peter Cordes Dec 14 '16 at 17:06
  • @PeterCordes And the `int64_t`/`double` direction? Btw 4x`VCVTSD2SI` with 4x`VSHUFPD` on an AVX vector of 4 doubles would take ~24 cycles of latency. That's surprisingly slow for a simple type conversion - I understand there is no better way, just saying. – plasmacel Dec 14 '16 at 17:16
  • @PeterCordes Never mind, I realized that the inverse direction would be `cvtsi2sd`. – plasmacel Dec 14 '16 at 17:22
  • 1
    @plasmacel: if you're latency-bound, you're doing it wrong. Unpack the original 3 different ways, so all 4 conversions can run in parallel, up to the limit of hardware throughput. (requires AVX2 for VPERMPD to get the top element of the upper lane in one shuffle. Otherwise get it from the VEXTRACTF128 that brings the low element of the upper lane down to the bottom of an XMM). On the way back from scalar to vector, probably two pairs of MOVQ / PINSRQ and then a VINSERTI128. Of course, all this only applies if you can't use Mysticial's FP bit-pattern trick! – Peter Cordes Dec 14 '16 at 18:03
  • 2
    @PeterCordes Back in like 2007-ish, I had a performance issue that stemmed from double -> int64 conversions taking >100 cycles on x86 due to a library call. After digging around, I randomly came across a primitive version of this trick in the Glucas source code. Once I understood how it worked, I realized it could be generalized to a lot of other things. My initial versions of the trick took 3-4 instructions in SSE and multiple constants. But over time, I got them down to the way it is now. Two instructions + 1 constant for both directions and for both signed and unsigned. – Mysticial Dec 14 '16 at 18:31
  • @PeterCordes The reciprocal throughput of `VCVTSD2SI` on Haswell is 0.8, while the latency is 5. The overall estimated latency of ~24 cycles with the 4 shuffles seems reasonable to me. Am I missing something? – plasmacel Dec 14 '16 at 18:36
  • 1
    Peter is saying that, while the instruction does have a high latency, *latency is not your bottleneck*. That's what is meant by not being "latency-bound". – Cody Gray - on strike Dec 14 '16 at 18:53
  • 1
    @plasmacel: I think on Haswell you can start a VCVTSD2SI in cycles 1, 2, 4, and 5 (for an input vector ready on cycle 1). Not on cycle 3 because VEXTRACTF128 starting on cycle 1 only has a result ready for a convert starting on cycle 4. – Peter Cordes Dec 14 '16 at 18:55
  • 2
    The last of those conversions finishes on cycle 10. Two VMOVQs and a VPINSRQ should already be done or in-flight at that point, so the latency to an integer vector being ready is just the final VPINSRQ (2 cycles) + VINSERTI128 (3 cycles), so you can have an int64 vector ready on cycle 15, assuming no resource-conflicts delay the critical path. And yes, what @Cody said is exactly what I meant. – Peter Cordes Dec 14 '16 at 18:59
  • I've been reading all this avidly, because my profiler was still pointing to my double2int64 routine, which is `long long _inline double2int(const double d) { return _mm_cvtsd_si64(*(__m128d*)&d); }` and compiles to `movsd mmword ptr [rbp-28h],xmm6` / `cvtsd2si rdx,mmword ptr [rbp-28h]`. It is SSE2, see https://learn.microsoft.com/en-us/previous-versions/bb514009(v=vs.120). But perhaps it is not as efficient as the solutions given here. Who knows? – Jan Nov 04 '19 at 23:52

3 Answers

41

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (It also added conversions to/from 32-bit unsigned.) See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

If you only have AVX2 or less, you'll need tricks like the ones below for packed conversions. (For scalar, x86-64 has scalar int64_t <-> double or float since SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
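
As a point of reference before the tricks below, the exact (but slower) fallback is to go through those scalar instructions and reassemble the vector. A minimal x86-64/SSE2 sketch for a truncating packed double -> int64_t (an illustration added here, not part of the original answer; the function name is made up):

#include <emmintrin.h>

//  Baseline sketch: emulate a truncating packed double -> int64_t conversion
//  with two scalar cvttsd2si instructions. Exact wherever the scalar instruction
//  is (out-of-range inputs give INT64_MIN, the "integer indefinite" value).
//  Requires x86-64.
__m128i double_to_int64_scalar(__m128d x){
    long long lo = _mm_cvttsd_si64(x);                      //  truncate element 0
    long long hi = _mm_cvttsd_si64(_mm_unpackhi_pd(x, x));  //  truncate element 1
    return _mm_set_epi64x(hi, lo);
}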


If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

  • If you don't care about infinity or NaN.
  • For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
  • For double <-> uint64_t, you only care about values in the range [0, 2^52).

double -> uint64_t

//  Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

double -> int64_t

//  Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

uint64_t -> double

//  Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}

int64_t -> double

//  Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}
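
A quick sanity check of the range-limited versions above (an illustrative test added here, not part of the original answer; compile it together with the four functions):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void){
    //  double -> int64_t: rounds to nearest under the default rounding mode.
    int64_t out[2];
    _mm_storeu_si128((__m128i*)out, double_to_int64(_mm_set_pd(-3.75, 41.25)));
    printf("%lld %lld\n", (long long)out[0], (long long)out[1]);   //  prints: 41 -4

    //  int64_t -> double round trip.
    double back[2];
    _mm_storeu_pd(back, int64_to_double(_mm_set_epi64x(-4, 41)));
    printf("%.1f %.1f\n", back[0], back[1]);                       //  prints: 41.0 -4.0
    return 0;
}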

Rounding Behavior:

  • For the double -> uint64_t conversion, rounding works correctly following the current rounding mode. (which is usually round-to-nearest-even)
  • For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

How does it work?

Despite this trick being only 2 instructions, it's not entirely self-explanatory.

The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

Now all that's left is to remove the upper 12 bits. This is easily done by masking them out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has higher throughput.

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.
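
To make the bit-level mechanics concrete, here is a scalar walk-through of the unsigned direction (an illustration added for clarity, not part of the original answer; the signed case stays as the exercise above):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

//  Scalar illustration of the unsigned trick; inputs must be in [0, 2^52).
//  0x4330000000000000 is the bit pattern of the double 2^52 (the magic number M).
uint64_t d2u(double x){
    double shifted = x + 4503599627370496.0;    //  + 2^52: the integer now sits in the mantissa, fraction rounded away
    uint64_t bits;
    memcpy(&bits, &shifted, sizeof bits);       //  reinterpret the double as raw bits
    return bits ^ 0x4330000000000000ULL;        //  strip the exponent/sign bits of M
}

double u2d(uint64_t x){
    uint64_t bits = x | 0x4330000000000000ULL;  //  bit pattern of the double 2^52 + x
    double d;
    memcpy(&d, &bits, sizeof d);
    return d - 4503599627370496.0;              //  un-normalize: subtract 2^52
}

int main(void){
    printf("%llu %.1f\n", (unsigned long long)d2u(1234.56), u2d(1234));  //  prints: 1235 1234.0
    return 0;
}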

Related: A fast method to round a double to a 32-bit int explained


Full Range int64 -> double:

After many years, I finally had a need for this.

  • 5 instructions for uint64_t -> double
  • 6 instructions for int64_t -> double

uint64_t -> double

__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

int64_t -> double

__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

These work for the entire 64-bit range and are correctly rounded according to the current rounding mode.

These are similar to wim's answer below - but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.

Mysticial
  • This is impressive. Can you elaborate a bit on why the range is only (-2^51, 2^51) and [0, 2^52)? As far as I know the biggest consecutive integer a double can store without losing precision is 2^53. – plasmacel Dec 14 '16 at 17:28
  • 3
    The unsigned case is easier to understand, so I'll start with that. Double-precision values in the range `[2^52, 2^53)` have the "binary place" exactly lined up below the lowest bit of the mantissa. So if you mask out the upper bits, you get exactly the integer representation. The idea with adding `2^52` is to force the value into that range. Hence why it only works when the number is between `[0, 2^52)`. – Mysticial Dec 14 '16 at 17:35
  • 2
    The signed case is very similar. Again, you normalize the number into the range `[2^52, 2^53)`. But you adjust the magic constant so that it handles a different input range. Again the range of numbers you can handle is still only `2^52`. But this time, it's split up across positive/negative, hence `(-2^51, 2^51)`. – Mysticial Dec 14 '16 at 17:38
  • 2
    TBH, it almost makes me sad that AVX512 has the `double <-> int64` conversions. Because the 2-instruction work-around that I've been using for so many years is too awesome to let go. That said, I don't consider this trick dead with AVX512. Because of the flexibility of the magic constant, this approach generalizes to more than just simple conversions. And the exposed fp-add for `double -> int` can be fused with any preceding multiplies. – Mysticial Dec 14 '16 at 17:54
  • I think it's good to have a standard full-range conversion, and a reduced-range but probably faster `double`/`int64` conversion. Together they give a good balance between safety and performance. – plasmacel Dec 14 '16 at 18:04
  • On the other hand, AVX512's `vcvtpd2qq` (`double`/`int64` conversion), just like SSE2's `cvtpd2dq`, also performs rounding with the conversion, following the current rounding direction. So this and your workaround cover different situations. – plasmacel Dec 14 '16 at 18:27
  • 2
    @plasmacel The `double -> int64` conversions here in my answer follow the current rounding direction. The normalization step (add by constant) pushes all the fractional bits out of the mantissa which are rounded away in the current direction. – Mysticial Dec 14 '16 at 18:35
  • I didn't realize that, but it's great. Maybe mention it at the end of the answer for future users. – plasmacel Dec 14 '16 at 18:42
  • 2
    @Mysticial I think it would make sense to add a remark that the "current rounding mode" would normally be "round-to-nearest-or-even", so that this "conversion by addition of magic constant" does not normally result in the floating-point to integer conversion result prescribed by C and C++ (which specifies truncation). – njuffa Dec 15 '16 at 01:11
  • @Mysticial Your code is also suitable as a building block for full 64-bit-range integer to double conversion (without cutting corners), see [my answer](http://stackoverflow.com/a/41223013/2439725) – wim Dec 19 '16 at 14:52
  • @Mysticial: Great update! It turns out that accurate `int64_t -> double` conversion is also possible with only 5 instructions, see my update. – wim Jun 14 '19 at 22:57
  • @wim Nice work! So there *was* a way to get it lower! – Mysticial Jun 15 '19 at 04:00
  • @Mysticial - any way you are aware of to modify it you actually _want_ truncation? – BeeOnRope Jul 31 '20 at 03:33
  • @BeeOnRope Not without changing the rounding mode or adding `0.5` - which is pretty sketchy with respect to corner cases. – Mysticial Jul 31 '20 at 04:03
  • @Mysticial - thanks. What's the worst that can happen with adding 0.5? I generally expect my results to be in the range -1000,1000 and I guess mis-rounding at the breakpoint value is fine (i.e., I don't especially care about 500.0 ending up as 499, as integers like 500.0 are not special in my application). – BeeOnRope Jul 31 '20 at 04:06
  • Changing the rounding mode seems hard since it's in the middle of a bunch of other calcs where I want the default rounding mode. You can also mostly ignore these comments since it seems like `vcvttpd2dq` is going to do the trick (but the latency is yuck, I guess due to lane crossing). – BeeOnRope Jul 31 '20 at 04:08
  • 1
    Adding 0.5 will have double rounding. Also, I think some of the tie-breaking cases might be incorrect if you expect correct rounding. – Mysticial Jul 31 '20 at 08:09
  • @Mysticial - in this case I'm fine with many ULPs of error. I'm actually just trying to avoid 7 (!!) cycles of latency for the truncating version of the `cvt` instructions, for `double` -> `uint32_t` conversion, but I think that's not going to fly because adding the 0.5 needs to be separate from the `por` thing (I think), so in the end I'm going to have 4 (add) + 4 (add) + 1 (xor) latency anyway, which is obv worse... Great answer tho in any case :). – BeeOnRope Aug 01 '20 at 23:48
19

This answer is about 64-bit integer to double conversion, without cutting corners. In a previous version of this answer (see paragraph Fast and accurate conversion by splitting ...., below), it was shown that it is quite efficient to split the 64-bit integers into a 32-bit low and a 32-bit high part, convert these parts to double, and compute low + high * 2^32.

The instruction counts of these conversions were:

  • int64_to_double_full_range 9 instructions (with mul and add as one fma)
  • uint64_to_double_full_range 7 instructions (with mul and add as one fma)

Inspired by Mysticial's updated answer, with better optimized accurate conversions, I further optimized the int64_t to double conversion:

  • int64_to_double_fast_precise: 5 instructions.
  • uint64_to_double_fast_precise: 5 instructions.

The int64_to_double_fast_precise conversion takes one instruction less than Mysticial's solution. The uint64_to_double_fast_precise code is essentially identical to Mysticial's solution (but with a vpblendd instead of a vpblendw). It is included here because of its similarities with the int64_to_double_fast_precise conversion: the instructions are identical, only the constants differ:


#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>

__m256d int64_to_double_fast_precise(const __m256i v)
/* Optimized full range int64_t to double conversion           */
/* Emulate _mm256_cvtepi64_pd()                                */
{
    __m256i magic_i_lo   = _mm256_set1_epi64x(0x4330000000000000);                /* 2^52               encoded as floating-point  */
    __m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000080000000);                /* 2^84 + 2^63        encoded as floating-point  */
    __m256i magic_i_all  = _mm256_set1_epi64x(0x4530000080100000);                /* 2^84 + 2^63 + 2^52 encoded as floating-point  */
    __m256d magic_d_all  = _mm256_castsi256_pd(magic_i_all);

    __m256i v_lo         = _mm256_blend_epi32(magic_i_lo, v, 0b01010101);         /* Blend the 32 lowest significant bits of v with magic_int_lo                                                   */
    __m256i v_hi         = _mm256_srli_epi64(v, 32);                              /* Extract the 32 most significant bits of v                                                                     */
            v_hi         = _mm256_xor_si256(v_hi, magic_i_hi32);                  /* Flip the msb of v_hi and blend with 0x45300000                                                                */
    __m256d v_hi_dbl     = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision:                                                                                  */
    __m256d result       = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo));    /* (v_hi - magic_d_all) + v_lo  Do not assume associativity of floating point addition !!                        */
            return result;                                                        /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
                                                                                  /* With icc use -fp-model precise                                                                                */
}


__m256d uint64_to_double_fast_precise(const __m256i v)                    
/* Optimized full range uint64_t to double conversion          */
/* This code is essentially identical to Mysticial's solution. */
/* Emulate _mm256_cvtepu64_pd()                                */
{
    __m256i magic_i_lo   = _mm256_set1_epi64x(0x4330000000000000);                /* 2^52        encoded as floating-point  */
    __m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000000000000);                /* 2^84        encoded as floating-point  */
    __m256i magic_i_all  = _mm256_set1_epi64x(0x4530000000100000);                /* 2^84 + 2^52 encoded as floating-point  */
    __m256d magic_d_all  = _mm256_castsi256_pd(magic_i_all);

    __m256i v_lo         = _mm256_blend_epi32(magic_i_lo, v, 0b01010101);         /* Blend the 32 lowest significant bits of v with magic_int_lo                                                   */
    __m256i v_hi         = _mm256_srli_epi64(v, 32);                              /* Extract the 32 most significant bits of v                                                                     */
            v_hi         = _mm256_xor_si256(v_hi, magic_i_hi32);                  /* Blend v_hi with 0x45300000                                                                                    */
    __m256d v_hi_dbl     = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision:                                                                                  */
    __m256d result       = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo));    /* (v_hi - magic_d_all) + v_lo  Do not assume associativity of floating point addition !!                        */
            return result;                                                        /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
                                                                                  /* With icc use -fp-model precise                                                                                */
}


int main(){
    int i;
    uint64_t j;
    __m256i j_4;
    __m256d v;
    double x[4];
    double x0, x1, a0, a1;

    j = 0ull;
    printf("\nAccurate int64_to_double\n");
    for (i = 0; i < 260; i++){
        j_4= _mm256_set_epi64x(0, 0, -j, j);

        v  = int64_to_double_fast_precise(j_4);
        _mm256_storeu_pd(x,v);
        x0 = x[0];
        x1 = x[1];
        a0 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),j));
        a1 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),-j));
        printf(" j =%21li   v =%23.1f   v=%23.1f   -v=%23.1f   -v=%23.1f   d=%.1f   d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
        j  = j+(j>>2)-(j>>5)+1ull;
    }
    
    j = 0ull;
    printf("\nAccurate uint64_to_double\n");
    for (i = 0; i < 260; i++){
        if (i==258){j=-1;}
        if (i==259){j=-2;}
        j_4= _mm256_set_epi64x(0, 0, -j, j);

        v  = uint64_to_double_fast_precise(j_4);
        _mm256_storeu_pd(x,v);
        x0 = x[0];
        x1 = x[1];
        a0 = (double)((uint64_t)j);
        a1 = (double)((uint64_t)-j);
        printf(" j =%21li   v =%23.1f   v=%23.1f   -v=%23.1f   -v=%23.1f   d=%.1f   d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
        j  = j+(j>>2)-(j>>5)+1ull;
    }
    return 0;
}

The conversions may fail if unsafe math optimization options are enabled. With gcc, -O3 is safe, but -Ofast may lead to wrong results, because we may not assume associativity of floating point addition here (the same holds for Mysticial's conversions). With icc use -fp-model precise.
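
For example (illustrative command lines; the file name is a placeholder):

gcc -O3 -mavx2 -mfma conversions.c        #  safe: -fassociative-math stays off at -O3
gcc -Ofast -mavx2 -mfma conversions.c     #  unsafe here: -Ofast enables -fassociative-math
icc -O3 -fp-model precise conversions.c   #  keeps icc from reassociating the FP additions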



Fast and accurate conversion by splitting the 64-bit integers into a 32-bit low and a 32-bit high part.

We assume that both the integer input and the double output are in 256 bit wide AVX registers. Two approaches are considered:

  1. int64_to_double_based_on_cvtsi2sd(): as suggested in the comments on the question, use cvtsi2sd 4 times together with some data shuffling. Unfortunately both cvtsi2sd and the data shuffling instructions need execution port 5. This limits the performance of this approach.

  2. int64_to_double_full_range(): we can use Mysticial's fast conversion method twice in order to get an accurate conversion for the full 64-bit integer range. The 64-bit integer is split into a 32-bit low and a 32-bit high part, similarly to the answers to this question: How to perform uint32/float conversion with SSE? . Each of these pieces is suitable for Mysticial's integer to double conversion. Finally the high part is multiplied by 2^32 and added to the low part. The signed conversion is a little bit more complicated than the unsigned conversion (uint64_to_double_full_range()), because srai_epi64() doesn't exist.

Code:

#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>

/* 
gcc -O3 -Wall -m64 -mfma -mavx2 -march=broadwell cvt_int_64_double.c
./a.out A
time ./a.out B
time ./a.out C
etc.
*/


inline __m256d uint64_to_double256(__m256i x){                  /*  Mysticial's fast uint64_to_double. Works for inputs in the range: [0, 2^52)     */
    x = _mm256_or_si256(x, _mm256_castpd_si256(_mm256_set1_pd(0x0010000000000000)));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0010000000000000));
}

inline __m256d int64_to_double256(__m256i x){                   /*  Mysticial's fast int64_to_double. Works for inputs in the range: (-2^51, 2^51)  */
    x = _mm256_add_epi64(x, _mm256_castpd_si256(_mm256_set1_pd(0x0018000000000000)));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0018000000000000));
}


__m256d int64_to_double_full_range(const __m256i v)
{
    __m256i msk_lo       =_mm256_set1_epi64x(0xFFFFFFFF);
    __m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0);                 /* 2^32                                                                    */

    __m256i v_lo         = _mm256_and_si256(v,msk_lo);                  /* extract the 32 lowest significant bits of v                             */
    __m256i v_hi         = _mm256_srli_epi64(v,32);                     /* 32 most significant bits of v. srai_epi64 doesn't exist                 */
    __m256i v_sign       = _mm256_srai_epi32(v,32);                     /* broadcast sign bit to the 32 most significant bits                      */
            v_hi         = _mm256_blend_epi32(v_hi,v_sign,0b10101010);  /* restore the correct sign of v_hi                                        */
    __m256d v_lo_dbl     = int64_to_double256(v_lo);                    /* v_lo is within specified range of int64_to_double                       */ 
    __m256d v_hi_dbl     = int64_to_double256(v_hi);                    /* v_hi is within specified range of int64_to_double                       */ 
            v_hi_dbl     = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl);        /* _mm256_mul_pd and _mm256_add_pd may compile to a single fma instruction */
    return _mm256_add_pd(v_hi_dbl,v_lo_dbl);                            /* rounding occurs if the integer doesn't exist as a double                */   
}


__m256d int64_to_double_based_on_cvtsi2sd(const __m256i v)
{   __m128d zero         = _mm_setzero_pd();                            /* to avoid uninitialized variables in_mm_cvtsi64_sd                       */
    __m128i v_lo         = _mm256_castsi256_si128(v);
    __m128i v_hi         = _mm256_extracti128_si256(v,1);
    __m128d v_0          = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_lo));
    __m128d v_2          = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_hi));
    __m128d v_1          = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_lo,1));
    __m128d v_3          = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_hi,1));
    __m128d v_01         = _mm_unpacklo_pd(v_0,v_1);
    __m128d v_23         = _mm_unpacklo_pd(v_2,v_3);
    __m256d v_dbl        = _mm256_castpd128_pd256(v_01);
            v_dbl        = _mm256_insertf128_pd(v_dbl,v_23,1);
    return v_dbl;
}


__m256d uint64_to_double_full_range(const __m256i v)                    
{
    __m256i msk_lo       =_mm256_set1_epi64x(0xFFFFFFFF);
    __m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0);                 /* 2^32                                                                    */

    __m256i v_lo         = _mm256_and_si256(v,msk_lo);                  /* extract the 32 lowest significant bits of v                             */
    __m256i v_hi         = _mm256_srli_epi64(v,32);                     /* 32 most significant bits of v                                           */
    __m256d v_lo_dbl     = uint64_to_double256(v_lo);                   /* v_lo is within specified range of uint64_to_double                      */ 
    __m256d v_hi_dbl     = uint64_to_double256(v_hi);                   /* v_hi is within specified range of uint64_to_double                      */ 
            v_hi_dbl     = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl);        
    return _mm256_add_pd(v_hi_dbl,v_lo_dbl);                            /* rounding may occur for inputs >2^52                                     */ 
}



int main(int argc, char **argv){
  int i;
  uint64_t j;
  __m256i j_4, j_inc;
  __m256d v, v_acc;
  double x[4];
  char test = argv[1][0];

  if (test=='A'){               /* test the conversions for several integer values                                       */
    j = 1ull;
    printf("\nint64_to_double_full_range\n");
    for (i = 0; i<30; i++){
      j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
      v  = int64_to_double_full_range(j_4);
      _mm256_storeu_pd(x,v);
      printf("j =%21li    v =%23.1f    -v=%23.1f    v+3=%23.1f    v-3=%23.1f  \n",j,x[0],x[1],x[2],x[3]);
      j  = j*7ull;
    }

    j = 1ull;
    printf("\nint64_to_double_based_on_cvtsi2sd\n");
    for (i = 0; i<30; i++){
      j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
      v  = int64_to_double_based_on_cvtsi2sd(j_4);
      _mm256_storeu_pd(x,v);
      printf("j =%21li    v =%23.1f    -v=%23.1f    v+3=%23.1f    v-3=%23.1f  \n",j,x[0],x[1],x[2],x[3]);
      j  = j*7ull;
    }

    j = 1ull;                       
    printf("\nuint64_to_double_full_range\n");
    for (i = 0; i<30; i++){
      j_4= _mm256_set_epi64x(j-3,j+3,j,j);
      v  = uint64_to_double_full_range(j_4);
      _mm256_storeu_pd(x,v);
      printf("j =%21lu    v =%23.1f   v+3=%23.1f    v-3=%23.1f \n",j,x[0],x[2],x[3]);
      j  = j*7ull;    
    }
  }
  else{
    j_4   = _mm256_set_epi64x(-123,-4004,-312313,-23412731);  
    j_inc = _mm256_set_epi64x(1,1,1,1);  
    v_acc = _mm256_setzero_pd();
    switch(test){

      case 'B' :{                  
        printf("\nLatency int64_to_double_cvtsi2sd()\n");      /* simple test to get a rough idea of the latency of int64_to_double_cvtsi2sd()     */
        for (i = 0; i<1000000000; i++){
          v  =int64_to_double_based_on_cvtsi2sd(j_4);
          j_4= _mm256_castpd_si256(v);                         /* cast without conversion, use output as an input in the next step                 */
        }
        _mm256_storeu_pd(x,v);
      }
      break;

      case 'C' :{                  
        printf("\nLatency int64_to_double_full_range()\n");    /* simple test to get a rough idea of the latency of int64_to_double_full_range()    */
        for (i = 0; i<1000000000; i++){
          v  = int64_to_double_full_range(j_4);
          j_4= _mm256_castpd_si256(v);
        }
        _mm256_storeu_pd(x,v);
      }
      break;

      case 'D' :{                  
        printf("\nThroughput int64_to_double_cvtsi2sd()\n");   /* simple test to get a rough idea of the throughput of int64_to_double_cvtsi2sd()   */
        for (i = 0; i<1000000000; i++){
          j_4   = _mm256_add_epi64(j_4,j_inc);                 /* each step a different input                                                       */
          v     = int64_to_double_based_on_cvtsi2sd(j_4);
          v_acc = _mm256_xor_pd(v,v_acc);                      /* use somehow the results                                                           */
        }
        _mm256_storeu_pd(x,v_acc);
      }
      break;

      case 'E' :{                  
        printf("\nThroughput int64_to_double_full_range()\n"); /* simple test to get a rough idea of the throughput of int64_to_double_full_range() */
        for (i = 0; i<1000000000; i++){
          j_4   = _mm256_add_epi64(j_4,j_inc);  
          v     = int64_to_double_full_range(j_4);
          v_acc = _mm256_xor_pd(v,v_acc);           
        }    
        _mm256_storeu_pd(x,v_acc);
      }
      break;

      default : {}
    }  
    printf("v =%23.1f    -v =%23.1f    v =%23.1f    -v =%23.1f  \n",x[0],x[1],x[2],x[3]);
  }

  return 0;
}

The actual performance of these functions may depend on the surrounding code and the cpu generation.

Timing results for 1e9 conversions (256 bit wide) with simple tests B, C, D, and E in the code above, on an Intel Skylake i5-6500 system:

Latency experiment int64_to_double_based_on_cvtsi2sd()      (test B)  5.02 sec.
Latency experiment int64_to_double_full_range()             (test C)  3.77 sec.
Throughput experiment int64_to_double_based_on_cvtsi2sd()   (test D)  2.82 sec.
Throughput experiment int64_to_double_full_range()          (test E)  1.07 sec.

The difference in throughput between int64_to_double_full_range() and int64_to_double_based_on_cvtsi2sd() is larger than I expected.

wim
  • Another excellent answer! Did you try the same full precision logic to convert two uint64 to doubles with SSE2? – plasmacel Dec 19 '16 at 17:31
  • 2
    I did some experiments with similar code, but with 128-bit wide vectors and instructions up to SSE4, and the results were very disappointing. Conversion with one `movq`, one `pextrq`, one `unpcklpd`, and two `cvtsi2sd` turned out to be much faster than the other approach (see the sketch after these comments). – wim Dec 19 '16 at 21:20
  • Note that in principle it is possible to use `-Ofast` in combination with the function attribute `__attribute__ ((optimize("no-fast-math")))`, But that might lead to inefficient code, see [this Godbolt link](https://godbolt.org/z/OLgHUs). – wim Jun 15 '19 at 10:03
  • Here's what I don't get. I'm running a Skylake-X 36-thread CPU. I plugged your "int64_fast_precise" method into a google bench mark and compared it to the native _mm256_cvtepi64_pd. To my surprise, your function was faster than the native function. What's going on here? `Benchmark Time CPU Iterations benchmark_wim_int64_to_double_fast_precise_256 0.708 ns 0.719 ns 1000000000 benchmark_native_int64_to_double 2.27 ns 2.29 ns 320000000` – HesNotTheStig Aug 11 '21 at 04:59
  • Pardon the formatting. To interpret, the native implementation took 2.27ns while your implementation took 0.708ns. – HesNotTheStig Aug 11 '21 at 05:02
  • I reimplemented all of the examples here to operate on a single int64_t in a __m128i vector and convert it to a single double (scalar). I then compared it to `_mm_cvtsi64_sd(__m128d,long long)` and found that in this case, on my work laptop (which runs an i7 Skylake), the native implementation ran fastest in Google Benchmark. Native was 0.683ns vs the "fast-precise" method that ran in 0.890ns. – HesNotTheStig Aug 11 '21 at 17:31
  • Scratch that. I redid all my implementations and benchmarks and got results that make sense. No idea what was wrong. For 256-bit width vectors, wim's fast-precise impl takes 0.644ns while the native instruction (_mm256_cvtepi64_pd) takes 0.257ns. – HesNotTheStig Aug 12 '21 at 02:29
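
For completeness, the 128-bit SSE4.1 variant wim describes a few comments up might look like the following sketch (an illustration added here, not code from the answer or comment; the function name is made up):

#include <smmintrin.h>

//  Sketch of the 128-bit approach from wim's comment above: one movq, one pextrq
//  (SSE4.1), two cvtsi2sd and one unpcklpd. Covers the full int64_t range
//  (values that aren't exactly representable are rounded, as with any
//  int64 -> double conversion). Requires x86-64.
__m128d int64_to_double_sse41(__m128i v){
    __m128d zero = _mm_setzero_pd();
    __m128d lo = _mm_cvtsi64_sd(zero, _mm_cvtsi128_si64(v));     //  movq + cvtsi2sd (element 0)
    __m128d hi = _mm_cvtsi64_sd(zero, _mm_extract_epi64(v, 1));  //  pextrq + cvtsi2sd (element 1)
    return _mm_unpacklo_pd(lo, hi);                              //  unpcklpd: pack {lo, hi}
}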
0

Thanks @mysticial and @wim for the full-range i64->f64. I came up with a full-range truncating f64->i64 for the Highway SIMD wrapper.

The first version tried to change the rounding mode, but Clang reorders instructions across the mode change and ignores asm volatile, memory/cc clobbers, and even an atomic fence. It's not clear to me how to make that safe; NOINLINE works but causes lots of spilling.

A second version (Compiler Explorer link) emulates FP renormalization and turns out to be faster according to llvm-mca (8-10 cycles reciprocal throughput/total).

// Full-range F64 -> I64 conversion
#include <hwy/highway.h>

namespace hwy {
namespace HWY_NAMESPACE {

HWY_API Vec256<int64_t> I64FromF64(Full256<int64_t> di, const Vec256<double> v) {
  const RebindToFloat<decltype(di)> dd;
  using VD = decltype(v);
  using VI = decltype(Zero(di));

  const VI k0 = Zero(di);
  const VI k1 = Set(di, 1);
  const VI k51 = Set(di, 51);

  // Exponent indicates whether the number can be represented as int64_t.
  const VI biased_exp = ShiftRight<52>(BitCast(di, v)) & Set(di, 0x7FF);
  const VI exp = biased_exp - Set(di, 0x3FF);
  const auto in_range = exp < Set(di, 63);

  // If we were to cap the exponent at 51 and add 2^52, the number would be in
  // [2^52, 2^53) and mantissa bits could be read out directly. We need to
  // round-to-0 (truncate), but changing rounding mode in MXCSR hits a
  // compiler reordering bug: https://gcc.godbolt.org/z/4hKj6c6qc . We instead
  // manually shift the mantissa into place (we already have many of the
  // inputs anyway).
  const VI shift_mnt = Max(k51 - exp, k0);
  const VI shift_int = Max(exp - k51, k0);
  const VI mantissa = BitCast(di, v) & Set(di, (1ULL << 52) - 1);
  // Include implicit 1-bit; shift by one more to ensure it's in the mantissa.
  const VI int52 = (mantissa | Set(di, 1ULL << 52)) >> (shift_mnt + k1);
  // For inputs larger than 2^52, insert zeros at the bottom.
  const VI shifted = int52 << shift_int;
  // Restore the one bit lost when shifting in the implicit 1-bit.
  const VI restored = shifted | ((mantissa & k1) << (shift_int - k1));

  // Saturate to LimitsMin (unchanged when negating below) or LimitsMax.
  const VI sign_mask = BroadcastSignBit(BitCast(di, v));
  const VI limit = Set(di, LimitsMax<int64_t>()) - sign_mask;
  const VI magnitude = IfThenElse(in_range, restored, limit);

  // If the input was negative, negate the integer (two's complement).
  return (magnitude ^ sign_mask) - sign_mask;
}

void Test(const double* pd, int64_t* pi) {
    Full256<int64_t> di;
    Full256<double> dd;
    for (size_t i = 0; i < 256; i += Lanes(di)) {
      Store(I64FromF64(di, Load(dd, pd + i)), di, pi + i);
    }
}

}
}

If anyone sees any potential for simplifying the algorithm, please leave a comment.
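
For readers who want plain x86 intrinsics (as requested in the comments below), a hypothetical AVX2-only transcription of the same algorithm is sketched here; it is an illustration added to this page, not code from the answer or from Highway. Note that it relies on AVX2 variable shifts returning 0 when the count is 64 or more, which covers the cases where a shift amount goes "negative".

#include <immintrin.h>
#include <stdint.h>

//  Hypothetical AVX2 translation of the Highway code above: truncating,
//  full-range double -> int64_t with saturation of out-of-range values.
__m256i I64FromF64_avx2(__m256d v){
    const __m256i zero = _mm256_setzero_si256();
    const __m256i one  = _mm256_set1_epi64x(1);
    const __m256i bits = _mm256_castpd_si256(v);

    //  Unbiased exponent tells us whether the value fits in int64_t.
    __m256i exp = _mm256_and_si256(_mm256_srli_epi64(bits, 52), _mm256_set1_epi64x(0x7FF));
    exp = _mm256_sub_epi64(exp, _mm256_set1_epi64x(0x3FF));
    const __m256i in_range = _mm256_cmpgt_epi64(_mm256_set1_epi64x(63), exp);

    //  max(x, 0) without AVX-512: zero out the lanes that are negative.
    __m256i shift_mnt = _mm256_sub_epi64(_mm256_set1_epi64x(51), exp);
    shift_mnt = _mm256_andnot_si256(_mm256_cmpgt_epi64(zero, shift_mnt), shift_mnt);
    __m256i shift_int = _mm256_sub_epi64(exp, _mm256_set1_epi64x(51));
    shift_int = _mm256_andnot_si256(_mm256_cmpgt_epi64(zero, shift_int), shift_int);

    const __m256i mantissa = _mm256_and_si256(bits, _mm256_set1_epi64x((1LL << 52) - 1));
    //  Add the implicit 1-bit and shift the mantissa into integer position (truncates).
    const __m256i int52 = _mm256_srlv_epi64(_mm256_or_si256(mantissa, _mm256_set1_epi64x(1LL << 52)),
                                            _mm256_add_epi64(shift_mnt, one));
    const __m256i shifted = _mm256_sllv_epi64(int52, shift_int);
    //  Restore the bit lost to the extra right shift (only matters when shift_int > 0;
    //  otherwise the count wraps to >= 64 and the term is 0).
    const __m256i restored = _mm256_or_si256(shifted,
        _mm256_sllv_epi64(_mm256_and_si256(mantissa, one), _mm256_sub_epi64(shift_int, one)));

    //  Saturate out-of-range magnitudes, then apply the sign (two's complement).
    const __m256i sign_mask = _mm256_cmpgt_epi64(zero, bits);                  //  all-ones if v is negative
    const __m256i limit = _mm256_sub_epi64(_mm256_set1_epi64x(INT64_MAX), sign_mask);
    const __m256i magnitude = _mm256_blendv_epi8(limit, restored, in_range);
    return _mm256_sub_epi64(_mm256_xor_si256(magnitude, sign_mask), sign_mask);
}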

Jan Wassenberg
  • 1
    clang `-frounding-math` tells gcc/clang's optimizers not to assume the default rounding mode. https://gcc.godbolt.org/z/xrYv9WMsE shows that makes `_mm_getcsr` / `_mm_setcsr` work the way you want. (So you don't need the ifdefs / inline asm.) Possibly also `#pragma STDC FENV_ACCESS on` should be used, although I think at least GCC doesn't really fully support that pragma. However, https://gcc.gnu.org/wiki/FloatingPointMath points out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678 GCC doesn't always fully respect changes to FP rounding mode for different math in one function. – Peter Cordes Jun 29 '21 at 16:57
  • Thanks for sharing the links! Wow, I hadn't seen that bug report yet, ongoing since 10 years. That's quite scary, and I'm now very inclined to avoid non-default rounding modes. – Jan Wassenberg Jun 30 '21 at 07:26
  • Please rewrite the function in terms of pure X86 intrinsics. The original question is about SSE/AVX, not a third party library, so as is, it is not related here. – plasmacel Jun 30 '21 at 15:29
  • The Compiler Explorer link includes assembly output :) – Jan Wassenberg Jun 30 '21 at 17:01
  • 1
    @JanWassenberg Assembly is not C++. As is, this post doesn't answer the original question. It doesn't even contain a single SSE/AVX intrinsic. It uses a third party library, which acts like a pseudo code here. – plasmacel Jun 30 '21 at 23:57
  • I can only improve the assembly. First, `msk = pcmpgtq ZERO, B; B = pblendvb B, ZERO, msk` sequences can be replaced with `msk = pcmpgtq ZERO, B; B = pandn msk, B`. That replaces two 2c latency `pblendvb` instructions with 2 1c latency instructions. Next, these 2 places also use `pcmpgtq` which somehow has 3c latency. Luckily we only want to know if the number is negative, which can be done with a 2c latency `psrad A 31; pshufd A 245` chain. Now we don't need a ZERO register. Next we can eliminate the -1 register by replacing add -1 with sub 1, since we also have a 1 register. – MrUnbelievable92 Jan 23 '22 at 18:34
  • After that, ternlog instructions could fold a `pand; por` sequence and replace another `pblendvb` instruction. After all these assembly improvements, I would look at the possibility of doing all the exponent calculations by AND-ing out the mantissa and sign at the beginning instead of shifting and ANDing (which can btw be done with a 1-bit shift left and then a 53-bit shift right, eliminating another constant register) - doing all the calculations there and possibly reducing instructions – MrUnbelievable92 Jan 23 '22 at 18:38
  • Finally, btw - at least in C# the standard dictates that overflow during f64->i64 conversion results in i64s min value. According to that standard your solution fails. Instead `LimitsMax` we use the minimum limit, perform 2's complement before the `IfThenElse` - the `pblendv` - and then select between the converted values and the overflow value. That removes the ` - sign_mask;` instruction but probably doesn't reduce latency. – MrUnbelievable92 Jan 23 '22 at 23:40