How to perform uint32/float conversion with SSE?

Question

In SSE there is a function _mm_cvtepi32_ps(__m128i input) which takes input vector of 32 bits wide signed integers (int32_t) and converts them into floats.

Now, I want to interpret input integers as not signed. But there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function or at least give a hint on the implementation? To illustrate the the difference in results:

unsigned int a = 2480160505; // 10010011 11010100 00111110 11111001   
float a1 = a; // 01001111 00010011 11010100 00111111;  
float a2 = (signed int)a; // 11001110 11011000 01010111 10000010

If you are using 32-bit `int` then the maximum positive value is `2147483647` so your number `2480160505` cannot be represented as `signed int`. — Weather Vane, Dec 03 '15 at 12:49
If your values are known to be in a limited range (I think `[0 .. 2^23]`, since single-precision has a 23-bit mantissa) try a 32-bit version of Mysticial's [64-bit unsigned to/from double](http://stackoverflow.com/a/41148578/224132). He explains the method well enough that you should be able to find the right constants to make the IEEE-754 bit-pattern trick work. — Peter Cordes, Dec 28 '16 at 04:31

score 8 · Answer 1 · edited May 23 '17 at 11:46

With Paul R's solution and with my previous solution the difference between the rounded floating point and the original integer is less than or equal to 0.75 ULP (Unit in the Last Place). In these methods at two places rounding may occur: in _mm_cvtepi32_ps and in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs.

For example, with Paul R's method 0x2000003=33554435 is converted to 33554432.0, but 33554436.0 also exists as a float, which would have been better here. My previous solution suffers from similar inaccuracies. Such inaccurate results may also occur with compiler generated code, see here.

Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i msk_lo    = _mm_set1_epi32(0xFFFF);
    __m128  cnst65536f= _mm_set1_ps(65536.0f);

    __m128i v_lo      = _mm_and_si128(v,msk_lo);          /* extract the 16 lowest significant bits of v                                   */
    __m128i v_hi      = _mm_srli_epi32(v,16);             /* 16 most significant bits of v                                                 */
    __m128  v_lo_flt  = _mm_cvtepi32_ps(v_lo);            /* No rounding                                                                   */
    __m128  v_hi_flt  = _mm_cvtepi32_ps(v_hi);            /* No rounding                                                                   */
            v_hi_flt  = _mm_mul_ps(cnst65536f,v_hi_flt);  /* No rounding                                                                   */
    return              _mm_add_ps(v_hi_flt,v_lo_flt);    /* Rounding may occur here, mul and add may fuse to fma for haswell and newer    */
}                                                         /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP     */

Note that other high bits/low bits partitions are possible as long as _mm_cvt_ps can convert both pieces to floats without rounding. For example, a partition with 20 high bits and 12 low bits will work equally well.

_mm_cvtepu32_ps exists in the AVX512VL instruction set. If your CPU has AVX512 but your compiler does not have _mm_cvtepu32_ps then you can use _mm512_castps512_ps128(_mm512_cvtepu32_ps(_mm512_castsi128_si512(a))) — A Fog, Apr 17 '19 at 12:54
@AFog: Good point. AVX512/AVX512VL has filled many gaps in the SSE/AVX instruction set indeed. Your `_mm_cvtepu32_ps` workaround is particularly useful for Knights Landing and Knights Mill CPUs, which don't have AVX512VL. — wim, Apr 18 '19 at 10:59
More upvotes to this answer! This is the only one that gives *exactly* the same answer as c++ conversion: f = float(v) would. Tested with all inputs in range [0, 0xffffffff]. — t0rakka, Apr 16 '20 at 16:18

score 5 · Accepted Answer · edited May 23 '17 at 12:33

5

This functionality exists in AVX-512, but if you can't wait until then the only thing I can suggest is to convert the unsigned int input values into pairs of smaller values, convert these, and then add them together again, e.g.

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);     // v2 = v / 2
    __m128i v1 = _mm_sub_epi32(v, v2);     // v1 = v - (v / 2)
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(v2f, v1f); 
}

UPDATE

As noted by @wim in his answer, the above solution fails for an input value of UINT_MAX. Here is a more robust, but slightly less efficient solution, which should work for the full uint32_t input range:

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);                 // v2 = v / 2
    __m128i v1 = _mm_and_si128(v, _mm_set1_epi32(1));  // v1 = v & 1
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(_mm_add_ps(v2f, v2f), v1f);      // return 2 * v2 + v1
}

edited May 23 '17 at 12:33

Community

1
1

answered Dec 03 '15 at 13:57

Paul R

208,748
37
389
560

Thank you. Btw, why don't you use reference val in the routine arguments? – Kirill Lykov Dec 03 '15 at 14:28
Do you mean pass `v` to `_mm_cvtepu32_ps` as a C++ reference or via a C pointer or something, rather than by value ? In practice it doesn't make any difference so long as the function is `inline`. – Paul R Dec 03 '15 at 14:37
1

@KirillLykov: For the rare cases of functions that don't get inlined, but have vector args, passing by value is more efficient. Instead of passing a pointer in an integer register, the vector is passed in a vector register. So it doesn't have to get spilled by the caller and reloaded by the callee. (Actually, in the SysV ABI, it would have to get spilled if the caller wants to keep the old value, because there are no call-preserved vector regs in that ABI. It would save the called function from having to load it, though.) – Peter Cordes Dec 03 '15 at 15:51
1

Huh, I was going to ask if it could be better to zero-extend to 64bit first and then shuffle the results of two conversions back into a float vector, but there aren't any functions to convert FP to/from anything but signed DWORDs. (until AVX512. Neat.) – Peter Cordes Dec 03 '15 at 15:57

score 4 · Answer 3 · edited Jun 20 '20 at 09:12

I think Paul's answer is nice, but it fails for v=4294967295U (=2^32-1). In that case v2=2^31-1 and v1=2^31. Intrinsic _mm_cvtepi32_ps converts 2^31 to -2.14748365E9 . v2=2^31-1 is converted to 2.14748365E9 and consequently _mm_add_ps returns 0 (due to rounding v1f and v2f are the exact opposite of each other).

The idea of the solution below is to copy the most significant bit of v to v_high. The other bits of v are copied to v_low. v_high is converted to 0 or 2.14748365E9 .

inline __m128 _mm_cvtepu32_v3_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i zero=_mm_xor_si128(msk0,msk0);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000); /* IEEE representation of float 2^31 */

__m128i v_high=_mm_andnot_si128(msk0,v);
__m128i v_low=_mm_and_si128(msk0,v);
__m128  v_lowf=_mm_cvtepi32_ps(v_low);
__m128i msk1=_mm_cmpeq_epi32(v_high,zero);
__m128  v_highf=_mm_castsi128_ps(_mm_andnot_si128(msk1,cnst2_31));  
__m128  v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;

}

Update

It was possible to reduce the number of instructions:

inline __m128 _mm_cvtepu32_v4_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000);

__m128i msk1=_mm_srai_epi32(v,31);
__m128i v_low=_mm_and_si128(msk0,v);
__m128  v_lowf=_mm_cvtepi32_ps(v_low);
__m128  v_highf=_mm_castsi128_ps(_mm_and_si128(msk1,cnst2_31));  
__m128  v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}

Intrinsic _mm_srai_epi32 shifts the most significant bit of v to the right, while shifting in sign bits, which turns out to be quite useful here.

Good catch - I hadn't considered the boundary case of `UINT_MAX` - I wonder if there's a more efficient way of doing this though? — Paul R, Dec 09 '15 at 14:57
@Paul R: There seems to be a slightly more efficient way indeed, see update. Thanks for your remark. — wim, Dec 09 '15 at 22:18
This solution is slightly inaccurate in a few cases because of accumulating rounding errors. Wim's solution is better. — A Fog, Apr 17 '19 at 12:52

How to perform uint32/float conversion with SSE?

3 Answers3

Update

Linked