Strange uint32_t to float array conversion

Question

I have the following code snippet:

#include <cstdio>
#include <cstdint>

static const size_t ARR_SIZE = 129;

int main()
{
  uint32_t value = 2570980487;

  uint32_t arr[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
    arr[x] = value;

  float arr_dst[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
  {
    arr_dst[x] = static_cast<float>(arr[x]);
  }

  printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");

  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
  return 0;
}

If I compile it under MS Visual Studio 2015 I can see that the output is:

WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000

So the last arr_dst element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array! Is it a bug?

I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:

for (int x = 0; x < ARR_SIZE; ++x)
{
  if (x == 0)
    x = 0;
  arr_dst[x] = static_cast<float>(arr[x]);
}

So this probably is some issue with vectorizing optimisation.

This behavior does not reproduce on gcc 4.8. Any ideas?

nevermind, my previous comment was a bit stupid. can you generate and send the assembly? what results do you get with optimizations off? — asu, Oct 19 '16 at 17:53
To try out Bollingers answer reduce ARR_SIZE to even vectorization width, e.g. 128. See if it changes the result. — Andreas, Oct 19 '16 at 18:09
@Asu That's what VS2015 outputs: https://gist.github.com/senyai/3e4b6a9118418d1536476218459cd12d — Senyai, Oct 19 '16 at 21:01
you compare float with int. As float is not garantued to be precise, you are comparing pears with apples. https://en.wikipedia.org/wiki/Single-precision_floating-point_format — Mat, Nov 04 '16 at 14:39
@Mat `you[Senyai] compare float with int` - no: `arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2]` - last two elements of one and the same `float` array. — greybeard, Nov 06 '16 at 07:37
Welcome to limited precision and rounding modes: your `float` doesn't have as many mantissa bits as useful for your `int` literal. With rounding mode not set explicitly, any given C++ standard may or may not guarantee one. Your compiler (which may or may not follow any given standard) seems to use random, _and not eliminate the conversion as a common sub-expression_. (Did you try different "optimisation options"? Please report. Try to find out how many different values are present, vary the literal, …) `probably…some issue with vectorizing optimisation` - I'd be surprised. — greybeard, Nov 06 '16 at 07:39

John Bollinger · Answer 1 · 2016-10-19T18:09:20.337

5

A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t.

At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.

If the program performs the == comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if indeed they compared unequal, then it is likely that the results of those conversions differ, too.

I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.

edited Oct 19 '16 at 18:09

answered Oct 19 '16 at 17:57

John Bollinger

160,171
8
81
157

I'm sure you're on the right track, since the two values output from the program are the two closest IEEE 32-bit floats to the original integer. But I can't imagine a scenario, even with different width floating point types, where rounding would occur inconsistently. – Mark Ransom Oct 19 '16 at 19:00
@MarkRansom, I haven't suggested any inconsistent rounding. I've suggested that the two values actually used result from different sequences of conversions (because of optimizations applied by the compiler), and that leaves no basis for inconsistency because there is no reason to expect the same rounding in the first place. I can't imagine a scenario where the reported result is *not* a result of such program behavior. – John Bollinger Oct 20 '16 at 06:10
1

I'd buy that argument if either of the printed values was equal to the integer you started with, but *both* of them are rounded. Thus my comment about inconsistent rounding. Generally rounding in IEEE floating point is very tightly specified. – Mark Ransom Oct 20 '16 at 18:20
@MarkRansom: `unsigned` -> `float` conversion is hard on x86, unlike signed int, and isn't doable in one step. MSVC uses two different conversion strategies for vectorized vs. scalar. The scalar one involves converting to double, but the vector one goes straight to float (and apparently has more than 0.5ulp error in this case). – Peter Cordes Oct 23 '16 at 09:47

score 5 · Answer 2 · edited May 23 '17 at 12:07

Converting between unsigned and float is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)

MSVC vectorizes the first 128 with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double and then from double to float.

gcc and clang produce the 2570980608.0 result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.

IEEE math demands that + - * / and sqrt produce correctly rounded results (less than 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).

See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.

We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):

; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040                   ; size = 516
_arr$ = -520                        ; size = 516
_main   PROC                        ; COMDAT

  00013      mov     edx, 129
  00018      mov     eax, -1723986809   ; this is your unsigned 2570980487
  0001d      mov     ecx, edx
  00023      lea     edi, DWORD PTR _arr$[esp+1088]  ; edi=arr
  0002a      rep stosd             ; memset in chunks of 4B
  # arr[0..128] = 2570980487 at this point

  0002c      xor     ecx, ecx      ; i = 0
  # xmm2 = 0.0 in each element (i.e. all-zero)
  # xmm3 = __xmm@4f8000004f8000004f8000004f800000  (a constant repeated in each of 4 float elements)


  ####### The vectorized unsigned->float conversion strategy:
  $LL7@main:                                       ; do{
  00030      movups  xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088]  ; load 4 uint32_t
  00038      cvtdq2ps xmm1, xmm0                 ; SIGNED int to Single-precision float
  0003b      movaps  xmm0, xmm1
  0003e      cmpltps xmm0, xmm2                  ; xmm0 = (xmm0 < 0.0)
  00042      andps   xmm0, xmm3                  ; mask the magic constant
  00045      addps   xmm0, xmm1                  ; x += (x<0.0) ? magic_constant : 0.0f;
   # There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
  00048      movups  XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
  ; and repeat the same thing again, with addresses that are 16B higher (+1104)
  ; i.e. this loop is unrolled by two

  0006a      add     ecx, 8         ;  i+=8 (two vectors of 4 elements)
  0006d      cmp     ecx, 128
  00073      jb  SHORT $LL7@main    ; }while(i<128)

 #### End of vectorized loop
 # and then IDK what MSVC smoking; both these values are known at compile time.  Is /Ogtp not full optimization?
 # I don't see a branch target that would let execution reach this code
 #  other than by falling out of the loop that ends with ecx=128
  00075      cmp     ecx, edx
  00077      jae     $LN21@main     ; if(i>=129): always false

  0007d      sub     edx, ecx       ; edx = 129-128 = 1

... some more ridiculous known-at-compile-time jumping later ...

 ######## The scalar unsigned->float conversion strategy for the last element
$LC15@main:
  00140      mov     eax, DWORD PTR _arr$[esp+ecx*4+1088]
  00147      movd    xmm0, eax
  # eax = xmm0[0] = arr[128]
  0014b      cvtdq2pd xmm0, xmm0        ; convert the last element TO DOUBLE
  0014f      shr     eax, 31            ; shift the sign bit to bit 1, so eax = 0 or 1
     ; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
  00152      addsd   xmm0, QWORD PTR __xmm@41f00000000000000000000000000000[eax*8]
  0015b      cvtpd2ps xmm0, xmm0        ; double -> float
  0015f      movss   DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0  ; and store it

  00165      inc     ecx            ;   ++i;
  00166      cmp     ecx, 129       ; } while(i<129)
  0016c      jb  SHORT $LC15@main
  # Yes, this is a loop, which always runs exactly once for the last element

By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)

See the code on the Godbolt compiler explorer.

gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.

Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissa of two floats directly (with some tricks to set the exponents (bitwise boolean stuff and an ADDPS), then adds the low and high half together like gcc does.

Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.

Indeed gcc produces the correctly rounded result for every possible uint32_t (no need to check them all), because rounding only takes place in the final addition of the high and low 16 bit halves, see also [here](http://stackoverflow.com/a/40766669/2439725). This addition is within 0.5 ULP according to the IEEE-754 standard. — wim, Nov 23 '16 at 21:30

Andreas · Accepted Answer · 2016-10-25T17:30:00.597

2

I did an investigation on a PowerPC imeplementation (Freescale MCP7450) as they IMHO are far better documented than any voodoo Intel comes up with.

As it turns out the floating point unit, FPU, and vector unit may have different rounding for floating point operations. The FPU can be configured to use one of four rounding modes; round to nearest (default), truncate, towards positive infinity and towards negative infinity. The vector unit however is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106-bit. The vector unit fulfills IEEE-754 but the documentation does not state much more.

Looking at your result the conversion 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit OR different rounding modes.

edited Oct 25 '16 at 17:30

answered Oct 22 '16 at 15:35

Andreas

5,086
3
16
36

1

That's it. The values are truly equal after calling `_control87(_MCW_RC, RC_DOWN);` or `_control87(_MCW_RC, _RC_UP);`. I wonder what is the best practice for setting the FPU state, seems like nobody is using it. – Senyai Oct 22 '16 at 20:26
@Senyai: What does that do to the asm? Neither the vectorized (2570980352) nor the scalar (2570980608) conversion in the asm you linked use x87 for anything; it's all SSE math. Or does `_control87` also set the MXCSR rounding mode? – Peter Cordes Oct 23 '16 at 22:04
@Senyai: The "best practice" for setting the rounding mode is to leave it on the default round-to-nearest (with even as a tiebreak). That's what Bruce Dawson suggests in his [article on floating-point determinism and FP settings](https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/). – Peter Cordes Oct 23 '16 at 22:08
Also, the last paragraph is nonsense. Intel's x86 manuals are very detailed and available for free as PDF without any kind of registration requirement. Links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info), specifically [this link to Intel's manuals](https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html). BTW, your answer was not a bad guess, it could have been that the 129th conversion was using scalar x87 (with 80-bit internal precision), as well as doing the unsigned conversion with a different strategy. – Peter Cordes Oct 23 '16 at 22:14
1

@PeterCordes Thank you for the links for x86/Intel manuals. Guess I´ve been looking at the wrong places. Reading the Intel manuals they too have possibly different rounding modes. Intel however may set rounding mode for the SSE/SSE2 to any of the four setting the MXCSR register, section 4.8.4.1. Reading further makes me stand by the claim that I´ve seen much better manuals. The MXCSR register default is "1F80H". I´m not even gonna bother decipher that. – Andreas Oct 25 '16 at 17:28
@PeterCordes In case you missed it: We are making the assumption all elements are calculated using SSE except the last element which is calculated by x87. Prove that to be wrong and the post is moot for whatever platform you tested it on. – Andreas Oct 25 '16 at 17:32
From the asm linked by the OP, both vector and scalar conversions are done with SSE, no x87. See my answer where I commented the asm and described both strategies. re: MXCSR: the default rounding mode in C is always round-to-nearest (with even as a tie-break), and that's what the default MXCSR setting does. The other default bit are all exceptions masked, and I think no exception flags already raised. – Peter Cordes Oct 26 '16 at 01:36
Anyway, the imperfect rounding of MSVC's vectorized `unsigned`->`float` conversion is because there's no single instruction that does that, and it has to hack a conversion together out of some magic constants. All the math it does is done with 32-bit IEEE floats, with correct round-to-nearest. – Peter Cordes Oct 26 '16 at 01:38

Strange uint32_t to float array conversion

3 Answers3

Linked