TL:DR: use `lrint(x)` or `(int)rint(x)` to convert from float to int with round-to-nearest instead of truncation. Unfortunately not all compilers efficiently inline the same math functions, though. See *round() for float in C++*.
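For example, a minimal sketch (not from the question's code; link with `-lm` if your libc needs it):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 241.99999999;
    printf("(int)x       = %d\n",  (int)x);        // 241: C truncates toward zero
    printf("lrint(x)     = %ld\n", lrint(x));      // 242: rounds to nearest
    printf("(int)rint(x) = %d\n",  (int)rint(x));  // 242: rint rounds, then we convert
    return 0;
}
```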
`gcc -mno-sse2` has to use x87 for `double`, even in 64-bit code. x87 registers have an internal precision of 80 bits, but SSE2 uses the IEEE binary64 (aka `double`) format natively in XMM registers, so all the temporaries are rounded to 64-bit `double` at each step.
The problem isn't anything as interesting as the double-rounding problem (80-bit -> 64-bit, then to integer). It's also not from `gcc -O0` (the default: no extra optimizations) rounding when storing temporaries to memory, because you did the whole thing in one C statement, so it just uses x87 registers for the whole expression.
It's simply that 80-bit precision leads to a result that is just below 242.0 and is truncated to 241 by C's float->int semantics, while SSE2 produces a result just above 242.0, which truncates to 242. For x87, rounding down to the next lower integer happens consistently for any input from 1 to 65535, not just 242. (I made a version of your program using `atoi(argv[1])` so I could test other values, and compiled with `-O3`; see the sketch below.)
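The test version was roughly like this (a reconstruction, so the exact expression may differ slightly from your code; the coefficients are the standard Rec. 709 values):

```c
/* Reconstructed sketch of the test version: same idea as the question's
 * code, but taking the input from the command line. */
#include <stdio.h>
#include <stdlib.h>

static const double rec709_luma_coeff[3] = { 0.2126, 0.7152, 0.0722 };

int main(int argc, char **argv) {
    unsigned n = argc > 1 ? atoi(argv[1]) : 242;
    /* Gray pixel: R = G = B = n, so the luma "should" come out as n.
     * The whole expression is one C statement, so gcc -mno-sse2 keeps
     * it in x87 registers at 80-bit precision throughout. */
    unsigned luma = n * rec709_luma_coeff[0]
                  + n * rec709_luma_coeff[1]
                  + n * rec709_luma_coeff[2];
    printf("%u\n", luma);
    return 0;
}
```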
Remember that `int foo = 123.99999;` sets `foo` to 123, because C uses the "truncation" rounding mode (towards zero) for float->int conversion. For non-negative numbers, this is the same as `floor` (which rounds towards -Infinity): https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes
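For example (just C conversion semantics, nothing specific to this question):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    printf("%d\n", (int)123.99999);    // 123: truncation toward zero
    printf("%d\n", (int)-1.5);         // -1: truncation, not floor
    printf("%.1f\n", floor(-1.5));     // -2.0: floor rounds toward -Infinity
    return 0;
}
```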
`double` can't represent the coefficients exactly: I printed them with gdb and got `{0.21260000000000001, 0.71519999999999995, 0.0722}`. Those decimal representations are probably not exact representations of the base-2 floating-point values, but they're close enough to see that the coefficients add up to 0.99999999999999996 (using an arbitrary-precision calculator).
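If you want the exact values rather than gdb's decimal approximations, you can print them as hex floats with `%a`, or with 17 significant digits (enough to round-trip a `double`), e.g.:

```c
#include <stdio.h>

int main(void) {
    const double c[3] = { 0.2126, 0.7152, 0.0722 };
    for (int i = 0; i < 3; i++)
        printf("%.17g  (%a)\n", c[i], c[i]);
    /* Note: this sum is itself rounded to double at each addition, which
     * is why I used an arbitrary-precision calculator for the exact sum. */
    printf("sum in double = %.17g\n", c[0] + c[1] + c[2]);
    return 0;
}
```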
We get consistent rounding down because x87 internal precision is higher than the precision of the coefficients, so the total rounding error in `n * rec709_luma_coeff[0]` and so on, and in summing up the results, is ~2^11 times smaller than the difference between the sum of the coefficients and 1.0 (a 64-bit significand vs. 53 bits: 2^(64-53) = 2^11). So the x87 result reliably lands just below n, and truncation takes it down to n-1.
The real question is how the SSE2 version managed to work! Presumably rounding to nearest-even on the temporaries happens to go upward in enough cases, at least for 242. It happens to produce the original input for more cases than not, but it produces input-1 for 5, 7, 10, 13, 14, 20, ... (252 of the numbers from 1..1000 are "munged" by the SSE2 version, so it's not like it always works either; see the loop below.)
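That count came from a quick brute-force loop over the inputs, something like this (compiled normally, i.e. with SSE2, so the `double` math really is binary64):

```c
#include <stdio.h>

static const double rec709_luma_coeff[3] = { 0.2126, 0.7152, 0.0722 };

int main(void) {
    unsigned munged = 0;
    for (unsigned n = 1; n <= 1000; n++) {
        unsigned luma = n * rec709_luma_coeff[0]
                      + n * rec709_luma_coeff[1]
                      + n * rec709_luma_coeff[2];
        if (luma != n) {
            munged++;
            // printf("%u -> %u\n", n, luma);   // uncomment to list them
        }
    }
    printf("%u of 1000 inputs munged\n", munged);
    return 0;
}
```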
With `-O3` for your source, gcc does the calculation at compile time with extended precision and produces the exact result, i.e. it compiles the same as `printf("%u\n", n);`.
And BTW, you should use `static const` for your constants so gcc can optimize better. `static` alone is still much better than a plain global, though, because the compiler can see that nothing in the compilation unit writes the values or passes their address anywhere, so it can treat them as if they were `const`.