6

I'm doing a project where I do RGB to luma conversions, and I have some rounding issues with the -mno-sse2 flag:

Here's the test code:

#include <stdio.h>
#include <stdint.h>

static double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};

int main()
{
    uint16_t n = 242 * rec709_luma_coeff[0] + 242 * rec709_luma_coeff[1] + 242 * rec709_luma_coeff[2];

    printf("%u\n", n);
    return 0;
}

And here's what I get:

user@gentoo>gcc -mno-sse2 test.c -o test && ./test
241
user@gentoo> gcc test.c -o test && ./test
242

I suppose that gcc uses sse2 optimizations for double multiplications, but what I don't get is why the optimized version would be the correct one.

Also, what do you recommend I use to get more consistent results, ceil() or floor()?

mrflash818
  • 930
  • 13
  • 24
user3618511
  • 105
  • 5
  • 5
    It has nothing to do with optimization. No SSE2 means use of old x87 FPU, which is **wider** than SSE2. In a sense, x87 results are done with higher precision, but results might be different from one done using SSE2 – Severin Pappadeux Jan 28 '16 at 18:29
  • Strangely, when compiling with the `-O2` flag, the problem goes away... – perror Jan 28 '16 at 18:33
  • If you enable optimization, you get also 242 with `-mno-sse2` – jofel Jan 28 '16 at 18:33
  • Well, thanks Severin. I guess I can only wait for the clusterfuck of retro compatibility that is x86 to be replaced by something better (if only we had POWER8/9 consumer CPUs). – user3618511 Jan 28 '16 at 18:37
  • 2
    I would suggest `round()` or `nearbyint()` instead of either `ceil()` or `floor()`. The semantics of the former are more likely what you want. Also, all of those risk a bit of instability near their discontinuities, but for `round()` and `nearbyint()` those occur at half-integers, whereas for `ceil()` and `floor()` they occur at integers. – John Bollinger Jan 28 '16 at 18:37
  • Ah, looking at the assembly code, it seems that the `-mno-sse2` is using the fpu, and the other is using MMX instructions. No SSE2 instruction in sight. – perror Jan 28 '16 at 18:39
  • 2
    @user3618511 why are you using doubles for a colour space transformation in the first place though? That's epic overkill. – harold Jan 28 '16 at 18:40
  • @harold Well, I wasn't sure if float was enough, sooooo better too much than too little. – user3618511 Jan 28 '16 at 18:54
  • 1
    @user3618511 float is also overkill – harold Jan 28 '16 at 19:07
  • @harold Huh? What am I supposed to use then. These are the official luma coeff. – user3618511 Jan 28 '16 at 19:17
  • 4
    @user3618511 this sort of thing is almost always done with fixed point math. For example, `luma = (2126 * r + 7152 * g + 722 * b + 5000) / 10000`. If anything that will be more exact (you can make it work with floats, but you actually need some hackery with the rounding bias). It can also be reasonably approximated with 16bit binary fixed point arithmetic, which is more efficient even for scalar code, and infinitely easier to use with SIMD. – harold Jan 28 '16 at 19:33
  • @harold That's what I thought too, but it seemed a bit ugly. Thanks anyway (you're right about the SIMD-bility). – user3618511 Jan 28 '16 at 19:36
  • @perror: It's not strange that `-O3` changes the result: the compiler does the math at compile time using infinite precision for constant-propagation, and it turns into `printf("%u\n", 242);`. Also, there are no MMX instructions in the asm output. No idea what you're smoking: `mulsd xmm0,xmm2` is an SSE2 instruction: multiply scalar double using XMM registers, on the SSE FPU. MMX instructions are integer-only, and use `mm0..7`. – Peter Cordes Dec 19 '17 at 04:34

1 Answers1

0

TL:DR use lrint(x) or (int)rint(x) to convert from float to int with round-to-nearest instead of truncation. Unfortunately not all compilers efficiently inline the same math functions, though. See round() for float in C++


gcc -mno-sse2 has to use x87 for double, even in 64-bit code. x87 registers have an internal precision of 80 bits, but SSE2 uses the IEEE binary64 (aka double) format natively in XMM registers, so all the temporaries are rounded to 64-bit double at each step.

The problem isn't anything as interesting as the double rounding problem (80 bit -> 64 bit, then to integer). It's also not from gcc -O0 (the default: no extra optimizations) rounding when storing temporaries to memory, because you did the whole thing in one C statement so it does just use x87 registers for the whole expression.


It's simply that 80-bit precision leads to a result that just below 242.0 and is truncated to 241 by C's float->int semantics, while SSE2 produces a result just above 242.0 which truncates to 242. For x87, rounding down to the next lower integer happens consistently, not just 242, for any input from 1 to 65535. (I made a version of your program using atoi(argv[1]) so I could test other values, and with -O3).

Remember that int foo = 123.99999 is 123, because C uses the "truncation" rounding mode (towards zero). For non-negative numbers, this is the same as floor (which rounds towards -Infinity). https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes.


double can't represent the coefficients exactly: I printed them with gdb and got: {0.21260000000000001, 0.71519999999999995, 0.0722}. Those decimal representations are probably not exact representations of the base-2 floating point values. But they're close enough to see that the coefficients add up to 0.99999999999999996 (using an arbitrary-precision calculator).

We get consist rounding down because x87 internal precision is higher than the precision of the coefficients, so the sum rounding errors in n * rec709_luma_coeff[0] and so on, and in summing up the results, is ~2^11 smaller than the difference between the sum of the coefficients and 1.0. (64-bit significand vs. 53 bits).

The real question is how the SSE2 version managed to work! Presumably round to nearest-even on the temporaries happens to go upward in enough cases, at least for 242. It happens to produce the original input for more cases than not, but it produces input-1 for 5, 7, 10, 13, 14, 20... (252 of the first 1000 numbers from 1..1000 are "munged" by the SSE2 version, so it's not like it always works either.)


With -O3 for your source, it does the calculation at compile time with extended precision and produces the exact result. i.e. it compiles the same as printf("%u\n", n);.


And BTW you should use static const for you constants so gcc can optimize better. static is much better than plain global, though, because the compiler can see that nothing in the compilation unit writes the values or passes their address anywhere, so it can treat them as if they were const.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847