Why does these two code variants produce different floating-point results?

Question

Given this example C++ code snippet:

void floatSurprise()
{
    // these come from some sort of calculation
    int a = 18680, b = 3323524, c = 121;
    float m = float(a) / c;

    // variant 1: calculate result from single expression
    float r1 = b - (2.0f * m * a) + (m * m * c);
    cout << "r1 = " << r1 << endl;

    // variant 2: break up the expression into intermediate parts, 
    /// then calculate 
    float
        r2_p1 = 2.0f * m * a,
        r2_p2 = m * m * c,
        r2 = b - r2_p1 + r2_p2;

    cout << "r2 = " << r2 << endl;
}

The output is:

dev1 = 439703
dev2 = 439702

When viewed in the debugger, the values are actually 439702.50 and 439702.25, respectively, which is interesting in itself - not sure why iostream prints floats without the fractional part by default. EDIT: The reason for this was that the default precision setting for cout was too low, needed cout << setprecision(7) at least to see the decimal point for numbers of this magnitude.

But I'm even more interested in why am I getting different results. I suppose it has to do with rounding and some subtle interplay of ints with the required float output type, but I can't put my finger on it. Which value is the correct one?

I was amazed that it was so easy to shoot myself in the foot with such a simple piece of code. Any insight will be greatly appreciated! The compiler was VC++2010.

EDIT2: I did some more investigating using a spreadsheet to generate "correct" values for the intermediate variables and found (via tracing) that indeed they were being trimmed, contributing to the precision loss in the ultimate result. I also found a problem with the single expression, because I actually used a handy function for calculating squares instead of m * m there:

template<typename T> inline T sqr(const T &arg) { return arg*arg; }

Even though I asked nicely, the compiler apparently didn't inline this, and calculated the value separately, trimming the result before returning the value to the expression, skewing the result yet again. Ouch.

Not an answer to your question, but you should prefer `double` over `float` since the precision is often much better. You'll always get idiosyncrasies like these (that's the pain with floating point arithmetic) but the error will be a lot less significant. — syam, Mar 19 '13 at 01:45
@syam: I'm trying to vectorize some code using SSE so I need to deal with floats for the moment - I want to be able to process 4 values instead of 2. Besides, AFAIR, FP calculations on recent x86 hardware are done using a native 84-bit registers regardless of float/double, the trimming takes place at the very end. — neuviemeporte, Mar 19 '13 at 01:50
@neuviemeporte: You just answered your own question in that comment. The trimming takes place at the end *except when it doesn't*. The trimming can happen *anywhere along the line* at the whim of the compiler. — Eric Lippert, Mar 19 '13 at 01:52
@neuviemeporte: *the trimming takes place at the very end* that would explain the different results then: in the first case you do all the calculation at once in the native size and trim it only once, but in the second case you trim the intermediary values and *those* trimmed values are reused for your final calculation hence the additional error. — syam, Mar 19 '13 at 01:52
@syam, Eric: whoa, didn't think of it that way. Too bad that from my perspective the .25 was correct in that it gave good results mixed with some other calculated values. Seems like I have a bug somewhere else... :/ — neuviemeporte, Mar 19 '13 at 02:00
Just a small correction: the FP registers are 80 bits wide, not 84. :) — neuviemeporte, Mar 19 '13 at 02:18
What do you mean trimming only takes place at the end? C++ 2010 5.17 3 requires that assignment convert the right-hand side to the actual type. So “trimming” takes place at each assignment. — Eric Postpischil, Mar 19 '13 at 10:39
@EricPostpischil: That is exactly what I meant. A sequence of operations is performed by the FPU in 80-bit FP registers, then the result is cast to 32-bit float before the assignment, trimming some bits of precision. Of course, as Eric Lippert pointed out, the "sequence of operations" above may contain additional trims at the compiler's discretion. — neuviemeporte, Mar 19 '13 at 12:05
@EricPostpischil: Ah, I did not know that. Good to know. The *CLR* only requires trimming on assignment when the assignment is to a static field, instance field or array element; assignments to locals can remain in higher precision. And *C#* does not ever require trimming; the specification says that calculations can be carried out in any higher precision at any time for any reason. — Eric Lippert, Mar 19 '13 at 14:43

score 11 · Accepted Answer · edited May 23 '17 at 10:32

You should read my long, long answer about why the same thing happens in C#:

(.1f+.2f==.3f) != (.1f+.2f).Equals(.3f) Why?

Summing up: first of all, you only get about seven decimal places of accuracy with float. The correct answer were you do to it with exact arithmetic throughout the entire calculation is about 439702.51239669... so you are getting darn close to the correct answer considering the limitations of a float, in either case.

But that doesn't explain why you are getting different results with what looks like exactly the same calculations. The answer is: the compiler is permitted wide lattitude to make your math more accurate, and apparently you have hit upon two cases where the optimizer takes what is logically the same expression and does not optimize them down to the same code.

Anyway, read my answer regarding C# carefully; everything in there applies to C++ just as well.

Thanks for the .512... value - got paranoid and wouldn't even trust the Windows calculator. Not sure I should trust you, come to think of it.... ;) — neuviemeporte, Mar 19 '13 at 01:57

Why does these two code variants produce different floating-point results?

1 Answers1

Linked