Wrong floating point results when calculating in loops

Question

I wrote the following C++ code in Visual Studio 2017 to evaluate floating point performance:

#include <cstdio>    // printf
#include <iostream>
#include <Windows.h>

int main(void)
{
    LARGE_INTEGER frequency;        // ticks per second
    LARGE_INTEGER t1, t2;           // ticks
    double elapsedTime = 0;
    float result = 1000.0f;
    float result2 = 2000.0f;
    float result3 = 3000.0f;
    float result4 = 4000.0f;
    float result5 = 5000.0f;
    float result6 = 6000.0f;
    float result7 = 7000.0f;
    float result8 = 8000.0f;
    long long i;

    // get ticks per second
    QueryPerformanceFrequency(&frequency);

    // start timer
    QueryPerformanceCounter(&t1);
       
    for (i = 0; i < 10000000; i++)
    {
        result = result + 1.4f;
        result2 = result2 + 1.4f;
        result3 = result3 + 1.4f;
        result4 = result4 + 1.4f;
        result5 = result5 + 1.4f;
        result6 = result6 + 1.4f;
        result7 = result7 + 1.4f;
        result8 = result8 + 1.4f;       
    }

    // stop timer
    QueryPerformanceCounter(&t2);
    
    // compute the elapsed time in millisec
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    
    printf("Time for calculation: %f ms\n", elapsedTime);
    printf("Resulting value1: %10f\n", result);
    printf("Resulting value2: %10f\n", result2);
    printf("Resulting value3: %10f\n", result3);
    printf("Resulting value4: %10f\n", result4);
    printf("Resulting value5: %10f\n", result5);
    printf("Resulting value6: %10f\n", result6);
    printf("Resulting value7: %10f\n", result7);
    printf("Resulting value8: %10f\n", result8);
}

I expected "result" to end at 1400001000, "result2" at 1400002000, and so on. However, all result variables have the value 33554432.000000 at the end of the loop. It doesn't matter whether I execute the loop 10 billion times instead of 1 billion times; the results stay the same.

However, when I set the compiler option "floating point model" to "fast", all result variables change to 268435456.000000.

Can anyone explain this strange behavior?

I expected correct floating point results.

The line "for (i = 0; i < 10000000; i++)" should be "for (i = 0; i < 1000000000; i++)" to get the result I mentioned. — Sonny86, Mar 27 '22 at 14:45
It's a precision problem with `float`. Change `1000000000` to `1` or `2` and you'll see that it works perfectly well until the result becomes so large that precision is lost with `float` and `%f`. `long double` and `%Lf` will work for larger numbers but even then it will lose precision eventually. And your tags are completely wrong. Should be something like `c++` and `floating-point`. — Jeff Holt, Mar 27 '22 at 14:58
Does this answer your question? [What is the difference between float and double?](https://stackoverflow.com/questions/2386772/what-is-the-difference-between-float-and-double) — Jeff Holt, Mar 27 '22 at 15:01
Sorry for the wrong tags, I'm completely new to Stack Overflow. But how does the precision problem explain why the results don't change when I execute the loop 10 billion times instead of 1 billion times? I mean, even if the results are imprecise, they should change with a much larger loop count. — Sonny86, Mar 27 '22 at 15:13

score 3 · Answer 1 · answered Mar 27 '22 at 15:25

> However, the result variables all have the value 33554432.000000 at the end of the loop.

Floating-point operations generally produce the result that is the same as computing the nominal operation with real-number mathematics and rounding the result to the nearest value representable in the floating-point format.

The format commonly used for float has 24-bit significands. In this format, the representable numbers from 8,388,608 (2²³) to 16,777,216 (2²⁴) are the integers: Each of them fits within 24 bits and uses exactly 24 bits. There is no room for a bit with position value less than 1, because that would be a 25th bit. In the interval from 16,777,216 (2²⁴) to 33,554,432 (2²⁵), the representable values are the even integers: As integers, numbers in this interval require 25 bits to represent. The floating-point format can represent only the first 24 of these bits, so the last bit, with position value 1, is necessarily 0.

Consider what happens when result is 16,777,216 and we add 1.4. (Actually, 1.39999997615814208984375 will be added, because that is the value representable in this float format that is nearest 1.4, so it is what results from the source text 1.4f. However, we will approximate with 1.4 for illustration.) The real-number result would be 16,777,217.4. This number is not representable in float. The two nearest representable numbers are 16,777,216 and 16,777,218. The latter is nearer, so it is the result of the floating-point addition.

When we add 1.4 again, the same rounding to the nearest representable result occurs, producing 16,777,220. This continues until we reach 33,554,432.

From 33,554,432 (2²⁵) to 67,108,864 (2²⁶), only numbers that are multiples of four are representable: Each of these numbers requires 26 bits to represent as an integer. The floating-point format can represent only the first 24, so the last two bits are zero.

So, when we add 33,554,432 and 1.4, the real-number result is 33,554,433.4, and the two nearest representable numbers are 33,554,432 and 33,554,436. Of these, the former is closer, so it is the result. Adding 33,554,432 and 1.4 produces 33,554,432.

At this point, all further additions of 1.4 produce the same result, 33,554,432, and that is why it is the final result of your code.

> However, when I set the compiler option "floating point model" to "fast", all result variables change to 268435456.000000.

It would be necessary to examine the assembly code generated by the compiler, or otherwise inspect what the compiler is doing, to be sure. However, this result can occur if the compiler has optimized by combining eight individual iterations of your loop into one, adding 11.2 each time instead of 1.4. This then results in behavior that is the same as above except the value being added is larger, so it reaches larger values before the rounding prevents further progress. (268,435,456 is eight times 33,554,432.)

Thanks for the detailed answer, that explains it well. So calculating with floats in larger loops is probably a bad idea. — Sonny86, Mar 27 '22 at 15:56
