Floating point errors in C++

Question

I need some float divisions to be as accurate as their double versions. I can change the dividend (it represents a mapping, and I can offset it) to correct any floating-point errors.

To correct the errors, I use the following code:

do 
{
    float fValue = float(x) / 1024.f;
    double oldFValue = fValue;
    double dValue = double(x) / 1024.0;
    if(oldFValue != dValue)
    {
        x += 1;
    }
    else
    {
        break;
    }
}while(1);

With this code, for

x = 11

I have in debugger (Visual Studio 2010):

fValue = 0.010742188
oldFValue = 0.010742187500000000

Can you please explain why the double value is different from the float value? Is this a debugger problem or a floating-point conversion problem? I'm asking this because:

if(oldFValue != dValue)

is never true, even though it should be. Should I compare the float value with the double value in some other way? I need the result of the float division to be exactly the same as the double division.

With fixed decimal digit precision, `1/3 * 3` won't be equal to `1` (it'll be .9999999). That's just how limited precision arithmetic works. — David Schwartz, Feb 05 '12 at 11:16

score 10 · Accepted Answer · answered Feb 05 '12 at 11:00

You have to read (and understand) What Every Computer Scientist Should Know About Floating-Point Arithmetic.

johnsyweb


score 2 · Answer 2 · answered Feb 05 '12 at 11:07

How much do you know about single precision float?

It's stored as <sign><exponent><mantissa>. For a normalized number you can write the value as:

(sign ? -1 : 1) * 1.<mantissa> * 2^(exponent - 127)

As you can see, the number is ALWAYS stored as a binary fraction with an implicit leading 1, so the significand lies between 1 and 2. Unfortunately some numbers, such as 0.1 decimal, are periodic in binary, so you won't get an exact result with float.
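To make the layout concrete, here is a minimal C++ sketch that splits a float into those three fields and rebuilds the value from them. It assumes float is the 32-bit IEEE 754 format (1 sign bit, 8 exponent bits biased by 127, 23 mantissa bits). FloatFields, decodeFloat and encodedValue are names made up for this illustration, and the rebuild only handles normalized numbers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three raw bit fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit: 0 = positive, 1 = negative
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits: the fraction after the implicit leading 1
};

// Splits a float into its raw fields (assumes IEEE 754 binary32).
FloatFields decodeFloat(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bits
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Rebuilds the value as (sign ? -1 : 1) * 1.<mantissa> * 2^(exponent - 127).
double encodedValue(const FloatFields& f) {
    assert(f.exponent != 0 && f.exponent != 255);  // normalized numbers only
    const double significand = 1.0 + f.mantissa / 8388608.0;  // 8388608 = 2^23
    const double magnitude =
        std::ldexp(significand, static_cast<int>(f.exponent) - 127);
    return f.sign ? -magnitude : magnitude;
}
```

For 11.0f / 1024.0f this gives sign 0, exponent 120 and mantissa 0x300000, that is 1.375 * 2^-7, which is exactly 11/1024.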

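You may try comparing at float precision: if(oldFValue != (float)dValue). Scaling both sides by a power of two, for example

if(oldFValue*32 != (float)dValue*32)

does not help, though. Multiplying by 2^n only changes the exponent:

exponent += 5

The mantissa bits stay the same, so barring overflow or underflow the comparison gives exactly the same result as without the scaling (this holds for 2, 4, 8, 16, ..., 2^n).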

EDIT: Definitely read Johnsyweb's link.

score 2 · Answer 3 · answered Feb 05 '12 at 11:47

11 / 1024 is exactly representable in both float and double. So of course oldFValue == dValue.
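A quick way to convince yourself is to run the comparison from the question over a whole range of x. The sketch below is only an illustration: countMismatches is a made-up helper, and it assumes x is non-negative and at most 2^24, so that it can be stored exactly in a float. Dividing such a value by a power of two only changes the exponent, so float and double agree. Dividing by something like 10 does not have that property.

```cpp
#include <cassert>

// Counts the integers x in [first, last] for which the float division
// x / divisor differs from the double division x / divisor.
int countMismatches(int first, int last, int divisor) {
    // Preconditions for this sketch: every x fits in float's 24-bit significand.
    assert(first >= 0 && first <= last && last <= (1 << 24));
    assert(divisor > 0);
    int mismatches = 0;
    for (int x = first; x <= last; ++x) {
        const float fValue = float(x) / float(divisor);
        const double dValue = double(x) / double(divisor);
        if (double(fValue) != dValue) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

countMismatches(0, 1 << 20, 1024) returns 0, while countMismatches(1, 1, 10) returns 1, because 1/10 has no exact binary representation and float and double round it differently.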

Henrik


But why is my debugger showing different values? fValue = 0.010742188 oldFValue = 0.010742187500000000 – Mircea Ispas Feb 05 '12 at 11:57
@Felics Apparently it's rounding the float to 8 significant digits and displaying the double with higher resolution. Enter `(double)fValue` in the watch window and you will see the exact value. – Henrik Feb 05 '12 at 12:25
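The same effect can be reproduced outside the debugger by printing one value with two different precisions. This is a sketch: formatSignificant is a made-up helper built on std::snprintf with the %g conversion, and the 8-digit width is an assumption taken from the comment above, not documented debugger behaviour.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats value with the given number of significant digits (printf "%.*g").
std::string formatSignificant(double value, int digits) {
    assert(digits >= 1 && digits <= 17);  // 17 digits identify any double uniquely
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*g", digits, value);
    return buffer;
}
```

With float fValue = 11.0f / 1024.0f, formatSignificant(fValue, 8) gives 0.010742188 while formatSignificant(fValue, 17) gives 0.0107421875, although both strings come from the same stored value.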

score 1 · Answer 4 · answered Apr 13 '18 at 10:46

Many floating-point operations lose bits of the mantissa, even when each operand fits in it exactly (numbers like 0.5 or 0.25). For example

a + b + c

is not the same as

a + c + b

The order of the operands in a chain of multiplications also matters.

To fix such issues, you need to know how machines represent floating-point numbers.

Perhaps this will help: http://stepan.dyatkovskiy.com/2018/04/machine-fp-partial-invariance-issue.html

Below is a C example of the a + b + c issue. Good luck!

example.c

#include <stdio.h>

// Helpers declaration, for implementation scroll down
float getAllOnes(unsigned bits);
unsigned getMantissaBits();

int main() {

  // Determine mantissa size in bits
  unsigned mantissaBits = getMantissaBits();

  // If the mantissa had only 3 bits, we would get:
  // a = 0b10   m=1,   e=1
  // b = 0b110  m=11,  e=1
  // c = 0b111  m=111, e=0
  // a + b = 0b1000, m=100, e=1
  // a + c = 0b1001, rounded to 0b1000, m=100, e=1
  // a + b + c result: 0b1000 + 0b111 = 0b1111, rounded to 0b10000, m=100, e=2
  // a + c + b result: 0b1000 + 0b110 = 0b1110, m=111, e=1

  float a = 2,
        b = getAllOnes(mantissaBits) - 1,
        c = b + 1;

  float ab = a + b;
  float ac = a + c;

  float abc = a + b + c;
  float acb = a + c + b;

  printf("\n"
         "FP partial invariance issue demo:\n"
         "\n"
         "Mantissa size = %u bits\n"
         "\n"
         "a = %.1f\n"
         "b = %.1f\n"
         "c = %.1f\n"
         "(a+b) result: %.1f\n"
         "(a+c) result: %.1f\n"
         "(a + b + c) result: %.1f\n"
         "(a + c + b) result: %.1f\n"
         "---------------------------------\n"
         "diff(a + b + c, a + c + b) = %.1f\n\n",
         mantissaBits,
         a, b, c,
         ab, ac,
         abc, acb,
         abc - acb);

  return 0;
}

// Helpers

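// Returns 2^bits - 1 (the lowest `bits` bits set) converted to float.
// bits must be smaller than the width of unsigned, or the shift is undefined.
float getAllOnes(unsigned bits) {
    return (float)((1u << bits) - 1u);
}

// Grows an all-ones value one bit at a time until adding 1 no longer
// changes it; the previous bit count is the mantissa size.
unsigned getMantissaBits() {

    unsigned sz = 1;
    unsigned maxBits = 31; // keeps 1u << sz defined, assuming 32-bit unsigned
    float allOnes = 1;

    for (;sz != maxBits &&
          allOnes + 1 != allOnes;
          allOnes = getAllOnes(++sz)
          ) {}

    return sz-1;
}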

amit · Answer 5 · 2012-02-05T11:12:29.690

The problem with floating point is that there are infinitely many rational numbers in any non-singleton range. However, you only have a finite number of bits to represent your floating-point number.

Thus floating-point numbers are not real/rational numbers, and they behave differently. You should not expect them to behave exactly as real numbers would.

For this reason you should never check equality of floating-point numbers using operator==. You should calculate delta=abs(num1-num2) and check whether it is smaller than some error you can tolerate.

As @Johnsyweb said, reading and understanding the linked article is important to handle floating-point numbers correctly.
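The delta check described above can be sketched in C++ roughly as follows. nearlyEqual is a made-up helper, and the split into an absolute and a relative tolerance is an assumption of this sketch. Suitable tolerance values depend entirely on the computation, so treat the numbers in the usage note as placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most absTol, or by
// at most relTol relative to the larger of the two magnitudes.
bool nearlyEqual(double a, double b, double absTol, double relTol) {
    assert(absTol >= 0.0 && relTol >= 0.0);
    const double delta = std::fabs(a - b);
    if (delta <= absTol) {
        return true;  // covers values close to zero
    }
    return delta <= relTol * std::max(std::fabs(a), std::fabs(b));
}
```

For example, 0.1 + 0.2 == 0.3 is false for doubles, while nearlyEqual(0.1 + 0.2, 0.3, 1e-12, 1e-9) is true. If either argument is NaN, the function returns false.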
