Floating point resolution seems more limited than it ought to be

Question

I'm seeing some error when simply assigning a floating point value which contains only 4 significant figures. I wrote a short program to debug this, and I don't understand what the problem is. After verifying the limits of a float on my platform, it seems like there shouldn't be any error. What's causing this?

#include <stdlib.h>
#include <stdio.h>
#include <limits>
#include <iostream>

int main(){

  printf("float size: %zu\n", sizeof(float));  // %zu is the conversion for size_t
  printf("float max: %e\n", std::numeric_limits<float>::max());
  printf("float significant figures: %i\n", std::numeric_limits<float>::digits10);

  float a = 760.5e6;
  printf("%.9f\n", a);
  std::cout.precision(9);
  std::cout << a << std::endl;

  double b = 760.5e6;
  printf("%.9f\n", b);
  std::cout << b << std::endl;

  return 0;
}

The output:

float size: 4
float max: 3.402823e+38
float significant figures: 6
760499968.000000000
760499968
760500000.000000000
760500000

Change the displayed precision from 9 to 6 and you should get the result you want. — Paul R, Jan 28 '16 at 17:53
Which part of that output, exactly, do you think is an error? — Useless, Jan 28 '16 at 17:54
Are you aware that `float` and `double` are (almost) always *binary* floating-point types, so the values are stored in binary? The value `760.5e6` has only 4 significant figures when expressed in *decimal*, but in *binary* it looks like `101101010101000100111100100000`, which has 25 significant bits. The most common `float` format is IEEE 754 binary32, which stores a maximum of 24 significant bits, so `760.5e6` can't be stored exactly - instead, you get a (pretty good) approximation to it. — Mark Dickinson, Jan 28 '16 at 18:03
The integer 760500000 cannot be exactly represented with a float. For floating point numbers of greater magnitude than [2^(mantissa_bits + 1) - 1](http://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e), not all integers may be represented. With 23 mantissa bits as is typical of a float, the closest integer to 760500000 that can be represented by a float is 760499968. — jaggedSpire, Jan 28 '16 at 18:03
Ah, thanks for the responses. It's been a while since I've worked with floating point. I was thinking of the exponent as base 10, in which case there'd obviously be enough precision, but of course that's not true with base 2. — E. Sollenberger, Jan 28 '16 at 18:20

score 3 · Answer 1 · answered Jan 28 '16 at 18:19

A float has 24 bits of precision, which is roughly equivalent to 7 decimal digits. A double has 53 bits of precision, which is roughly equivalent to 16 decimal digits.
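The figures above can be read straight from `std::numeric_limits`. The sketch below is only an illustration, and `print_precision` is a hypothetical helper name, not something from the question. It assumes `float` and `double` are IEEE 754 binary32 and binary64, which is typical but not guaranteed by the language.

```cpp
#include <cstdio>
#include <limits>

// Hypothetical helper: print the precision-related limits of a
// floating-point type T. `name` is only a label for the output.
template <typename T>
void print_precision(const char* name) {
    using L = std::numeric_limits<T>;
    // digits       : significand bits (including the implicit leading bit)
    // digits10     : decimal digits guaranteed to survive decimal -> T -> decimal
    // max_digits10 : decimal digits needed so that T -> decimal -> T is exact
    std::printf("%s: %d significand bits, %d guaranteed decimal digits, "
                "%d digits needed to round-trip\n",
                name, L::digits, L::digits10, L::max_digits10);
}
```

Under that assumption, `print_precision<float>("float")` reports 24 bits, 6 guaranteed digits and 9 round-trip digits, and `print_precision<double>("double")` reports 53, 15 and 17. The "roughly 7" and "roughly 16" figures are log10(2^24) and log10(2^53); the guaranteed counts are slightly lower.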

As mentioned in the comments, 760.5e6 is not exactly representable by float; however, it is exactly representable by double. This is why the printed results for double are exact, and those from float are not.

It is legal to request printing of more decimal digits than are representable by your floating point number, as you did. The results you report are not an error -- they are simply the result of the decimal printing algorithm doing the best it can.

score 1 · Answer 2 · edited May 23 '17 at 11:52

The number stored in your float is 760499968. This is expected behavior for an IEEE 754 binary32 floating point number, which is what a float usually is.

IEEE 754 floating point numbers are stored in three parts: a sign bit, an exponent, and a mantissa. Since all these values are stored as bits, the resulting number is sort of the binary equivalent of scientific notation. The number of stored mantissa bits is one less than the number of binary digits allowed as significant figures in the binary scientific notation, because the leading 1 of a normal number is implicit.

Just like with decimal scientific numbers, if the exponent exceeds the significant figures, you're going to lose integer precision.

The analogy only extends so far: the mantissa is a modification of the coefficient found in the decimal scientific notation you might be familiar with, and there are certain bit patterns that have special meaning in the standard.
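To make the three-part layout concrete, here is a minimal sketch that splits a float into its fields. `Binary32Fields` and `decompose` are hypothetical names chosen for illustration, and the code assumes `float` is IEEE 754 binary32.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of an IEEE 754 binary32 value.
struct Binary32Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 stored bits; the leading 1 is implicit
};

// Assumption: float is IEEE 754 binary32 (checked only by size here).
Binary32Fields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bit pattern
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For `decompose(760.5e6f)` this gives sign 0, exponent 156 (that is, 2^29 after removing the bias) and mantissa 0x35513C. Putting the implicit leading 1 back yields the 24-bit significand 0xB5513C = 11882812, and 11882812 × 2^6 = 760499968.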

The ultimate result of this storage mechanism is that the integer 760500000 cannot be exactly represented by IEEE 754 binary32 with its 23-bit mantissa. Integer-level precision is lost above 2^(mantissa_bits + 1), which is 16777216 for 23-bit mantissa floats; 16777217 is the first integer that cannot be represented. The closest integers to 760500000 that can be represented by a float are 760499968 and 760500032, each exactly 32 away. The former is chosen for representation due to the round-ties-to-even rule. Printing the value at a greater precision than the floating point number can represent will naturally result in apparent inaccuracies.
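The neighbouring representable values can be checked with `std::nextafter`. This is a sketch under the assumption that `float` is IEEE 754 binary32 and that double-to-float conversion uses the default round-to-nearest, ties-to-even mode. `Neighbours` and `neighbours` are hypothetical names.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper type: a float and the representable values on either side.
struct Neighbours {
    float below;
    float stored;
    float above;
};

// Convert `target` to float and look up its two adjacent representable floats.
Neighbours neighbours(double target) {
    const float inf = std::numeric_limits<float>::infinity();
    const float stored = static_cast<float>(target);  // rounds to a nearby float
    return { std::nextafter(stored, -inf), stored, std::nextafter(stored, inf) };
}
```

Under those assumptions, `neighbours(760.5e6)` yields 760499904, 760499968 and 760500032: floats near this magnitude are spaced 64 apart, so 760500000 sits exactly halfway between two of them. Likewise, `neighbours(16777217.0).stored` is 16777216, since 16777217 itself has no float representation.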

(pico)nitpick: `760500032` is also exactly representable as a float, so `760499968` is only one of the two closest integers to `760500000` that's representable. (It's the one that the round-ties-to-even rule chooses, of course.) — Mark Dickinson, Jan 28 '16 at 18:29

score 0 · Answer 3 · fatihk · answered Jan 28 '16 at 18:02 · edited Jan 28 '16 at 18:10

A double, which is 64 bits wide in your case, naturally has more precision than a float, which is 32 bits wide in your case. Therefore, this is the expected result.

The language specification does not require that a type exactly represent every number smaller than std::numeric_limits<T>::max().

score 0 · Answer 4 · answered Jan 28 '16 at 18:20

The number you display differs from 760500000 by only 32, a relative error of about 4 parts in 10^8, so it is off only from the 8th significant digit onward. That is well within the 6 digits of accuracy you are guaranteed for a float. If you only printed 6 digits, the output would get rounded and you'd see the value you expect.

printf("%0.6g\n", a);

See http://ideone.com/ZiHYuT
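As a small illustration of how the requested precision changes what is printed, the sketch below formats the same float with a variable number of significant digits. `format_significant` is a hypothetical helper, and the results quoted afterwards assume `float` is IEEE 754 binary32.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format `value` with `digits` significant digits
// using printf's %g conversion.
std::string format_significant(float value, int digits) {
    char buffer[64];
    // "%.*g" takes the precision from the int argument that precedes the value.
    std::snprintf(buffer, sizeof buffer, "%.*g", digits, value);
    return buffer;
}
```

With 6 digits, `format_significant(760.5e6f, 6)` gives `7.605e+08`, which matches the value that was assigned. With 9 digits it gives `760499968`, exposing the value the float actually holds.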
