float imprecision in C

Question

I'm starting to learn the C programming language. I know that floats are represented by having:

1 bit for sign
8 bits for exponent
23 bits for mantissa

Does this mean that I can represent precisely only real numbers with 23 bits of precision? That is, if I would like to represent even an integer number that is greater than 2^23, its representation will include imprecisions?

At **least** related: https://stackoverflow.com/questions/588004/is-floating-point-math-broken — T.J. Crowder, May 18 '23 at 14:46
The format you're describing is [IEEE-754 single-precision binary floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format). Yes, like all [binary floating point formats of this type](https://en.wikipedia.org/wiki/IEEE_754), it has precision issues. Because of some trickery, it's more like 24 bits of precision than 23, but fundamentally, yes. — T.J. Crowder, May 18 '23 at 14:49
@Zaratruta, Typical `float` can exactly represent every `int25_t` value and many others. — chux - Reinstate Monica, May 18 '23 at 18:48

score 3 · Answer 1 · answered May 18 '23 at 14:51

I'm starting to learn the C programming language. I know that floats are represented by having:

1 bit for sign

8 bits for exponent

23 bits for mantissa

That's typical of the representation of floats, but it is not required by the C language.

Does this mean that I can represent precisely only real numbers with 23 bits of precision?

You get an implicit leading 1 bit for nonzero values that are not exceedingly small (subnormal), so you generally have 24 bits of mantissa. But yes, that's the precision you have to work with.

That is, if I would like to represent even an integer number that is greater than 2^23, its representation will include imprecisions?

With that floating-point format, you can exactly represent every integer whose absolute value is at most 2²⁴. You can also exactly represent some larger ones -- those whose absolute value does not exceed the largest finite `float` and whose binary representations have enough trailing zeroes to fit in 24 significant bits.

score 2 · Answer 2 · answered May 18 '23 at 14:51

The C standard does not specify what format is used for the float type, aside from some minimum requirements on it. The IEEE-754 binary32 type is commonly used. In this type, finite numbers are represented as ±2^e•f, where e is an integer −126 ≤ e ≤ 127 and f is a 24-bit (not 23-bit) binary numeral with the radix point after the first digit (so d.ddd…ddd, where each d is a 0 or 1 and there are 24 of them).

The factor f is called the significand. (Mantissa is an old term for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic.)

The floating-point number is encoded into three fields S, E, and F:

S is 0 for + and 1 for −.
If the first bit of f is 1, E is e+127. If the first bit of f is 0, E is 0.
F is the last 23 bits of f.

Note that the encoding cannot encode floating-point representations in which f starts with 0 unless e is −126. However, any such representation can be converted to an encodable representation by shifting bits in f left and decreasing e, until either the first bit of f is 1 or e is −126. Then the new representation represents the same number and is encodable.

Does this mean that I can represent precisely only real numbers with 23 bits of precision?

It means the only finite numbers that can be represented are those of the form ±2^e•f whose significand f fits in 24 or fewer bits, with an exponent e in the range −126 ≤ e ≤ 127.

That is, if I would like to represent even an integer number that is greater than 2^23, its representation will include imprecisions?

Not always; 2³⁰ is representable, because it can be represented as +2³⁰•1.00000000000000000000000₂, and 2³⁰+2⁷ is representable because it can be represented as +2³⁰•1.00000000000000000000001₂.

Integers over 2²⁴ can be represented as long as only 24 significant bits are needed to represent them and they are less than 2¹²⁸.
