Single precision floating point:
Sign bit: 1
Exponent: 8 bits
Mantissa: 23 bits

Double precision floating point:
Sign bit: 1
Exponent: 11 bits
Mantissa: 52 bits

What does this information mean? I don't know the English terms well.

  • Are you having trouble with the notions of exponent and mantissa or with bit widths? Have you ever seen a number expressed as 1.234*10^23 or -1.23E5? See e.g. https://en.wikipedia.org/wiki/Double-precision_floating-point_format – Bob__ Sep 09 '22 at 09:27
  • 2
    This has nothing to do with C, which does not require a specific floating point format. That having been said, you probably want to know about 64-bit, and possibly 32-bit, IEEE floating point numbers, which are the most commonly used formats. Again, nothing to do with C. – Tom Karzes Sep 09 '22 at 09:29
  • *Significand* is often used instead of *mantissa*. – Weather Vane Sep 09 '22 at 09:30
  • 3
    There's a ton of online documentation on IEEE floating point formats. A good starting point might be the [wiki page for IEEE floating point formats](https://en.wikipedia.org/wiki/IEEE_754). – Tom Karzes Sep 09 '22 at 09:31
  • 1
    "exponent" and "mantissa" are not common English words, either. You learn how floating-point works, and then you will know what these words mean. – user253751 Sep 09 '22 at 09:39
  • "Sign": whether the number is greater than zero (or equal to zero), or less than zero. "Exponent": a power of 2 which tells the general magnitude of the number. The greater the exponent, the bigger the number; conversely, the less (more negative) the exponent, the smaller the number. "Mantissa": the specific bits (0's and 1's) which describe this number, distinguishing it from other numbers of the same general magnitude. Hope this helps. – Robert Dodier Sep 09 '22 at 19:54
  • @RobertDodier *"Sign: whether the number is greater than zero (or equal to zero), or less than zero"*, actually floating point numbers have both negative and positive zero, because sign bit is independent of magnitude. For simple calculations, this makes no difference, but in some obscure cases, when result gets rounded to zero, it's still important to know if it was rounded from negative or positive value. – hyde Sep 18 '22 at 14:47

1 Answer


A floating-point quantity (in most situations, not just C) is defined by three numbers: the sign, the significand (also called the "mantissa"), and the exponent. These combine to form a pseudo-real number of the form

sign × significand × 2^exponent

This is similar to scientific notation, except that the numbers are all binary, and the multiplication is by powers of 2, not powers of 10.

For example, the number 4.000 can be represented as

+1 × 1 × 2^2

The number 768.000 can be represented as

+1 × 1.5 × 2^9

The number -0.625 can be represented as

-1 × 1.25 × 2^-1

The number 5.375 can be represented as

+1 × 1.34375 × 2^2

In any particular floating-point format, you can have different numbers of bits assigned to the different parts. The sign is always 0 (positive) or 1 (negative), so you only ever need one bit for that. The more bits you allocate to the significand, the more precision you can have in your numbers. The more bits you allocate to the exponent, the more range you can have for your numbers.

For example, IEEE 754 single-precision floating point has a total of 24 bits of precision for the significand (which is, yes, one more than your table called out, because there's literally one extra or "hidden" bit). So single-precision floating point has the equivalent of log10(2^24) or about 7.2 decimal digits worth of precision. It has 8 bits for the exponent, which gives us exponent values of about ±127, meaning we can multiply by 2^±127, giving us a decimal range of about ±10^38.

When you start digging into the details of actual floating-point formats, there are a few more nuances to consider. You might need to understand where the decimal point (really the "binary point" or "radix point") sits with respect to the number that is the significand. You might need to understand the "hidden 1 bit", and the concept of subnormals. You might need to understand how positive and negative exponents are represented, typically by using a bias. You might need to understand the special representations for infinity, and the "not a number" markers. You can read about all of these in general terms in the Wikipedia article on Floating point, or you can read about the specifics of the IEEE 754 floating-point standard which most computers use.

Once you understand how binary floating-point numbers work "on the inside", some of their surprising properties begin to make sense. For example, the ordinary-looking decimal fraction 0.1 is not exactly representable! In single precision, the closest you can get is

+1 × 0x1.99999a × 2^-4

or equivalently

+1 × 1.60000002384185791015625 × 2^-4

or equivalently

+1 × 0b1.10011001100110011001101 × 2^-4

which works out to about 0.10000000149. We simply can't get any more precise than that — we can't add any more 0's to the decimal equivalent — because the significand 1.10011001100110011001101 has completely used up our 1+23 available bits of single-precision significance.

You can read more about such floating point "surprises" at this canonical SO question, and this one, and this one.


Footnote: I said everything was based on "a pseudo-real number of the form sign × significand × 2^exponent", but strictly speaking, it's more like (-1)^sign × significand × 2^exponent. That is, the 1-bit sign component is 0 for positive, and 1 for negative.

Steve Summit