I've been looking at floating point formats, both IEEE 754 and x87. Here's a summary:

                Total       Bits per field
Precision       Bits    Sign  Exponent  Mantissa
Single          32      1     8         23  (+1 implicit)   
Double          64      1     11        52  (+1 implicit)
Extended (x87)  80      1     15        64
Quadruple       128     1     15        112 (+1 implicit) 

My question is, why do the higher-precision formats have so many exponent bits? Single-precision gets you a maximum value on the order of 10^38, and I can see how in extreme cases (number of atoms in the universe) you might need a larger exponent. But double-precision goes up to ~10^308, and extended- and quadruple-precision have even more exponent bits. This seems much larger than could ever be necessary for actual hardware-accelerated computation. (It's even more absurd with negative exponents!)

That being said, the mantissa bits are so obviously valuable that I figure there must be a good reason to sacrifice them in favor of the exponent. So what is it? I thought it might be to represent the difference between two adjacent values without needing subnormals, but even that doesn't take a big change in the exponent (-6 out of a full range of +1023 to -1022 for a double).
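
For concreteness, the ranges I'm talking about can be read straight out of `<float.h>`. Here's a quick sketch (assuming an x86 target where `long double` is the 80-bit x87 format):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    // Decimal exponent ranges implied by the bit allocations above
    printf("float:       ~1e%d .. ~1e%d\n", FLT_MIN_10_EXP, FLT_MAX_10_EXP);   // ~1e-37  .. ~1e38
    printf("double:      ~1e%d .. ~1e%d\n", DBL_MIN_10_EXP, DBL_MAX_10_EXP);   // ~1e-307 .. ~1e308
    printf("long double: ~1e%d .. ~1e%d\n", LDBL_MIN_10_EXP, LDBL_MAX_10_EXP); // ~1e-4931 .. ~1e4932
    return 0;
}
```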

Adam Haun

Are you sure mantissa bits are so valuable? How many physical quantities have been measured to one part in 2^53? The wide exponent range reduces the risk of overflow on intermediate results. – Patricia Shanahan Nov 24 '16 at 01:28

I thought the precision was needed to limit the accumulation of rounding errors over large numbers of FP operations. – Adam Haun Nov 24 '16 at 02:58

1 Answer

The IEEE-754 floating-point standard grew out of work Professor William Kahan of UC Berkeley had done as a consultant to Intel during the creation of the 8087 math coprocessor. One of the design criteria for what became the IEEE-754 floating-point formats was functional compatibility with existing proprietary floating-point formats to the largest extent possible. The book

John F. Palmer and Stephen P. Morse, "The 8087 Primer", Wiley, New York, 1984.

specifically mentions the 60-bit floating-point format of the CDC 6600, with an 11-bit exponent and 48-bit mantissa, with respect to the double-precision format.

The following published interview (which inexplicably mangles Jerome Coonen's name into Gerome Kunan) provides a brief overview of the genesis of IEEE-754, including a discussion of the choice of floating-point formats:

Charles Severance, "IEEE 754: An Interview with William Kahan", IEEE Computer, Vol. 31, No. 3, March 1998, pp. 114-115 (online)

In the interview, William Kahan mentions the adoption of the floating-point formats of DEC's extremely popular VAX minicomputers, in particular the F format for single precision, with 8 exponent bits, and the G format for double precision, with 11 exponent bits.

The VAX F format goes back to DEC's earlier PDP-11 architecture, and the rationale for choosing 8 exponent bits is stated in PDP-11/40 Technical Memorandum #16: a desire to be able to represent all important physical constants, including the Planck constant (6.626070040 x 10^-34) and the Avogadro constant (6.022140857 x 10^23).

The VAX had originally used the D format for double precision, which used the same number of exponent bits, namely 8, as the F format. This was found to cause trouble through underflow in intermediate computations, for example in the LAPACK linear algebra routines, as noted in a contribution by James Demmel to the NA Digest, Volume 92, Issue 7 (February 16, 1992). The issue is also alluded to in the interview with Kahan, which mentions that the subsequently introduced VAX G format was inspired by the CDC 6600 floating-point format.

David Stevenson, "A Proposed Standard for Binary Floating-Point Arithmetic", IEEE Computer, Vol. 14, No. 3, March 1981, pp. 51-62 (online)

explains the choice of number of exponent bits for IEEE-754 double precision as follows:

For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format — a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.
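
That guarantee can be sanity-checked numerically; here is a small sketch of mine (not code from the paper). Since FLT_MAX = (2 - 2^-23) * 2^127, the product of eight such factors is about 2^1024 * (1 - 2^-21), which lands just below DBL_MAX = 2^1024 * (1 - 2^-53):

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    // Worst case for the quoted guarantee: eight maximal float factors
    double p = FLT_MAX;
    for (int i = 1; i < 8; i++)
        p *= FLT_MAX;
    printf("FLT_MAX^8 = %g (finite: %s)\n", p, isfinite(p) ? "yes" : "no");
    // ~1.79769e308, still just under DBL_MAX

    p *= FLT_MAX; // a ninth factor does overflow
    printf("FLT_MAX^9 = %g\n", p); // inf
    return 0;
}
```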

The "extended" floating-point types of IEEE-754 were introduced specifically as intermediate formats that ease implementation of accurate standard mathematical functions for the corresponding "regular" floating-point types.

Jerome T. Coonen, "Contributions to a Proposed Standard for Binary Floating-Point Arithmetic", PhD dissertation, University of California, Berkeley, 1984

states that precursors were extended accumulators in the IBM 709x and Univac 1108 machines, but I am not familiar with the formats used for those.

According to Coonen, the choice of the number of mantissa bits in the extended formats was driven by the needs of binary-decimal conversion as well as general exponentiation x^y. Palmer and Morse mention exponentiation as well and provide details: due to the error magnification properties of exponentiation, a naive computation utilizing an extended format requires as many additional mantissa bits as there are exponent bits in the regular format in order to deliver accurate results. Since double precision uses 11 exponent bits, 53 + 11 = 64 mantissa bits are required for the double-extended format.
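
To see where the magnification comes from, here is a minimal sketch of the naive computation (my illustration, not code from Palmer and Morse), assuming an x86 target where `long double` is the x87 double-extended type:

```c
#include <math.h>

// Naive x^y via the identity x^y = 2^(y * log2(x)), for x > 0.
// The intermediate t = y * log2(x) is the exponent of the result, so its
// magnitude can approach 2^11 for results spanning the double range. A
// relative rounding error of 2^-p in t becomes an absolute error of up to
// ~2^(11-p), and since 2^(t+e) ~= 2^t * (1 + e*ln2), that is also roughly
// the relative error of the final result. With p = 53 the result could be
// off by on the order of 2^11 ulps; with the 64-bit extended mantissa
// (p = 64) the error shrinks back to the ~2^-53 level.
double naive_pow(double x, double y)
{
    long double t = (long double)y * log2l((long double)x);
    return (double)exp2l(t);
}
```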

I checked the draft documents published ahead of the release of the IEEE-754 standard, in addition to Coonen's PhD thesis, and was unable to find a stated rationale for the choice of 15 exponent bits in the double-extended format.

From personal design experience with x87 floating-point units, I am aware that the straightforward implementation of elementary math functions, without danger of intermediate overflow, motivates at least three additional exponent bits. The use of 15 bits specifically may be an artifact of the hardware design: the 8086 CPU used 16-bit words as a basic building block, so the requirement of 64 mantissa bits in the double-extended format would lead to a format comprising 80 bits (= five words), leaving 15 bits for the exponent after the sign bit.
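
That layout is easy to check directly. Here is a small sketch of mine (it assumes an x86 target where `long double` is the 80-bit format, stored little-endian in memory) that picks the five 16-bit words apart:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    // x87 double-extended: bytes 0-7 hold the 64 mantissa bits (explicit
    // integer bit, no implicit leading 1); bytes 8-9 hold the sign bit
    // and the 15-bit exponent (bias 16383).
    long double x = -1.5L;
    unsigned char b[sizeof x];
    memcpy(b, &x, sizeof x);

    unsigned se   = (unsigned)b[9] << 8 | b[8]; // top 16-bit word
    unsigned sign = se >> 15;
    int      exp  = (int)(se & 0x7FFF) - 16383; // remove the bias
    unsigned long long mant = 0;
    for (int i = 7; i >= 0; i--)
        mant = mant << 8 | b[i];

    // -1.5 = -1.1b * 2^0: sign=1, exp=0, mantissa=0xC000000000000000
    // (note the explicit leading 1 in the top mantissa bit)
    printf("sign=%u exp=%d mantissa=0x%016llX\n", sign, exp, mant);
    return 0;
}
```

The top 16-bit word holds the sign and the 15-bit exponent, exactly filling one 8086 word, which fits the hardware rationale above.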

njuffa

This is a fantastic answer. Thank you very much for taking the time to write it! – Adam Haun Nov 24 '16 at 16:10

Given the x87 requirement for the mantissa to be at least 64 bits (thanks for pointing out the reasons for that), making it *exactly* 64 bits allows more convenient/efficient software floating-point implementations than if it had been, say, 65 or 66 bits. Having the sign + exponent exactly fill a 16-bit word may also have been seen as desirable. So given that the format needed more than 72 bits, using all 80 makes sense, and distributing them the way they did also makes sense. – Peter Cordes Oct 23 '20 at 05:09

x87 did choose *not* to use an implicit leading mantissa bit, so they did sort of leave 1 bit of redundancy. That makes sense for internal use, since the hardware never has to decode the implicit bit from the exponent. They could have used an 81-bit redundant format internally and converted to/from an 80-bit implicit format on load/store, as is done for binary32 / binary64 `fld` / `fstp`. But with transistor budgets back then, it's easy to see why they'd choose not to: the only benefit would be 1 more exponent bit, at the cost of making soft-float routines decode the implicit bit themselves. – Peter Cordes Oct 23 '20 at 05:15