
I think IEEE754 is an engineering miracle; I rely on it daily, and I don't have a question about floating-point representation, computation, or using it in a programming language.

My question is about the design choices behind how bits are allocated by IEEE754: 8 and 23 bits for the exponent and fraction fields of a 32-bit float, and 11 and 52 bits for those of a 64-bit double. Why those particular values?
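
For concreteness, here is a minimal sketch of those layouts using Python's standard `struct` module (the function names are mine, not anything standard):

```python
import struct

def fields32(x):
    """Split a binary32 into its 1-, 8-, and 23-bit sign/exponent/fraction fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fields64(x):
    """Split a binary64 into its 1-, 11-, and 52-bit sign/exponent/fraction fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(fields32(1.0))  # (0, 127, 0): biased exponent 127, i.e. 2**0
print(fields64(1.0))  # (0, 1023, 0): biased exponent 1023, i.e. 2**0
```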

The basic techniques (interpreting the exponent field with a bias, tweaked when the exponent is all 0s; having an implicit leading 1 for the fraction field, for normalized but not denormalized values; and using an all-1s exponent to signal special values) would all work just as well if there were instead 7 and 24 bits, or 9 and 22 bits, in the exponent and fraction fields of a 32-bit float, and would still allow representing a wide range of values with a roughly scale-invariant distribution.
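
As a check on that description, here is a small sketch (again Python; the function name and test values are my own) that decodes a binary32 by hand using exactly those rules:

```python
import struct

def decode32(x):
    """Decode a binary32 by hand: bias 127, implicit leading 1 only for normals."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF            # 8-bit exponent field
    f = bits & 0x7FFFFF                # 23-bit fraction field
    if e == 0xFF:                      # all 1s: infinity or NaN
        return sign * float('inf') if f == 0 else float('nan')
    if e == 0:                         # all 0s: denormal, no implicit 1
        return sign * (f / 2.0**23) * 2.0**-126
    return sign * (1 + f / 2.0**23) * 2.0**(e - 127)   # normal

print(decode32(1.0))           # 1.0
print(decode32(-0.15625))      # -0.15625 (sign 1, e = 124, f = 0x200000)
print(decode32(1e-40))         # takes the denormal branch
print(decode32(float('inf')))  # takes the all-1s branch
```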

I'm assuming that, like everything else in IEEE754, there's wisdom behind these choices, but I've never seen it explained in the descriptions of IEEE754 that I've read, hence my question here. Is there some measure of the distribution of representable values that is optimized by choosing 8 and 23? My naive guesses about this are disproven by the fact that for 64-bit floats the ratio of fraction to exponent bits is about 5-to-1, versus about 3-to-1 for 32-bit floats.
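
For reference, the fraction-to-exponent ratios across the standard binary interchange formats, computed directly (widths for binary16 and binary128 taken from IEEE754-2008):

```python
# Fraction-to-exponent bit ratios across the IEEE754 binary interchange formats.
formats = [('binary16', 5, 10), ('binary32', 8, 23),
           ('binary64', 11, 52), ('binary128', 15, 112)]
for name, exp_bits, frac_bits in formats:
    print(f'{name}: {frac_bits}:{exp_bits} is about {frac_bits / exp_bits:.1f}-to-1')
# binary32 is about 2.9-to-1, binary64 about 4.7-to-1: the ratio grows with width.
```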

  • Does this answer your question? [Why did IEEE754 choose 11 exponent bits for double aka binary64?](https://stackoverflow.com/questions/55316037/why-did-ieee754-choose-11-exponent-bits-for-double-aka-binary64) – jonrsharpe Mar 03 '23 at 14:52
  • No, not really. It acknowledges that there is a trade-off being made, specifically for double precision, but it doesn't identify what is being optimized with the 11:52 allocation. The first linked "Why do higher-precision floating point formats have so many exponent bits?" question has a good answer (about how the product of two 32-bit values shouldn't overflow 64 bits; see the numeric check after these comments), and notes for 32 bits that someone wanted to be able to represent some important physical constants. But it leaves unanswered the basic question of why 8:23 bits for single precision, and the reasoning that informed that. – Gordon Kindlmann Mar 03 '23 at 15:07
  • Gordon Kindlmann, early FP used many different bit combinations for the exponent and significand (and various bases: 2, 10, 16, ...) for decades, from the 1940s to the 1980s, before IEEE754. IEEE754 simply reflects a compromise among earlier successful choices. – chux - Reinstate Monica Mar 03 '23 at 17:25
  • It is not formally optimized or mathematically derived. It is just a selection based on what people found convenient, including some balancing of existing practices when setting the new standard. – Eric Postpischil Mar 03 '23 at 17:36
  • Relevant previous questions: [How are IEEE-754 single and double precision formats determined?](https://stackoverflow.com/questions/23064893), [What is the rationale for exponent and mantissa sizes in IEEE floating point standards?](https://stackoverflow.com/questions/4397081), [Where did the free parameters of IEEE 754 come from?](https://retrocomputing.stackexchange.com/questions/13493) – Steve Summit Mar 04 '23 at 14:20
  • There is a "mathematical derivation" mentioned in [this answer](https://stackoverflow.com/questions/23064893#29714925) (that is, an expression giving exponent and hence significand sizes for an arbitrary N-bit floating-point format), but IIRC the standard single- and double-precision formats are special cases which don't quite fall on that curve. – Steve Summit Mar 04 '23 at 14:25
  • @SteveSummit thank you for the links, which probably answer my question, but I don't see a "derivation" in [that answer](https://stackoverflow.com/questions/23064893/how-are-ieee-754-single-and-double-precision-formats-determined#29714925). Is it "In fact nowadays the rule for IEEE-754 interchange format the size for the exponent is _round(4 log2(k))_"? Sorry, what is _k_? I scanned the quoted Stevenson "A Proposed Standard ..." article and didn't see the rationale in there (or a consistently prominent use of a _k_). – Gordon Kindlmann Mar 09 '23 at 09:23
  • @GordonKindlmann I saw another answer years ago that gave a better explanation, but I can't find it; [this](https://stackoverflow.com/questions/23064893#29714925) was the best I could do. But: if we take `k` as the total number of bits, the expression `round(4 × log₂(k)) - 13` gives 7, 11, and 15 for inputs of 32, 64, and 128, which is quite close to IEEE-754's actual exponent sizes of 8, 11, and 15 for single, double, and quad precision (checked in the sketch after these comments). – Steve Summit Mar 09 '23 at 14:19
  • But you're right, there's no real "derivation" there. I get the impression that the expression basically formalizes some old rules of thumb for what seemed to work well. For example, @phuclv mentions a desire to allow up to 16 multiplications of a narrower type without overflowing the next-larger type. – Steve Summit Mar 09 '23 at 14:46
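
To ground the two rules of thumb from the comments above, a runnable sketch (one plausible reading of the overflow rule; nothing here is quoted from the standard's text):

```python
import math

# Exponent-width rule of thumb quoted above: round(4*log2(k)) - 13,
# where k is the total width in bits, versus the actual IEEE754 widths.
for k, actual in [(32, 8), (64, 11), (128, 15)]:
    print(f'k={k}: formula gives {round(4 * math.log2(k)) - 13}, actual is {actual}')
# k=32 is the one case that misses (7 vs 8); 64 and 128 match.

# One reading of the overflow rule: the product of two binary32 values,
# even the largest finite ones, stays finite in binary64.
max32 = (2 - 2**-23) * 2.0**127    # largest finite binary32, about 3.4e38
max64 = (2 - 2**-52) * 2.0**1023   # largest finite binary64, about 1.8e308
assert max32 * max32 < max64       # about 1.16e77: no overflow
```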

0 Answers