
I have been reading, and it seems that IEEE 754 defines a 64-bit float's (double's) exponent as 11 bits (https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

My question is why?

A 64-bit float has a 53-bit significand (the first bit is implied to be 1, so only 52 bits are actually stored), so you need the exponent to be able to represent at least the number 53 (to be able to move the binary radix point to any position in the significand), so for now you need 7 bits.

Then you also need negative exponents, so 8 bits.

Also, you need representations for 0, negative and positive infinity, and NaN (those need 4 additional representations), so I guess 10 bits.

So my question is: why 11 bits for the exponent and not 10 or 12, and how would they be determined for other lengths of floats?

  • It is a binary point, not a decimal point. I do not understand your reasoning. If you wanted a really high-precision, narrow-range type, you could have a much smaller exponent field and use the bits for more significand. To get a wider range, at a precision cost, use more exponent bits and fewer significand bits. – Patricia Shanahan Mar 23 '19 at 17:24
  • I have the impression you don't quite understand the hidden bit. It is left out **in the *stored* representation of normalized values** because it is always 1. But if you decompose such a float into its single components (or if the hardware does that, internally), it is added to the significand. That does not affect the exponent. – Rudy Velthuis Mar 23 '19 at 17:34
  • I do understand the hidden bit: it is obviously 1, so why store it? But my question is about the exponent, not the significand; please see the edit. – lsauceda Mar 23 '19 at 17:35
  • The exponent is additional to the bits of the significand. The number of bits in the exponent has nothing to do with the number of bits in the significand, except that it would be cool if they all fit in a 64-bit value together. You don't move the binary point anywhere in the significand. You multiply the significand, which is always (except for denormals) between 1 (inclusive) and 2 (exclusive), by 2**exponent. And your reasoning is wrong anyway: 53 < 64, so you'd only need 6 bits. But the bits of the exponent allow you to reach a much bigger range, and that is the reason for the number. – Rudy Velthuis Mar 23 '19 at 17:43
  • So was the number 11 just arbitrarily chosen then? – lsauceda Mar 23 '19 at 17:45
  • 2
    @Isauceda: not entirely arbitrarily, I guess (I was not present when they chose it). But 52 + 11 + 1 = 64, so they chose some kind of nice balance between bits in significand and remaining bits for the exponent (and sign). They could have used 48 + 15 + 1 too, but that would minimize the precision (but greatly enhance the range). You could say that 53 was more or less "arbitrarily" chosen. – Rudy Velthuis Mar 23 '19 at 17:48
  • @PatriciaShanahan: In real life, if you want extra significand precision with the same exponent range, you can use so-called "double double" and take advantage of hardware support for IEEE binary64 math to go much faster than a soft-float format with a narrower exponent and wider significand. [double-double precision floating point as sum of two doubles](//stackoverflow.com/q/9857418) / [float128 and double-double arithmetic](//stackoverflow.com/q/31647409) / [Optimize for fast multiplication but slow addition: FMA and doubledouble](//stackoverflow.com/q/30573443) (x86 SIMD for double double) – Peter Cordes Mar 23 '19 at 17:49
  • @Peter: I guess the question is more why these values (53 bits for the significand and thus 12 bits left for exponent and sign) were chosen back then. Of course you can use the 80-bit extended-precision float, but even there, the question is: why did they choose 64 + 15 + 1 and not, say, 62 + 17 + 1 or 66 + 13 + 1, etc.? – Rudy Velthuis Mar 23 '19 at 17:53
  • @Peter: FWIW, I think a very good way to get a feeling for the intricacies of FP is to write at least one software FP type yourself. Will never beat hardware, but gives you great insight. And if you write it properly, it can give exactly the same results for **all your platforms**, if that is important to you. – Rudy Velthuis Mar 23 '19 at 17:57
  • @RudyVelthuis: Agreed, I found implementing `log` and `nextafter` with x86 integer SIMD to manipulate FP bit patterns was sufficient for me to really grok it. `log` can take advantage of the exponential-based representation of floats by getting the integer part of the result from converting the integer exponent *to* a float. exp is the opposite. [Efficient implementation of `log2(__m256d)` in AVX2](//stackoverflow.com/q/45770089). Also `nextafter` is an eye-opener [Implementing std::nextafter: Should denormals-are-zero mode affect it? If so, how?](//scicomp.stackexchange.com/q/23191) ... – Peter Cordes Mar 23 '19 at 18:13
  • 1
    @Isauceda I answered a very similar question [here](https://stackoverflow.com/a/40789013/780717) – njuffa Mar 23 '19 at 18:14
  • ... the fact that the exponent bias makes integer increment or compare of the FP bit-pattern "work", modulo handling of sign/magnitude vs. 2's complement, is pretty cool (see the sketch after this comment thread). Also, Bruce Dawson's series of FP articles is fantastic: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ – Peter Cordes Mar 23 '19 at 18:14
  • @Peter: thanks for the links, although I already knew the Dawson one. – Rudy Velthuis Mar 23 '19 at 18:19
  • possible duplicates: [How are IEEE-754 single and double precision formats determined?](https://stackoverflow.com/q/23064893/995714), [Why did IEEE 754 choose to allocate 23 bits to the mantissa and not 22 or 24 (etc.)?](https://stackoverflow.com/q/51777010/995714), [What is the rationale for exponent and mantissa sizes in IEEE floating point standards?](https://stackoverflow.com/q/4397081/995714) – phuclv Dec 01 '19 at 02:24
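To illustrate the bit-pattern trick from the comment above, here is a minimal C sketch (assuming `double` is IEEE binary64 and using `memcpy` for the type pun): for positive finite values, comparing the raw bit patterns as unsigned integers gives the same ordering as comparing the doubles, and incrementing the bit pattern steps to the next representable value.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Reinterpret a double's bits as a 64-bit unsigned integer (well-defined via memcpy).
static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

int main(void) {
    // Because the biased exponent field sits above the significand,
    // positive finite doubles sort the same way as their bit patterns.
    double a = 1.5, b = 1000.25;
    printf("a < b            : %d\n", a < b);                     // 1
    printf("bits(a) < bits(b): %d\n", bits_of(a) < bits_of(b));   // 1

    // Incrementing the bit pattern of a positive finite double gives
    // the next representable value (the core idea behind nextafter).
    uint64_t u = bits_of(a) + 1;
    double next;
    memcpy(&next, &u, sizeof next);
    printf("next after 1.5   = %.17g\n", next);                   // 1.5000000000000002
    return 0;
}
```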

1 Answer


Related: *Why do higher-precision floating point formats have so many exponent bits?*, about why the design choices were made this way.

Wikipedia's https://en.wikipedia.org/wiki/Double-precision_floating-point_format article is excellent.

See also Bruce Dawson's series of FP articles, starting with https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/; it's essential reading for an intermediate/advanced understanding of FP.

Also, https://www.h-schmidt.net/FloatConverter/IEEE754.html is great for trying out bit patterns.


Most of your reasoning about why the exponent field has to be some minimum length is wrong! Some of the factors you cite are reasonable design choices for general-purpose use-cases, but not required.

The design choice is a matter of giving lots of dynamic range, so that high precision is maintained over a huge range of magnitudes.
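To make "lots of dynamic range" concrete, here's a small C sketch (assuming `double` on your platform is IEEE binary64, which is true on all mainstream systems) that prints the standard range/precision constants from `<float.h>`:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    // With 11 exponent bits (bias 1023) and a 53-bit significand:
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);   // 53 significand bits
    printf("DBL_MAX      = %g\n", DBL_MAX);        // ~1.8e308, largest finite value
    printf("DBL_MIN      = %g\n", DBL_MIN);        // ~2.2e-308, smallest normal value
    printf("DBL_EPSILON  = %g\n", DBL_EPSILON);    // ~2.2e-16, spacing just above 1.0
    printf("DBL_MAX_10_EXP = %d, DBL_MIN_10_EXP = %d\n",
           DBL_MAX_10_EXP, DBL_MIN_10_EXP);        // decimal exponent range: 308, -307
    return 0;
}
```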

> so you need the exponent to be able to represent at least the number 53 (to be able to move the binary radix point to any position in the significand), so for now you need 7 bits.

Not true. There's no inherent reason why a binary floating-point format in the style of IEEE754 needs to support an exponent range large enough to make that happen. If large numbers aren't important, you could choose so few exponent bits that even with the largest exponent, the nearest representable values are closer than 1.0 apart.

Also, 6 bits gives you 64 exponent values, which is enough to move the binary point beyond the end of the 53-bit significand.

> Then you also need negative exponents, so 8 bits.

Yes, it's pretty reasonable to want your dynamic range centered around 1. But for some use-cases, e.g. audio processing, you might only ever use numbers with magnitudes from [0..1). Or maybe up to 4 to allow some room for larger temporary values.

In that case you'd want to choose your exponent bias to have most of your exponent values represent negative exponents.
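As a sketch of that design freedom, here is a hypothetical (not IEEE) 8-bit format with 1 sign bit, 4 exponent bits, and 3 significand bits, where the bias is deliberately chosen so most exponent encodings are negative. That keeps the representable range near [0..1), with a little headroom up to about 4 for temporaries; the toy decoder below ignores Inf/NaN entirely.

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

// Hypothetical 8-bit format (NOT part of IEEE 754): 1 sign, 4 exponent, 3 significand bits.
// EXP_BIAS is a free design parameter; picking a large bias skews the
// representable range toward small magnitudes (e.g. audio samples in [0..1)).
#define EXP_BIAS 14   // exponent field 1..15 -> actual exponents -13..+1

double decode_mini(uint8_t bits) {
    int sign        = bits >> 7;
    int exp_field   = (bits >> 3) & 0xF;
    int significand = bits & 0x7;

    double value;
    if (exp_field == 0) {
        // subnormal: no implicit leading 1, exponent fixed at (1 - bias)
        value = ldexp((double)significand / 8.0, 1 - EXP_BIAS);
    } else {
        // normal: implicit leading 1
        value = ldexp(1.0 + (double)significand / 8.0, exp_field - EXP_BIAS);
    }
    return sign ? -value : value;
}

int main(void) {
    printf("largest value   = %g\n", decode_mini(0x7F)); // 3.75
    printf("smallest normal = %g\n", decode_mini(0x08)); // 2^-13 ~= 1.2e-4
    return 0;
}
```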

> Also, you need representations for 0, negative and positive infinity, and NaN (those need 4 additional representations), so I guess 10 bits.

No, it doesn't take extra flag bits, just one of the exponent encodings to signal Inf/NaN, distinguished by the significand. So for your hypothetical 8 exponent bits, this would only reduce you from 256 to 255 possible exponent values for actual numbers. E.g. 2^-127 to 2^+127 is still a big range.

The maximum (all-ones) exponent value means Inf (significand = 0) or NaN (any other significand value), so IEEE binary64 spends 2 × (2^52 - 1) of its 2^64 bit patterns on NaN payloads. This doesn't get as much use as the designers maybe hoped, and might have been better spent on gradual overflow, the way subnormals allow gradual underflow.
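A quick way to see those encodings is to pull the three fields out of the bit pattern. A minimal C sketch, assuming `double` is IEEE binary64:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

// Decode the three fields of an IEEE binary64 (assumes double is binary64).
static void dump_fields(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                   // well-defined type pun
    unsigned sign        = bits >> 63;
    unsigned exp_field   = (bits >> 52) & 0x7FF;      // 11 bits, biased by 1023
    uint64_t significand = bits & 0xFFFFFFFFFFFFF;    // low 52 bits

    printf("%-10g sign=%u exponent=0x%03X significand=0x%013llX\n",
           d, sign, exp_field, (unsigned long long)significand);
}

int main(void) {
    dump_fields(1.0);       // exponent = 0x3FF (the bias, 1023), significand = 0
    dump_fields(INFINITY);  // exponent = 0x7FF (all-ones), significand = 0
    dump_fields(NAN);       // exponent = 0x7FF (all-ones), significand != 0
    dump_fields(0.0);       // all fields zero
    return 0;
}
```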

±0.0 is a special case of the subnormal encoding: the minimum exponent value (encoded as 0) and significand = 0. A biased exponent of 0 implies a leading 0 for the significand instead of the usual implicit 1; non-zero significand values with that exponent are subnormal numbers, allowing gradual underflow. This special case takes another exponent value away from "normal" numbers.

So 0.0 is represented by an all-zero bit pattern, which is very convenient because memory is commonly initialized with integer zero, and it makes it possible to zero arrays with memset (which only accepts a 1-byte pattern, not the 4- or 8-byte pattern you'd need to initialize an array with any other repeating double).
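For example (assuming IEEE binary64 doubles), zero-filling the bytes really does give an array of +0.0:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    double buf[4];
    // All-zero bytes are the bit pattern of +0.0 (assuming IEEE binary64),
    // so memset with 0 is a valid way to zero-fill an array of doubles.
    memset(buf, 0, sizeof buf);
    for (int i = 0; i < 4; i++)
        printf("%g ", buf[i]);   // prints: 0 0 0 0
    printf("\n");
    return 0;
}
```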

Peter Cordes