Related: *Why do higher-precision floating point formats have so many exponent bits?*, about why the design choices were made this way.
Wikipedia's [Double-precision floating-point format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) article is excellent.
See also [Comparing Floating Point Numbers, 2012 Edition](https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/): Bruce Dawson's series of FP articles is essential reading for an intermediate/advanced understanding of FP.
The [IEEE 754 float converter](https://www.h-schmidt.net/FloatConverter/IEEE754.html) is also great for experimenting with bit-patterns.
Most of your reasoning about why the exponent field has to be some minimum length is wrong! Some of the factors you cite are reasonable design choices for general-purpose use-cases, but not required.
The design choice is a matter of giving the format lots of dynamic range, so it maintains high relative precision over a huge range of magnitudes.
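To put numbers on that trade-off, here's a quick sketch (just illustrative, assuming `double` is IEEE binary64, which it is on every mainstream C implementation) that prints the spacing between adjacent doubles at a few magnitudes. The absolute spacing scales with the value, but the relative spacing stays near 2^-52:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    // Spacing between adjacent representable doubles (1 ULP) at various magnitudes.
    // Absolute spacing grows with the magnitude, but the *relative* spacing
    // stays near 2^-52; that's the point of a floating-point format.
    double samples[] = { 1e-300, 1.0, 1e6, 1e300 };
    for (int i = 0; i < 4; i++) {
        double x = samples[i];
        double ulp = nextafter(x, INFINITY) - x;
        printf("x = %-8g  ulp = %g  ulp/x = %g\n", x, ulp, ulp / x);
    }
}
```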
> so you need the exponent to be able to represent at least number 53 (to be able to move the binary radix point to any location in the significand), so for now you need 7 bits.
Not true. There's no inherent reason why a binary floating-point format in the style of IEEE754 needs to support an exponent range large enough to make that happen. If large numbers aren't important, you could choose so few exponent bits that even with the largest exponent, the nearest representable values are closer than 1.0 apart.
Also, 6 bits gives you 64 exponent values, which is enough to move the binary point beyond the end of the 53-bit significand.
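For reference in the rest of this answer, here's a minimal sketch (assuming `double` is IEEE binary64) that extracts the 1-bit sign, 11-bit biased exponent, and 52-bit significand fields of a double:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Print the sign, biased exponent, and significand fields of an IEEE binary64.
static void dump_fields(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                 // type-pun without UB
    unsigned sign     = bits >> 63;
    unsigned exponent = (bits >> 52) & 0x7FF;       // 11 bits, biased by 1023
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;  // low 52 bits
    printf("%-12g sign=%u  biased_exp=%4u  significand=0x%013llx\n",
           d, sign, exponent, (unsigned long long)mantissa);
}

int main(void) {
    dump_fields(1.0);      // biased_exp = 1023 (true exponent 0)
    dump_fields(53.0);     // 53 = 1.10101b * 2^5, so biased_exp = 1028
    dump_fields(0.5);      // true exponent -1, biased_exp = 1022
}
```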
> Then you also need negative exponents, so 8 bits.
Yes, it's pretty reasonable to want your dynamic range centered around 1. But for some use-cases, e.g. audio processing, you might only ever use numbers with magnitudes in [0..1), or maybe up to 4 to allow some room for larger temporary values. In that case you'd want to choose your exponent bias so that most of your exponent values represent negative exponents.
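As a sketch of that bias arithmetic (the 8-exponent-bit format and its bias of 252 here are hypothetical, just to illustrate the point):

```c
#include <stdio.h>

int main(void) {
    // IEEE binary64 uses bias = 1023, centering the range near 1.0:
    // encoded exponents 1..2046 represent true exponents -1022..+1023.
    int bias = 1023;
    printf("binary64:      true exponents %d .. %d\n", 1 - bias, 2046 - bias);

    // A hypothetical 8-exponent-bit format for audio could pick a larger bias so
    // most encodings cover tiny magnitudes.  With bias = 252, encodings 1..254
    // give true exponents -251..+2, so normal magnitudes top out just under 8:
    // plenty of headroom above 4, with almost all encodings spent below 1.0.
    int hypothetical_bias = 252;
    printf("hypothetical:  true exponents %d .. %d\n",
           1 - hypothetical_bias, 254 - hypothetical_bias);
}
```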
> Also you need representations for 0, negative and positive infinity, and NaN (those need 4 additional representations) so I guess 10 bits.
No, it doesn't take extra flag bits, just one of the exponent encodings to signal Inf/NaN depending on the significand. So for your hypothetical 8 exponent bits, this would only reduce you from 256 to 255 possible exponent values for actual numbers. e.g. 2^-127 to 2^+127 is still a big range.
The maximum (all-ones) exponent value means Inf (significand=0) or NaN (any other significand value), so IEEE binary64 spends 2 × 2^52 − 2 of its 2^64 bit-patterns on NaN payloads. This doesn't get as much use as the designers maybe hoped, and might have been better spent on gradual overflow, like how subnormals allow gradual underflow.
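A quick sketch (illustrative, assuming IEEE binary64) showing that Inf and NaN share the all-ones exponent field and differ only in the significand:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

static uint64_t bits_of(double d) {
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

int main(void) {
    double inf  = INFINITY;
    double qnan = nan("");   // a quiet NaN; the payload bits are implementation-chosen

    // Both have biased exponent = 0x7FF (all ones); they differ in the significand:
    // significand == 0  ->  infinity, anything else  ->  NaN.
    printf("+inf: exp=0x%03x significand=0x%013llx\n",
           (unsigned)((bits_of(inf)  >> 52) & 0x7FF),
           (unsigned long long)(bits_of(inf)  & 0xFFFFFFFFFFFFFULL));
    printf(" nan: exp=0x%03x significand=0x%013llx\n",
           (unsigned)((bits_of(qnan) >> 52) & 0x7FF),
           (unsigned long long)(bits_of(qnan) & 0xFFFFFFFFFFFFFULL));
}
```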
+-0.0 is a special case of the subnormal numbers, with the minimum exponent value (encoded as 0) and significand=0. Biased exponent = 0 implies a leading 0 for the significand, instead of the usual implicit 1. Other significand values with that exponent encode tiny non-zero subnormal numbers, allowing gradual underflow. This special case takes another exponent value away from "normal" numbers.
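To illustrate (a sketch assuming IEEE binary64), the all-zeros exponent field covers both +0.0 and the subnormals, so the smallest bit-patterns walk through the gradual-underflow range:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static double double_of(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int main(void) {
    // Biased exponent = 0, significand = 0        ->  +0.0
    printf("%g\n", double_of(0x0000000000000000ULL));
    // Biased exponent = 0, significand = 1        ->  smallest subnormal, 2^-1074 (~4.94e-324)
    printf("%g\n", double_of(0x0000000000000001ULL));
    // Biased exponent = 0, all-ones significand   ->  largest subnormal, just below 2^-1022
    printf("%g\n", double_of(0x000FFFFFFFFFFFFFULL));
    // Biased exponent = 1, significand = 0        ->  smallest *normal* number, 2^-1022
    printf("%g\n", double_of(0x0010000000000000ULL));
}
```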
So 0.0 is represented by an all-zero bit-pattern, which is very convenient: memory is commonly initialized with integer zero, and it makes it possible to zero arrays of double with memset (which only accepts a 1-byte pattern, not the 4- or 8-byte pattern you'd need to initialize an array with any other repeating double value).
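For example (a sketch of that memset point; the array size here is arbitrary):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    double a[4];
    // All-bytes-zero is exactly the bit-pattern of +0.0 in IEEE binary64,
    // so a plain memset gives a valid array of 0.0 doubles.
    memset(a, 0, sizeof a);
    for (int i = 0; i < 4; i++)
        printf("%g ", a[i]);    // prints: 0 0 0 0
    printf("\n");

    // memset can only replicate a single byte, so there's no equivalent
    // one-liner to fill the array with e.g. 1.0 (bit-pattern 0x3FF0000000000000).
}
```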