
Recently, I have been interested in using bit shifts on floating-point numbers to do some fast calculations.

To make them work more generically, I would like my functions to work with different floating-point types, probably through templates: not limited to float and double, but also "half-width" or "quadruple-width" floating-point numbers and so on.


Then I noticed:

 - Half   ---  5 exponent bits  ---  10 significant bits
 - Float  ---  8 exponent bits  ---  23 significant bits
 - Double --- 11 exponent bits  ---  52 significant bits

So far I figured exponent bits = log2(total bytes) * 3 + 2,
which would mean a 128-bit float should have 14 exponent bits, and a 256-bit float should have 17 exponent bits.


However, then I learned:

 - Quad    --- 15 exponent bits  ---  112 significant bits
 - Octuple --- 19 exponent bits  ---  237 significant bits

So, is there a formula to find it at all? Or is there a way to query it through some built-in functions?
C or C++ are preferred, but open to other languages.

Thanks.

Eric Postpischil
Ranoiaetep
  • Note: No guarantees, at least none from the C++ Standard. What the implementation uses for the encoding of floating point is left up to the implementors, but will probably be defined in the IEEE floating-point standards. IEEE 754 is common. [wiki page for IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) – user4581301 Jun 29 '20 at 04:41
  • What would prevent you from hardcoding the counts directly into your program, given all of them are standardized? – vmt Jun 29 '20 at 04:41
  • Anyone can define their own floating point formats and allocate the bits how they see fit. About IEEE-754 formats specifically, there are several good pointers [here](https://stackoverflow.com/questions/40775949/why-do-higher-precision-floating-point-formats-have-so-many-exponent-bits/) about the history and choices. – dxiv Jun 29 '20 at 04:51
  • In IEEE 754 double has 53 significant bits. You're ignoring the hidden bit. – john Jun 29 '20 at 04:54
  • What makes you think you can do these bit twiddling optimizations on floating point better than the compiler can with full retail optimizations turned on? – selbie Jun 29 '20 at 05:14
  • @vmt the reason I want to see if there's a formula is so that if a 512-bit float is put in as standard, it would automatically work with it, without the need of altering anything. – Ranoiaetep Jun 29 '20 at 05:32
  • @Ranoiaetep given the rarity of even 256 bit floats, I'd say once you want/need to add support, it should be trivial to hardcode a new exponent bit count for the type in addition to the actual implementation – vmt Jun 29 '20 at 06:06
  • The newest hardware has `bfloat16`, which has only 7 significant bits. There's no logical pattern. – MSalters Jun 29 '20 at 06:54
  • Please note that, even when you get the bit counts right, you don't necessarily get the byte order right as well. If a CPU stores integers in little endian format, there is no guarantee that floats are not stored in big endian, or vice versa. Or mixed endianness. Even if you avoid the UB due to strict aliasing rules, you are going to have implementation defined behavior and will need to check every single architecture that you support. – cmaster - reinstate monica Jun 29 '20 at 12:31
  • `DBL_MANT_DIG` provides the number of bits in the significand of a base-2 `double`. Usable at pre-processor time. – chux - Reinstate Monica Jun 29 '20 at 14:52
  • C and C++ don't require IEEE-754. Other floating-point formats may have different number of bits for the exponent and significand. See [What uncommon floating-point sizes exist in C++ compilers?](https://stackoverflow.com/q/38509009/995714) – phuclv Sep 20 '21 at 10:55

4 Answers


Characteristics Provided Via Built-In Functions

C++ provides this information via the std::numeric_limits template:

#include <iostream>
#include <limits>
#include <cmath>


template<typename T> void ShowCharacteristics()
{
    int radix = std::numeric_limits<T>::radix;

    std::cout << "The floating-point radix is " << radix << ".\n";

    std::cout << "There are " << std::numeric_limits<T>::digits
        << " base-" << radix << " digits in the significand.\n";

    int min = std::numeric_limits<T>::min_exponent;
    int max = std::numeric_limits<T>::max_exponent;

    std::cout << "Exponents range from " << min << " to " << max << ".\n";
    std::cout << "So there must be " << std::ceil(std::log2(max-min+1))
        << " bits in the exponent field.\n";
}


int main()
{
    ShowCharacteristics<double>();
}

Sample output:

The floating-point radix is 2.
There are 53 base-2 digits in the significand.
Exponents range from -1021 to 1024.
So there must be 11 bits in the exponent field.

C also provides the information, via macro definitions like DBL_MANT_DIG defined in <float.h>, but the standard defines the names only for types float (prefix FLT), double (DBL), and long double (LDBL), so the names in a C implementation that supported additional floating-point types would not be predictable.

Note that the exponent as specified in the C and C++ standards is one off from the usual exponent described in IEEE-754: It is adjusted for a significand scaled to [½, 1) instead of [1, 2), so it is one greater than the usual IEEE-754 exponent. (The example above shows the exponent ranges from −1021 to 1024, but the IEEE-754 exponent range is −1022 to 1023.)

Formulas

IEEE-754 does provide formulas for recommended field widths, but it does not require IEEE-754 implementations to conform to these, and of course the C and C++ standards do not require C and C++ implementations to conform to IEEE-754. The interchange format parameters are specified in IEEE 754-2008 3.6, and the binary parameters are:

  • For a floating-point format of 16, 32, 64, or 128 bits, the significand width (including leading bit) should be 11, 24, 53, or 113 bits, and the exponent field width should be 5, 8, 11, or 15 bits.
  • Otherwise, for a floating-point format of k bits, k should be a multiple of 32, the significand width should be k − round(4•log2(k)) + 13, and the exponent field width should be round(4•log2(k)) − 13.
Eric Postpischil
    +1 for tracking down the reference. May be worth noting at the last point that the formula is only meant for `k >= 128` (it does in fact match the `11` bits for `k = 64`, too, but it is off by `1` and `2` for `k = 32, 16`). – dxiv Jun 29 '20 at 14:33

I want to see if there's a formula so that if a 512-bit float is put in as standard, it would automatically work with it, without the need of altering anything

I don't know of a published standard that guarantees the bit allocation for future formats (*). Past history shows that several considerations factor into the final choice, see for example the answer and links at Why do higher-precision floating point formats have so many exponent bits?.
(*) EDIT: see note added at the end.

For a guessing game, the existing 5 binary formats defined by IEEE-754 hint that the number of exponent bits grows slightly faster than linear. One (random) formula that fits these 5 data points could be for example (in WA notation) exponent_bits = round( (log2(total_bits) - 1)^(3/2) ).

[plot of the interpolation formula against the five known (format width, exponent bits) data points]

This would foresee that a hypothetical binary512 format would assign 23 bits to the exponent, though of course IEEE is not bound in any way by such second-guesses.

The above is just an interpolation formula that happens to match the 5 known exponents, and it is certainly not the only such formula. For example, searching for the sequence 5,8,11,15,19 on OEIS finds 18 listed integer sequences that contain it as a subsequence.


[ EDIT ]   As pointed out in @EricPostpischil's answer, IEEE 754-2008 does in fact list the formula exponent_bits = round( 4 * log2(total_bits) - 13 ) for total_bits >= 128 (the formula actually holds for total_bits = 64, too, though not for 32 or 16).

The empirical formula above matches the reference IEEE one for 128 <= total_bits <= 1472, in particular IEEE also gives 23 exponent bits for binary512 and 27 exponent bits for binary1024.

dxiv
  • The formula has to be asymptotically linear though, regardless of how it starts out... – Mad Physicist Dec 30 '21 at 00:18
  • @MadPhysicist It has to be *at least* linear for the product of 2 b-bits values to fit into a (b+1)-bit one, which was a rationale quoted in the [linked](https://stackoverflow.com/a/40789013/5538420) answer. Other than that, I don't see a hard requirement that it be strictly linear, unless additional constraints are imposed. – dxiv Dec 30 '21 at 00:57
  • It has to be at most linear asymptotically if you want to have any fractional part left as you go to infinity. – Mad Physicist Dec 30 '21 at 02:03
  • @MadPhysicist I meant linear in `log2(b)`. For example, the empirical formula above has faster than log-linear growth for both exponent and mantissa bits. – dxiv Dec 30 '21 at 02:20

The answer is no.

How many bits to use (or even which representation to use) is decided by compiler implementers and committees. And there's no way to guess what a committee decided (and no, it's not the "best" solution for any reasonable definition of "best"... it's just what happened that day in that room: an historical accident).

If you really want to get down to that level you need to actually test your code on the platforms you want to deploy to and add in some #ifdef macrology (or ask the user) to find which kind of system your code is running on.

Also beware that in my experience one area in which compilers are extremely aggressive (to the point of being obnoxious) about type aliasing is with floating point numbers.

6502
  • The last paragraph cannot be stressed enough: You cannot manipulate floating point bits without either invoking UB or going through at least one `memcpy()` call, two if you actually need the result. Even `union` is not enough to make the type punning defined. – cmaster - reinstate monica Jun 29 '20 at 12:22
  • @cmaster-reinstatemonica You're talking about C++, right? In C I believe it's still possible to access the bits of a floating-point (or other) quantity by using a `char *` pointer. – Steve Summit Oct 17 '22 at 13:02
  • @SteveSummit: yes you can. However like I said in the last part of the post I've run into problems in the past for a `malloc` address being returned in a `double` (after casting it to `intptr_t`, on a platform where this cannot lose precision). Somehow g++ thought that an address can't be returned using a floating point number and decided to "optimize away" the entire `malloc` call. This was clearly a bug in my opinion (and indeed was later fixed) but there was some discussion about if a C++ compiler is allowed to do that. I expect other compilers to use the same (IMO insane) reasoning too. – 6502 Oct 17 '22 at 13:16
  • @SteveSummit No, I'm talking about C. More precisely: I'm talking about turning a `float` into an `int32_t`, or a `double` into an `int64_t` for bit manipulation. It is true that you can cast a `float*` into an `unsigned char*` and then manipulate the bits byte by byte. However, unless you are only interested in a single bit, this is tedious. And as such, I took it for granted that the ultimate goal is to get access to the float bits via a single integer. And that does indeed require at least one `memcpy()`. – cmaster - reinstate monica Oct 17 '22 at 14:40

UPDATE : I've now incorporated this into a single unified function that lines up with the official formula, while using the proper exponents for the 16- and 32-bit formats, and showing how the bits are split between sign bit, exponent bits, and mantissa bits.

Inputs can be the number of bits (e.g. 64), a ratio like "2x", or case-insensitive single letters:

  • "S" for 1x single, "D" for 2x double,
  • "Q" for 4x quadruple, "O" for 8x "octuple",
  • "X" for 16x ("heX"), "T" for 32x ("Thirty-two"),
  • all other, missing, or invalid inputs default to 0.5x half-precision.

gcat <( jot 20 | mawk '$!(_=NF)=(_+_)^($_)' ) \
     <( jot - -1 8 | mawk '$!NF =(++_+_)^$(_--)"x"' ) | 

{m,g}awk '

function _754(__,_,___) {
    return \
    (__=(__==___)*(_+=_+=_^=_<_) ? _--^_++ : ">"<__ ? \
        (_+_)*(_*_/(_+_))^index("SDQOXT", toupper(__)) : \
        __==(+__ "") ? +__ : _*int(__+__)*_)<(_+_) \
                               \
    ? "_ERR_754_INVALID_INPUT_" \
    : "IEEE-754-fp:" (___=__) "-bit:" (_^(_<_)) "_s:"(__=int(\
       log((_^--_)^(_+(__=(log(__)/log(--_))-_*_)/_-_)*_^(\
        -((++_-__)^(--_<__) ) ) )/log(_))) "_e:" (___-++__) "_m"
}
function round(__,_) {
    return \
    int((++_+_)^-_+__)
}
function _4xlog2(_) {
    return (log(_)/log(_+=_^=_<_))*_*_
}
BEGIN { CONVFMT = OFMT = "%.250g"
}
( $++NF = _754(_=$!__) ) \
( $++NF = "official-expn:" \
          +(_=round(_4xlog2(_=_*32^(_~"[0-9.]+[Xx]")))-13) < 11 ? "n/a" :_) | 

column -s':' -t | column -t | lgp3 5 


2        _ERR_754_INVALID_INPUT_  n/a
4        _ERR_754_INVALID_INPUT_  n/a
8        IEEE-754-fp              8-bit        1_s  2_e   5_m        n/a
16       IEEE-754-fp              16-bit       1_s  5_e   10_m       n/a
32       IEEE-754-fp              32-bit       1_s  8_e   23_m       n/a

64       IEEE-754-fp              64-bit       1_s  11_e  52_m       11
128      IEEE-754-fp              128-bit      1_s  15_e  112_m      15
256      IEEE-754-fp              256-bit      1_s  19_e  236_m      19
512      IEEE-754-fp              512-bit      1_s  23_e  488_m      23
1024     IEEE-754-fp              1024-bit     1_s  27_e  996_m      27

2048     IEEE-754-fp              2048-bit     1_s  31_e  2016_m     31
4096     IEEE-754-fp              4096-bit     1_s  35_e  4060_m     35
8192     IEEE-754-fp              8192-bit     1_s  39_e  8152_m     39
16384    IEEE-754-fp              16384-bit    1_s  43_e  16340_m    43
32768    IEEE-754-fp              32768-bit    1_s  47_e  32720_m    47

65536    IEEE-754-fp              65536-bit    1_s  51_e  65484_m    51
131072   IEEE-754-fp              131072-bit   1_s  55_e  131016_m   55
262144   IEEE-754-fp              262144-bit   1_s  59_e  262084_m   59
524288   IEEE-754-fp              524288-bit   1_s  63_e  524224_m   63
1048576  IEEE-754-fp              1048576-bit  1_s  67_e  1048508_m  67

0.5x     IEEE-754-fp              16-bit       1_s  5_e   10_m       n/a
1x       IEEE-754-fp              32-bit       1_s  8_e   23_m       n/a
2x       IEEE-754-fp              64-bit       1_s  11_e  52_m       11
4x       IEEE-754-fp              128-bit      1_s  15_e  112_m      15
8x       IEEE-754-fp              256-bit      1_s  19_e  236_m      19

16x      IEEE-754-fp              512-bit      1_s  23_e  488_m      23
32x      IEEE-754-fp              1024-bit     1_s  27_e  996_m      27
64x      IEEE-754-fp              2048-bit     1_s  31_e  2016_m     31
128x     IEEE-754-fp              4096-bit     1_s  35_e  4060_m     35
256x     IEEE-754-fp              8192-bit     1_s  39_e  8152_m     39

===============================================

Similar in concept to the above, here's an alternative formula (just re-arranging some terms) that calculates the unsigned integer range of the exponent ([32, 256, 2048, 32768, 524288], corresponding to [5, 8, 11, 15, 19] powers of 2) without needing to call the round function:

uint_range =  ( 64 **  ( 1 + (k=log2(bits)-4)/2) )
              *
              (  2 ** -(  (3-k)**(2<k)         ) ) 

(a) x ** y means x-to-the-y-power
(b) 2 < k is a boolean condition that returns 0 or 1.

The formula is accurate from 16-bit to 256-bit, at least. Beyond that, it yields exponent sizes of

   –  512-bit : 23 
   – 1024-bit : 27 
   – 2048-bit : 31 
   – 4096-bit : 35 

(Beyond 256 bits the predictions may be inaccurate. Even a 27-bit-wide exponent allows exponents of roughly ±67 million, which means over 40 million decimal digits once you calculate 2 to that power.)

From there, getting the IEEE 754 exponent width is just a matter of taking log2(uint_range).

RARE Kpop Manifesto