Numpy's dtype documentation only shows "x bits exponent, y bits mantissa" for each float type, but I couldn't translate that to exactly how many digits before/after the decimal point. Is there any simple formula/table to look this up in?

- [This](https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_2/Fundamentals_of_data_representation/Floating_point_numbers) is a note on what the exponent and mantissa do in decimal; in binary everything is the same, just in base 2 instead of base 10. I think you can figure it out from there, since you are the "mathguy". (Hint: translate the upper and lower limits to decimal representation and see how many digits you get.) – campovski Jun 09 '19 at 13:19
- It's not a dumb question at all, but the answer is complicated, and depends on how you're going to use the information. For example, the IEEE 754 `binary64` type can faithfully represent any not-too-large not-too-small decimal value with 15 or fewer significant digits, but to represent a `binary64` value faithfully in decimal requires 17 decimal digits. There are arguments to be made for various different values in the range 15-17. – Mark Dickinson Jun 09 '19 at 13:23
- [`np.finfo`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html) should give you all you need to know. – Paul Panzer Jun 09 '19 at 13:23
- Thinking of a binary floating-point as “containing” decimal digits is hazardous because that is not how the mathematics works. Although it may be possible to effectively store a certain number of decimal digits in a floating-point number and get them back out because the format is precise enough to support that, arithmetic done with the numbers will use binary, and that will change the numbers in “non-decimal” ways, and you cannot expect to get an answer with the same number of decimal digits you would get with decimal arithmetic. – Eric Postpischil Jun 09 '19 at 23:32
2 Answers
This is not as simple as usually expected. For the precision of the mantissa, there are generally two values:
1. Given a value in decimal representation, how many decimal digits are guaranteed to be preserved if it is converted from decimal to a selected binary format and back (with default rounding)?
2. Given a value in binary format, how many decimal digits are needed if the value is converted to decimal and back to the original binary format (again, with default rounding) so that the original value comes through unchanged?
In both cases, the decimal representation is treated as independent of the exponent used, without leading and trailing zeros (for example, 0.0123e4, 1.23e2, 1.2300e2, 123, 123.0, and 123000.000e-3 all count as 3 digits).
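These two definitions can be illustrated with a small sketch in Python, using `numpy.float32` as the binary format (the sample values here are arbitrary):

```python
import numpy as np

# Definition 1: any 6-significant-digit decimal survives a
# decimal -> float32 -> decimal round trip (digits10 == 6 for float32).
d = "1.23456"
restored = f"{float(np.float32(d)):.6g}"
print(restored)  # -> 1.23456, the original 6-digit decimal

# Definition 2: 9 significant decimal digits are always enough to
# round-trip a float32 value exactly (max_digits10 == 9 for float32).
x = np.float32(0.1)          # 0.1 is not exactly representable in binary
s = f"{float(x):.9g}"        # a guaranteed-safe decimal form
assert np.float32(s) == x    # bit-for-bit identical after the round trip
```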
For a 32-bit binary float, these two sizes are 6 and 9 decimal digits, respectively. In C's `<float.h>`, these are `FLT_DIG` and `FLT_DECIMAL_DIG`. (It is odd that a 32-bit float keeps 7 decimal digits for the vast majority of numbers, but there are exceptions.) In C++, look at `std::numeric_limits<float>::digits10` and `std::numeric_limits<float>::max_digits10`, respectively.
For a 64-bit binary float, these are 15 and 17 (`DBL_DIG` and `DBL_DECIMAL_DIG`, respectively; and `std::numeric_limits<double>::{digits10, max_digits10}`).
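Python's built-in `float` is this same binary64 type on essentially all platforms, and `sys.float_info` exposes the corresponding C constants directly:

```python
import sys

print(sys.float_info.dig)       # 15 -> DBL_DIG (digits10)
print(sys.float_info.mant_dig)  # 53 -> binary digits in the significand
```

(`sys.float_info` has no field for `DBL_DECIMAL_DIG`; that value, 17, follows from the formula below.)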
General formulas for them (thanks to @MarkDickinson):

- `{format}_DIG` (`digits10`): `floor((p-1)*log10(2))`
- `{format}_DECIMAL_DIG` (`max_digits10`): `ceil(1+p*log10(2))`

where `p` is the number of binary digits in the mantissa (including the hidden one in the normalized IEEE 754 case).
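The two formulas are easy to check numerically. A sketch, where the `p` values are the significand sizes (hidden bit included) of IEEE 754 binary16/32/64/128:

```python
from math import ceil, floor, log10

def digits10(p):
    # {format}_DIG: decimal digits guaranteed to survive decimal -> binary -> decimal
    return floor((p - 1) * log10(2))

def max_digits10(p):
    # {format}_DECIMAL_DIG: decimal digits needed to round-trip binary -> decimal -> binary
    return ceil(1 + p * log10(2))

for name, p in [("binary16", 11), ("binary32", 24), ("binary64", 53), ("binary128", 113)]:
    print(name, digits10(p), max_digits10(p))
# binary16 3 5, binary32 6 9, binary64 15 17, binary128 33 36
```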
Also, the comments at the C++ numeric limits page give some mathematical explanation:

> The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
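The counterexample from that quote is easy to reproduce, here with numpy's float32:

```python
import numpy as np

x = 8.589973e9                    # 7 significant decimal digits
roundtrip = float(np.float32(x))  # decimal -> binary32 -> decimal
print(f"{roundtrip:.6e}")         # -> 8.589974e+09, not 8.589973e+09
```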
Look in those comments for the values for 16- and 128-bit floats (but see below for what a "128-bit float" really is in practice).
For the exponent this is simpler, because each of the border values (minimum normalized, minimum denormalized, maximum represented) is exact and can easily be obtained and printed.
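`numpy.finfo` reports those border values directly; for example, for float32:

```python
import numpy as np

fi = np.finfo(np.float32)
print(fi.tiny)  # smallest positive normalized value, 2**-126 ~ 1.1754944e-38
print(fi.max)   # largest finite value, (2 - 2**-23) * 2**127 ~ 3.4028235e+38
```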
@PaulPanzer suggested [`numpy.finfo`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html). It gives the first of these values (`{format}_DIG`); maybe it is the thing you are searching for:
```python
>>> numpy.finfo(numpy.float16).precision
3
>>> numpy.finfo(numpy.float32).precision
6
>>> numpy.finfo(numpy.float64).precision
15
>>> numpy.finfo(numpy.float128).precision
18
```
But on most systems (mine was Ubuntu 18.04 on x86-64) the value is confusing for float128: it is really for the 80-bit x86 "extended" float, which has a 64-bit significand. A real IEEE 754 float128 has 112 significand bits, so the value would be around 33, but numpy presents another type under this name. See here for details: in general, float128 is a delusion in numpy.
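You can see which significand size numpy actually uses through `finfo.nmant` (the number of explicitly stored significand bits, excluding the hidden bit). A sketch; the last line is platform dependent, which is exactly the point:

```python
import numpy as np

for t in (np.float16, np.float32, np.float64):
    print(t.__name__, np.finfo(t).nmant)  # 10, 23, 52: the IEEE stored bits

# On x86 Linux this prints 63, revealing the 80-bit "extended" format;
# a true IEEE 754 binary128 would report 112 here.
print(np.finfo(np.longdouble).nmant)
```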
UPD3: you mentioned float8. There is no such type in the IEEE 754 set. One could imagine such a type for some utterly specific purposes, but its range would be too narrow for any universal usage.

- The C standard has the relevant formulas in it: see e.g., C11 5.2.4.2.2p11. Those formulas, for a given binary precision p, are `floor((p-1)*log10(2))` and `1 + ceil(p * log10(2))`. For IEEE 754 binary16 the relevant numbers are `3` and `5`; for IEEE 754 binary128 they're `33` and `36`. – Mark Dickinson Jun 09 '19 at 15:51
- The general condition for precision-p base-B floating-point to round-trip through precision-q base-D floating-point (assuming that `B` and `D` aren't powers of a common base) is that `B**p <= D**(q-1)`; both of the formulas in the C standard can be derived from this. – Mark Dickinson Jun 09 '19 at 15:53
- _"but, on my system ... the value is wrong for float128"_ Nope, what's "wrong" here is the name `float128`, which is in fact a `float80`; the 128 refers to the fact that it is "padded" to 128 bits for the sake of alignment - fell for that [myself](https://stackoverflow.com/q/55096575/7207392) – Paul Panzer Jun 09 '19 at 16:55
- @PaulPanzer thanks, integrated this into the reply. As for "float128 is in fact a float80": that is the kind of thing that confuses anybody not intimately familiar with these specifics, so it should be regularly pointed out. – Netch Jun 09 '19 at 19:35
- With actual operations you can also have problems when combining extremely large and small numbers, even if you only have 1 or 2 significant digits in each and use an IEEE float64. The trouble is truncated results. IEEE float64 has a 52-bit mantissa; if you add two numbers whose exponents are separated by more than 52, the smaller number will be entirely lost, e.g. 2^51 + 2^-2. Even in a loop, `A=2^26, B=2^-27, while T (A = A+B)`, A will never grow. If the exponents are 41 apart then you get the preservation of 52-41 = an 11-bit mantissa (the same 3/5 decimal digits as float16). – Max Power Oct 10 '21 at 01:18
- Correction: float64 has 52+1 [53] mantissa bits, 1 sign bit, and 11 exponent bits. You can trade resolution for range (or vice versa) within a float of a given width, but this is custom for corner cases. E.g. 16-bit floats can be found as 11m5e (used for graphics) and 8m8e (used for machine learning); both are intended to retain range and reduce CPU, bus, and memory overhead at the sacrifice of resolution over int or float32. 8m8e can provide at best 0.4% resolution (1/(2^8)) but has the 2^[+-]255 range of a float32 (~10^77); 11m5e has at best 0.05% resolution (1/(2^11)) but only a 2^[+-]31 range (4 billion). – Max Power Oct 10 '21 at 02:04
- 8-bit floats are reasonably common in AI and there are multiple standardisation efforts from major hardware players. Most common seems to be 1.4.3 with 4-bit exponent and 3-bit mantissa. This gives you 0 and 2 for your respective precision constants. It seems 1sf might be asking too much of it! – Dannie Nov 09 '22 at 14:52
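The absorption effect described in the comments above is easy to demonstrate with plain Python floats (which are IEEE binary64):

```python
# 2**53 is the first integer at which the spacing between adjacent
# binary64 values grows to 2, so adding 1 has no effect at all.
big = 2.0**53
print(big + 1.0 == big)  # True: the 1.0 is entirely absorbed

# Below that threshold the addition still registers.
print(2.0**52 + 1.0 == 2.0**52)  # False
```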
To keep it simple: the number of decimal digits of precision is not a single fixed value per format; it depends on the value and on the direction of conversion, so it is usually quoted as a range. Generally:
| Data type | Precision (decimal digits) |
|-----------|----------------------------|
| float16   | 3 to 5                     |
| float32   | 6 to 9                     |
| float64   | 15 to 17                   |
| float128  | 33 to 36                   |
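The float64 range can be seen with plain Python floats: 17 significant digits always round-trip, while fewer sometimes do not. A sketch:

```python
x = 0.1 + 0.2                    # the classic 0.30000000000000004

assert float(f"{x:.17g}") == x   # 17 digits: exact round trip
assert float(f"{x:.16g}") != x   # 16 digits: "0.3" parses to a different double
print(f"{x:.17g}")               # -> 0.30000000000000004
```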
Bitwise properties:

- float16: 1 sign bit, 5 exponent bits, 10-bit significand (fractional part).
- float32: 1 sign bit, 8 exponent bits, 23-bit significand (fractional part).
- float64: 1 sign bit, 11 exponent bits, 52 fraction bits.
- float128: 1 sign bit, 15 exponent bits, 112 fraction bits.
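These bit fields can be pulled apart with the standard library; a sketch for float64 using `struct` (the sample value -1.5 is arbitrary):

```python
import struct

def float64_fields(x):
    # Reinterpret the 8 bytes of a double as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits; the hidden bit is not stored
    return sign, exponent, fraction

# -1.5 = (-1)^1 * 1.1_binary * 2^0 -> sign 1, biased exponent 1023,
# and a fraction with only its top bit set (2**51).
print(float64_fields(-1.5))  # -> (1, 1023, 2251799813685248)
```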

- I don't believe that float128 has only 18 decimal digits of precision. – Pascal Cuoq May 16 '23 at 10:11
- The answer is edited, thank you for pointing that out; I included both the lower and upper limits. – SREERAG R NANDAN May 17 '23 at 11:16