How to know how floating-point data are stored in a C++ program?
If I assign the number 1.23 to a double object for example, how can I know how this number is encoded?

-
What's stopping you from looking at the individual bytes, either using a debugger or by writing them out? – Sam Varshavchik Oct 31 '19 at 01:54
-
Possible duplicate of [Converting from double to hexadecimal back to double using union](https://stackoverflow.com/questions/46668785/converting-from-double-to-hexadecimal-back-to-double-using-union) – kaya3 Oct 31 '19 at 01:55
-
@Sam Okay, if I looked at the individual bytes, how could I know what method was used to generate these bytes? – Mason Oct 31 '19 at 01:58
-
Have you tried a Google search? A simple search brings up a wealth of information that explains standard floating point encoding formats. All modern computer hardware uses the same floating point format. – Sam Varshavchik Oct 31 '19 at 02:02
-
@SamVarshavchik: A “wealth of information” about the standard floating-point formats does not state which format is used in a particular C++ implementation. The question asks how to **know**, not how to **assume**. Nor does the fact that something can be found with a Google search mean it is ineligible or inappropriate for documenting via question-and-answer on Stack Overflow. – Eric Postpischil Oct 31 '19 at 11:13
-
@Mason are you asking how you can figure out which [floating-point format](http://www.quadibloc.com/comp/cp0201.htm) is implemented, which [endianness](https://en.wikipedia.org/wiki/Endianness) is used, or something else? It is a bit unclear what you are really asking. – kvantour Oct 31 '19 at 12:53
2 Answers
The compiler will use the encoding that is used by the CPU architecture that you are compiling for. (Unless that architecture doesn't support floating point, in which case the compiler would probably choose the encoding used by the software library that emulates it.)
The vendor that designed the CPU architecture should document the encoding that the CPU uses. You can know what the documentation says by reading it.
The IEEE 754 standard is fairly ubiquitous.

The official way to know how floating-point data are encoded is to read the documentation of the C++ implementation, because the 2017 C++ standard says, in 6.9.1 “Fundamental types” [basic.fundamental], paragraph 8, draft N4659:
> … The value representation of floating-point types is implementation-defined…
“Implementation-defined” means the implementation must document it (3.12 “implementation-defined behavior” [defns.impl.defined]).
The C++ standard appears to be incomplete in this regard, as it says “… the value representation is a set of bits in the object representation that determines a value…” (6.9 “Types” [basic.types] 4) and “The object representation of an object of type `T` is the sequence of *N* `unsigned char` objects taken up by the object of type `T`,…” (ibid), but I do not see that it says the implementation must define which of the bits in the object representation are the value representation, or in which order/mapping. Nonetheless, the responsibility of informing you about the characteristics of the C++ implementation lies with the implementation and the implementors, because no other party can do it. (That is, the implementors create the implementation, and they can do so in arbitrary ways, so they are the ones who determine what the characteristics are, so they are the source of that information.)
The C standard defines some mathematical characteristics of floating-point types and requires implementations to describe them in `<float.h>`. C++ inherits these in `<cfloat>` (C++ 20.5.1.2 “Header” [headers] 3-4). C 2011 5.2.4.2.2 “Characteristics of floating types `<float.h>`” defines a model in which a floating-point number *x* equals

*x* = *s* · *b*^*e* · Σ (*f*_*k* · *b*^(−*k*)) for *k* = 1 to *p*,

where *s* is a sign (±1), *b* is the base or radix, *e* is an exponent between *e*min and *e*max, inclusive, *p* is the precision (number of base-*b* digits in the significand), and the *f*_*k* are base-*b* digits of the significand (nonnegative integers less than *b*). The floating-point type may also contain infinities and Not-a-Number (NaN) “values”, and some values are distinguished as normal or subnormal. Then `<float.h>` relates the parameters of this model:
- `FLT_RADIX` provides the base, *b*.
- `FLT_MANT_DIG`, `DBL_MANT_DIG`, and `LDBL_MANT_DIG` provide the number of significand digits, also known as the precision, *p*, for the `float`, `double`, and `long double` types, respectively.
- `FLT_MIN_EXP`, `DBL_MIN_EXP`, `LDBL_MIN_EXP`, `FLT_MAX_EXP`, `DBL_MAX_EXP`, and `LDBL_MAX_EXP` provide the minimum and maximum exponents, *e*min and *e*max.
In addition to providing these in `<cfloat>`, C++ provides them in the `numeric_limits` template defined in the `<limits>` header (21.3.4.1 “numeric_limits members” [numeric.limits.members]) as `radix` (*b*), `digits` (*p*), `min_exponent` (*e*min), and `max_exponent` (*e*max). For example, `std::numeric_limits<double>::digits` gives the number of digits in the significand of the `double` type. That template includes other members that describe the floating-point type, such as whether it supports infinities, NaNs, and subnormal values.
These provide a complete description of the mathematical properties of the floating-point format. However, as stated above, C++ appears to fail to specify that the implementation should document how the value bits that represent a type appear in the object bits.
Many C++ implementations use the IEEE-754 basic 32-bit binary format for `float` and the 64-bit format for `double`, and the value bits are mapped to the object bits in the same way as for integers of the corresponding width. If so, for normal numbers, the sign *s* is encoded in the most significant bit (0 or 1 for +1 or −1, respectively), the exponent *e* is encoded using the biased value *e*+126 (`float`) or *e*+1022 (`double`) in the next 8 (`float`) or 11 (`double`) bits, and the remaining bits contain the digits *f*_*k* for *k* from 2 to *p*. The first digit, *f*_1, is 1 for normal numbers. For subnormal numbers, the exponent field is zero, and *f*_1 is 0. (Note the biases here are 126 and 1022 instead of the 127 and 1023 used in IEEE-754 because the C model expresses the significand using *b*^(−*k*) instead of *b*^(1−*k*) as is used in IEEE-754.) Infinities are encoded with all ones in the exponent field and all zeros in the significand field. NaNs are encoded with all ones in the exponent field and not all zeros in the significand field.

-
I believe that the only thing missing here is [Endianness](https://en.wikipedia.org/wiki/Endianness). This [question](https://stackoverflow.com/questions/2100331/) tells you how you can figure that one out. – kvantour Oct 31 '19 at 12:28
-
@kvantour: As stated, “C++ appears to fail to specify that the implementation should document how the value bits that represent a type appear in the object bits.” – Eric Postpischil Oct 31 '19 at 12:30
-
Indeed, but it seems that the next paragraph tries to explain this particular sentence by giving an example of the most common case. It mentions that the most significant bit represents the sign. This is true for both little- and big-endian. However, the location of this bit in the byte-sequence is entirely different. If you meant the most-significant bit in the byte-sequence, it would imply you are explaining big-endian, but little-endian is more common. (at least that is my understanding of your last paragraph and why I wrote the comment) – kvantour Oct 31 '19 at 12:42
-
Thanks for answering! About the last part: if the implementation uses the IEEE-754 binary32 and binary64 formats for `float` and `double`, must the decimal number I assign to a `float` or a `double` object be converted to a binary one before getting encoded? If yes, is there a standard way for this conversion? – Mason Nov 01 '19 at 06:52
-
@Mason: Consider a numeral like “1.23” that is being converted to floating-point. It represents some number *x*. If *x* is inside the range of values representable in the floating-point format, it is either some value that is exactly representable, or it is between two representable values, *a* and *b*. The C++ standard requires that, if *x* is exactly representable, the result of conversion be *x*. Otherwise, it may be *a* or *b*. – Eric Postpischil Nov 01 '19 at 11:07
-
@Mason: This is per C++ 2017 5.13.4 “Floating literals” [lex.fcon] 1: “If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…” High-quality C++ implementations will return whichever of *a* or *b* is closest to *x*, rounding ties to the one with the even low bit. – Eric Postpischil Nov 01 '19 at 11:08