
I have some C++11 code, produced by a code generator, that contains a large array of floats, and I want to make sure that the compiled values are precisely the same as the values in the generator (assuming that both rely on the same floating-point standard, e.g. IEEE 754).

So I figured the best way to do it is to store the values as hex representations and interpret them as floats in the code.

Edit for Clarification: The code generator takes the float values and converts them to their corresponding hex representations. The target code is supposed to convert back to float.

It looks something like this:

const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU };
float const* ptr = reinterpret_cast<float const*>(&data[0]);  // view the stored bit patterns as floats

This works and gives me access to all the data elements as floats, but I recently stumbled upon the fact that this is actually undefined behavior and only works because my compiler happens to resolve it the way I intended:

https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8

https://en.cppreference.com/w/cpp/language/reinterpret_cast

The standard basically says that reading an object through the result of a reinterpret_cast between pointers of different (unrelated) types is undefined behavior (the strict aliasing rule); the cast itself is allowed, but the read through it is not.

So basically I have three options:

  1. Use memcpy and hope that the compiler will be able to optimize it away

  2. Store the data not as hex-values but in a different way.

  3. Use std::bit_cast from C++20.

I cannot use 3) because I'm stuck with C++11.

I don't have the resources to store the data array twice, so I would have to rely on the compiler to optimize this. Due to this, I don't particularly like 1) because it could stop working if I changed compilers or compiler settings.
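For illustration, option 1) would look roughly like the sketch below. The helper name is my own; the memcpy itself is well defined, and compilers typically optimize the copy away, but that optimization is exactly what I would be relying on:

#include <cstring>

// Well-defined alternative to the reinterpret_cast above: copy the bytes.
inline float load_float(unsigned int bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);  // copies 4 bytes; usually compiled to a plain register move
    return f;
}

// usage: float x = load_float(data[0]);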

So that leaves me with 2):

Is there a standardized way to express float values in source code so that they map to the exact float value when compiled? Does the ISO float standard define this in a way that guarantees that any compiler will follow that interpretation? I imagine that if I deviate from what the compiler expects, I run the risk of getting the float "neighbor" of the number I actually want.

I'm also open to alternative ideas if there is an option 4) I forgot.

Cerno
  • https://en.wikipedia.org/wiki/Hexadecimal#Hexadecimal_exponential_notation https://en.wikipedia.org/wiki/IEEE_754#Hexadecimal_literals – KamilCuk Aug 16 '22 at 11:29
  • Do you want to ensure the value matches exactly what is written in the source code (and what do you want to do if the exact value does not exist)? Or do you want that the value does not change between different compilers and computers? – VLL Aug 16 '22 at 11:34
  • No, the C++ standard does not provide such guarantees - the notion of undefined behaviour giving assurances of consistency between implementations is a contradiction. If your implementation supports IEEE-754 floating point, you can probably do what you want - but your code would not be portable to machines which support other floating point representations (and possibly not to machines with different endianness if you are relying on reading integral values and doing `reinterpret_cast`). – Peter Aug 16 '22 at 11:36
  • What endianness is used to encode 0x3d13f407U? – KamilCuk Aug 16 '22 at 11:38
  • Re “Does the ISO float standard define this”: Any floating-point standard is irrelevant because you say the prerequisite is C++ 11, and C++ 11 does not require conformance to any floating-point standard. – Eric Postpischil Aug 16 '22 at 12:14
  • @VLL Both I guess. I would need the code generator to ensure that the code is generated in a way that makes sure that the exact value does exist. – Cerno Aug 16 '22 at 12:45
  • @Peter Good point. So far I haven't spent enough effort to make sure that the current code works across all machines that we are using, so just because it currently works, I have no guarantee that it will always work. I think I need to unit test this specifically. – Cerno Aug 16 '22 at 12:47
  • @KamilCuk Little-endian – Cerno Aug 16 '22 at 12:48
  • @EricPostpischil Of course it doesn't \*cries*. That is valuable information, thank you. – Cerno Aug 16 '22 at 12:49

3 Answers


How to express float constants precisely in source code

Use hexadecimal floating-point literals. Assuming some endianness for the hex values you presented:

float floats[] = { 0x1.27e80ep-5f, 0x1.44f108p-2f, -0x1.0e5bbap-3f };  // f suffixes make these float literals (the values are exact either way)
KamilCuk
  • Problem is, hex float literals were introduced in C++17, and the OP states "stuck with C++11". – Peter Aug 16 '22 at 11:57
  • @Peter Hex float literals were introduced in C99, so it's easy to work around that: just compile a C file and link it in (see the sketch after these comments). – phuclv Aug 16 '22 at 12:03
  • One option with C++11 is to use `std::stringstream` with `std::hexfloat` manipulator instead of hex literal directly. – Yksisarvinen Aug 16 '22 at 12:05
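A minimal sketch of phuclv's workaround, with placeholder file names: put the table in a C99 translation unit, where hex-float literals are standard, and expose it to the C++11 code with C linkage:

/* data.c -- compiled as C99; hex-float literals are standard here */
const float floats[3] = { 0x1.27e80ep-5f, 0x1.44f108p-2f, -0x1.0e5bbap-3f };

/* data.h -- included from both the C and the C++11 side */
#ifdef __cplusplus
extern "C" {
#endif
extern const float floats[3];
#ifdef __cplusplus
}
#endif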

If you have the generated code produce the full representation of the floating-point value—all of the decimal digits needed to show its exact value—then a C++ 11 compiler is required to parse the number exactly.

C++ 11 draft N3092 2.14.4 1 says, of a floating literal:

… The exponent, if present, indicates the power of 10 by which the significant [likely typo, should be “significand”] part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…

Thus, if the floating literal does not have all the digits needed to show the exact value, the implementation may round it either upward or downward, as the implementation defines. But if it does have all the digits, then the value represented by the floating literal is representable in the floating-point format, and so its value must be the result of the parsing.
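For illustration (this sketch is mine, not part of the answer), a generator could emit such exact literals by printing with generous precision. It assumes the C library's printf is correctly rounded at any requested precision, which is true of common implementations such as glibc; roughly 105 significant digits suffice to show the exact value of any finite IEEE-754 binary32 number, so 120 leaves margin:

#include <cstdio>

// Generator-side emitter: prints the exact decimal value of a finite float.
// The float -> double conversion is exact, and %.120g requests more digits
// than the longest exact binary32 decimal expansion needs.
void emit_exact(float f)
{
    std::printf("%.120gf,\n", static_cast<double>(f));
}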

Eric Postpischil
  • Of course, [as you know](https://stackoverflow.com/a/61614323/787480) that may be *quite a few digits*. – Sneftel Aug 16 '22 at 12:17
  • @Sneftel only 9 for `float` https://en.wikipedia.org/wiki/Single-precision_floating-point_format? You might be able to get more than 9 non-zero decimal digits from a `float` but you don't need more than 9 to uniquely identify the `float` – Alan Birtles Aug 16 '22 at 12:35
  • @AlanBirtles: No, not nine. The requirement is not to uniquely identify the floating-point value but to identify its exact value. That is what the text in the standard says: If the value indicated by the significand scaled by the power of ten is **representable**, meaning it is exactly one of the values representable in the floating-point format, it is the result. It does not say if the value indicated uniquely distinguishes a representable value, then that representable value is the result. – Eric Postpischil Aug 16 '22 at 12:39
  • Isn't the goal to store a particular `float` from a decimal literal? Surely if you convert the target `float` to a decimal with 9 significant digits then use that as the literal it is guaranteed that the stored `float` will be correct? The standard isn't clear but I imagine the "implementation-defined" part is how it chooses smaller or larger if the values are equally close, I'd hope it always uses the nearest of the two? – Alan Birtles Aug 16 '22 at 12:49
  • @AlanBirtles: The standard is clear: If the value of the literal (using real-number arithmetic to parse its significand and scale it by the exponent) is representable, that is the result. Otherwise, the result is an implementation-defined choice of larger or smaller. Hoping the implementation chooses the nearest is not engineering; it is not guaranteed, and OP predicates only C++ 11, not hoping. – Eric Postpischil Aug 17 '22 at 00:05
  • @AlanBirtles: The C++ 11 standard may have left this open due to the state of known or available efficient decimal-to-binary conversion algorithms. I am not sure how things stood in 2011, but keep in mind the standard committee is writing for flexibility in the `float`, `double`, and `long double` formats. Even if efficient correctly-rounded to-nearest algorithms were widely available for the IEEE-754 binary32 and binary64 (“single precision” and “double precision”) formats, they might not have been available for binary128. Or for other formats that C++ implementors are free to choose to use… – Eric Postpischil Aug 17 '22 at 00:09
  • … for `float`, `double`, and `long double`. So guaranteeing correct rounding for the binary32 and binary64 formats would require inserting conditions in the standard based on which formats were chosen. The consequence is the standard leaves wiggle room on this point. High-quality C++ implementations should provide correctly rounded to-nearest conversions, but it is not guaranteed by the standard. – Eric Postpischil Aug 17 '22 at 00:10
  • @AlanBirtles: For more insight, consider what is necessary to guarantee correct rounding. One has to either use extended-precision arithmetic in the compiler (which requires over a thousand digits for the binary64 format) or engineer complicated code and construct proofs that it is correct. In practice, what a medium-quality implementation may do is use a little extended precision that effectively uses the first *n* digits of a decimal numeral to make a decision and has a proof that that is sufficient that all exactly representable numbers will be converted to the exact value and all others… – Eric Postpischil Aug 17 '22 at 00:13
  • … to one of the two straddling values, possibly not correctly rounded, although most will be. Then, if you knew what *n* were for such an implementation, you would only have to have the generator produce *n* significant digits. But we do not always know what *n* is, and we cannot know if the literals are desired to work for any C++ 11 implementation. Except we do know that producing all of the digits will work. – Eric Postpischil Aug 17 '22 at 00:15
  • @EricPostpischil Your comment in bold was my main worry. So do I understand correctly that writing down the float with the precise number of digits would get me what I want, but adding a zero at the end might break exactness? It's exactly those nuances that made me hesitate to use plain floats in my code generator. A slight deviation from what is expected could lead to numerical errors that would be impossible to spot but may falsify my results. – Cerno Aug 18 '22 at 15:52
  • @Cerno: Appending a zero would not change the nominal value. It would still be exactly a representable value, so that value is the one that parsing must produce. – Eric Postpischil Aug 18 '22 at 16:15
  • @EricPostpischil Ok I misunderstood it then, but could you give an example of a non-representable value? I would understand if I accidentally removed a digit, but are there more nefarious cases that might not immediately be clear, like expressing the number using exponent notation? – Cerno Aug 19 '22 at 12:52
  • @Cerno: IEEE-754 binary32 can represent NaN, ±∞, and every number that is ±M•2^e where M is an integer 0 ≤ M < 2^24 and e is an integer −1022 ≤ e ≤ 1023. Everything else is not representable in the format. 0.1 and ⅓ are not representable in the format because they do not equal ±M•2^e for any allowed values of M and e. 0.375 is representable in the format because it is +3•2^−3. The strings “0.375”, “0.37500”, “3.75e-1”, “+3750000000000e-13”, and “.000000000000000003750000e17” all represent the real number 0.375, and 0.375 is representable in binary32, so they must all be parsed as that number. – Eric Postpischil Aug 19 '22 at 14:09
  • @EricPostpischil Thanks for the extensive answer. So I can represent the number any way I like as long as it is uniquely interpretable as the number in question, but if I remove a digit at the end of the float number, I would likely run into trouble? That alleviates some of my concerns at least. The remaining danger could probably be sufficiently addressed through proper testing. – Cerno Aug 20 '22 at 18:12
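To make Eric's 0.375 example concrete, every one of those spellings denotes the same representable value, so a conforming compiler must parse them identically; a compiler can even be made to check this (a tiny C++11 sketch of mine, not part of the answer):

// 0.375 is exactly representable (+3*2^-3), so all spellings must compare equal.
static_assert(0.375f == 3.75e-1f, "same representable value");
static_assert(0.375f == .000000000000000003750000e17f, "same representable value");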

I have read some very valuable information here and would like to throw in an option that does not strictly answer the question, but could be a solution.

It might be problematic, but if so, I would like to discuss it.

The simple solution would be: Leave it as it is.

A short rundown of why I am hesitant about the suggested options:

  • memcpy relies on the compiler to optimize away the actual copy and to understand that I only want to read the values. Since I have large arrays of data, I want to avoid the surprise of a changed compiler setting suddenly introducing increased runtime and requiring a fix on short notice.
  • bit_cast is only available from C++20. There are reference implementations but they basically use memcpy under the hood (see above).
  • Hex float literals are only available from C++17.
  • Directly writing the floats precisely... I don't know, it seems somewhat dangerous, because if I make a slight mistake I may end up with a data block that is slightly off, which could affect my classification results. A mistake like that would be a nightmare to spot.

So why do I think I can get away with an implementation that is, strictly speaking, undefined? The rationale is that the standard may not define it, but compiler vendors likely do; at least the ones I have worked with so far gave me exact results. The code has been running without major problems for a fairly long time, across dozens of code generator runs, and I would expect a failed reinterpret_cast to break the conversion so severely that I would spot it in my classification results right away.

Still not robust enough, though. So my idea was to write a unit test that contains a significant number of hex-floats, does the reinterpret_cast, and compares against reference float values for exact correspondence, to tell me if a setting or compiler fails in this regard.

I have one doubt though: Is it at all reasonable to assume that a failed reinterpret_cast would break things spectacularly, or are all bets off when it comes to undefined behavior?

I am a bit worried that the compiler implementation might define the undefined behavior in a way that picks a float that is close to the hex value instead of the precise one (although I would wonder why), and that this happens only sporadically, so that my unit test misses the problem.

So the endgame would be to unit-test every single data entry against the corresponding reference float. Since the code is generated, I can generate the test as well. That should put all my worries to rest and make sure I can get this to work across all possible compilers and compiler settings, or be notified if anything breaks.
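For illustration, a generated test entry could look roughly like this sketch (the structure is mine, not actual generator output). Instead of hand-written reference literals, it derives each reference value via memcpy, which is well defined, and compares bitwise so the check is exact even for unusual values:

#include <cstdio>
#include <cstring>

static const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU };

int main()
{
    const float* aliased = reinterpret_cast<const float*>(data);  // the cast under test
    int failures = 0;
    for (unsigned i = 0; i < 3; ++i) {
        float ref;
        std::memcpy(&ref, &data[i], sizeof ref);              // well-defined reference value
        if (std::memcmp(&ref, &aliased[i], sizeof ref) != 0)  // bitwise compare, exact
        {
            std::printf("mismatch at index %u\n", i);
            ++failures;
        }
    }
    return failures == 0 ? 0 : 1;
}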

Cerno