
This C/C++ reduced testcase:

```cpp
int x = ~~~;
const double s = 1.0 / x;
const double r = 1.0 - x * s;

assert(r >= 0); // fail
```

isn't numerically stable, and hits the assert. The reason is that the last calculation can be contracted into an FMA (fused multiply-add), which brings `r` into the negatives.

Clang has enabled FMA by default since version 14, so this is causing some interesting regressions. Here's a running version: https://godbolt.org/z/avvnnKo5E
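For reference, a compilable version of the testcase (a minimal sketch; `x = 5` is one concrete value that triggers the failure, matching the worked example in the comments below):

```cpp
#include <cassert>

int main() {
    int x = 5; // illustrative value; any positive integer works
    const double s = 1.0 / x;     // nearest double to 0.2, slightly above it
    const double r = 1.0 - x * s; // may be contracted into a single FMA
    assert(r >= 0); // fails when the FMA is emitted: r == -2^-54 ≈ -5.55e-17
    return 0;
}
```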


Interestingly enough, if one splits the last calculation in two, then the FMA is not emitted and the result is always non-negative:

```cpp
int x = ~~~;
const double s = 1.0 / x;
const double tmp = x * s;
const double r = 1.0 - tmp;

assert(r >= 0); // OK
```

Is this guaranteed behavior of IEEE 754 / `FP_CONTRACT`, or is this playing with fire, and should one just find a more numerically stable approach? I can't find any indication that FP contractions are only meant to happen "locally" (within one expression), or that a simple split like the one above is sufficient to prevent them.

(Of course, in due time, one could also think of replacing the algorithm with a more numerically stable one, or of adding a clamp to the [0.0, 1.0] range, but that feels hackish.)
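(Side note, added for reference: contraction can also be disabled explicitly instead of relying on the split. A sketch, assuming a compiler that honors the standard pragma; Clang does, while GCC largely ignores it and needs the `-ffp-contract=off` command-line switch instead:)

```cpp
#include <cassert>

// C99/C11 standard pragma; Clang also accepts it in C++.
#pragma STDC FP_CONTRACT OFF

int main() {
    int x = 5; // illustrative value, as above
    const double s = 1.0 / x;
    const double r = 1.0 - x * s; // x * s must be rounded to double before the subtraction
    assert(r >= 0); // holds: no FMA may be emitted here
    return 0;
}
```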

peppe
  • Interesting question; however, can you change the `int x = ~~~;` into something that can be compiled? I mean, *you* may know what it means, and I *may* be able to guess, but why not just give a sample *actual* value that produces your result? – Adrian Mole Jan 02 '23 at 14:59
  • In the 2nd example why do you expect `r >= 0` to be true? My expectation is that a value very close to `0` is produced but it could be +ve or -ve. This argument then extends to the 1st example. – Richard Critten Jan 02 '23 at 15:01
  • Related/duplicate? https://stackoverflow.com/q/49278125/10871073 – Adrian Mole Jan 02 '23 at 15:06
  • @AdrianMole: there's a godbolt link -- for the sake of argument, it can be any number >0. The code is a reduced testcase; the calculation of `r` is done in a loop, indexed with `i` that goes from 0 to `x` (inclusive). The last iteration is broken (`r` gets negative). – peppe Jan 02 '23 at 15:15
  • It's certainly related to that question, but it's not conclusive, I think. Why doesn't a compiler apply the contraction with a trivial rewrite of the expression? Is it because it's not smart enough or because it's not allowed to do so? – peppe Jan 02 '23 at 15:16
  • Notice that I didn't actually cast the duplicate close vote. But, as stated in the answers there, the Standard (C++, in that case, but doubtless the same for C) just says that FMA and 'FP contraction' is allowed; but it doesn't really say much more than that. How specific compilers deal with code optimisation(s) isn't really part of the Standard. Just sayin'. :) – Adrian Mole Jan 02 '23 at 15:22
  • Expecting IEEE floating point expressions to behave exactly like integers (i.e. being predictable) is a non-starter. In your example, s and tmp are **approximations** of a rational value (in this case), so the result r can only be an approximation of zero. However, the precision of this approximation can be computed quite exactly and should be taken into account whenever the result is used. – Michaël Roy Jan 02 '23 at 15:25
  • ... ultimately, I think the issue has little or nothing to do with FMA or other optimisations, but stems from any sort of FP comparison that uses a `>=` (or similar) operator. – Adrian Mole Jan 02 '23 at 15:25
  • People get PhDs studying how to get high precision results from floating-point arithmetic. It's hard. And it's made harder by the fact that compilers by default don't use IEEE arithmetic; they optimize the code to run faster. My understanding (not having a PhD) is that in most cases that's what knowledgeable users want: fast, even if it's slightly less accurate than the IEEE-conforming results would be. And even with full IEEE compliance, the results are often not what we expect, because floating-point math doesn't deal with real numbers; it deals with floating-point values. – Pete Becker Jan 02 '23 at 16:24
  • @RichardCritten: well, without FMA, the result is always >= 0 for any integer input `x` (>0). One can verify this experimentally: https://gcc.godbolt.org/z/a7Wcq5qoq The code has been like that for decades; only recently have compilers turned on FMA by default, causing this kind of interesting regression all over the place. My position is that splitting shouldn't help at all and it's playing with fire, although it seems to "work in practice". – peppe Jan 02 '23 at 16:27
  • @PeteBecker: I'm explicitly talking about IEEE compliance. There's *no* -ffast-math or similar at play here. – peppe Jan 02 '23 at 16:28
  • @peppe -- you need to say that in your question, including whatever compiler switches you've used to try to get IEEE compliance. – Pete Becker Jan 02 '23 at 16:29
  • Well, there's a link to a Godbolt with the actual code running and showing the problem. – peppe Jan 02 '23 at 16:45
  • @MichaëlRoy: That is not a correct model of floating-point arithmetic. The IEEE-754 standard specifies that each floating-point datum that is not a NaN represents one number exactly. It is floating-point operations that approximate real-number arithmetic, not floating-point numbers that approximate real numbers. If you are comparing it to integer arithmetic, then we do not say that integer arithmetic is unpredictable because 7/3 gets a result different from real-number arithmetic. Floating-point arithmetic is well specified and is predictable. – Eric Postpischil Jan 02 '23 at 17:58
  • @AdrianMole: Take `x` = 5. Then `s` is 0.200000000000000011102230246251565404236316680908203125, and an FMA for `1 - x*s` produces -5.5511151231257827021181583404541015625e-17. – Eric Postpischil Jan 02 '23 at 18:01
  • Re “isn't numerically stable”: That is not what [numerically stable](https://en.wikipedia.org/wiki/Numerical_stability) means. Numerical stability is about whether calculations magnify errors or not, or the degree to which they do. You are asking about compilers not implementing floating-point arithmetic in clear well-specified ways. – Eric Postpischil Jan 02 '23 at 18:34
  • @EricPostpischil Right, but there is also the fact that a simple number like 1/3 CANNOT be represented without a one-bit error, which is called Epsilon; for a double with a value of 1, Epsilon equals 2.22045e-16. So, the result of 3.0 - (3.0 * (1.0 / 3.0)) will be close to zero and MAY have an error of +- 2.22045e-16. – Michaël Roy Jan 02 '23 at 19:02
  • Does this answer your question? [Discrepancies in C++ float subtraction](https://stackoverflow.com/questions/66398767/discrepancies-in-c-float-subtraction) – Michaël Roy Jan 02 '23 at 19:10
  • @MichaëlRoy: That same problem is worse in integer arithmetic: `3 - 3 * (1/3)` has an error of 3, hugely larger than ±2.2204e-16. – Eric Postpischil Jan 02 '23 at 19:23
  • @EricPostpischil To paraphrase your previous comment: Every integer number "represents one number exactly", so to follow your logic, the error that shouldn't happen with double should not happen with integer, and for the same reason? Not only does it happen, but it is numerically quantifiable, and quite precisely, to boot. – Michaël Roy Jan 02 '23 at 19:37
  • @MichaëlRoy: What error? I did not say any rounding errors do not occur with `double` arithmetic. I said they occur in the arithmetic, not with the number. A floating-point number or an `int` number is a number. A floating-point arithmetic operation or an integer arithmetic operation is not a real-number arithmetic operation. – Eric Postpischil Jan 02 '23 at 20:05
  • @EricPostpischil And I never said that floating point numbers were not unique. You reap what you sow. – Michaël Roy Jan 02 '23 at 21:39
  • @MichaëlRoy: I'm not sure that answer is relevant here. Here I'm specifically talking about fp contractions and if trivial expression rewrites would enable/disable them or not. The fact that the results are different is why one cares about this. (Of course if someone just happens to have an equivalent calculation that keeps `r>=0` _with or without_ FMA, I'll be very grateful :-), but that's not really my original question) – peppe Jan 02 '23 at 21:53
  • @MichaëlRoy: I did not say you said floating point numbers were not unique. You wrote “s and tmp are **approximations** of a rational value”. As I wrote, that is not a correct model. A floating-point object that is not a NaN represents one number, exactly, not an approximation. The correct model to use, for designing software, for analysis, and for proofs, is that floating-point objects represent numbers exactly and floating-point operations approximate real-number arithmetic by returning results rounded to representable numbers. – Eric Postpischil Jan 02 '23 at 22:09
  • @EricPostpischil That's what I read. I'll quote you: "It is floating-point operations that approximate real-number arithmetic, not floating-point numbers that approximate real numbers"... I maintain that IEEE floating point variables are NOT precise, and approximate real numbers. The reasoning behind my assertion that IEEE double variables hold approximate values is: _approximate operations => approximate results => approximate stored values_. If you think this reasoning is wrong, please demonstrate. – Michaël Roy Jan 03 '23 at 00:08
  • @MichaëlRoy: It is specified in the IEEE-754 standard. Its sections on encodings specify exactly what number each encoding represents, and there is no statement they represent anything else. That is the *specification* of what each encoding represents. Its sections on the operations and rounding specifies that the operations produce results equivalent to performing real-number arithmetic on the represented values and then rounding the real-number arithmetic result to the nearest representable value (according to a selected rounding rule). Those rules are how conforming implementations behave. – Eric Postpischil Jan 03 '23 at 03:19
  • @MichaëlRoy: Further, if we apply the notion that because operations may produce results different from the real-number arithmetic to integer arithmetic, we see how absurd it is. In integer arithmetic, `15/8` yields 1, but we do not say that means 1 is an approximation for 1⅞. We just know that integer arithmetic yields different results from real-number arithmetic, and we use it accordingly. You can use floating-point arithmetic to calculate approximations of things you would like to calculate with real-number arithmetic, but interpretation of the results as approximations is… – Eric Postpischil Jan 03 '23 at 03:22
  • … application-specific and is not part of the floating-point specification. People may be tempted to think of 0.333333333333333314829616256247390992939472198486328125 as approximating ⅓, but, when you get zoom into a graph of the nearby floating-point numbers, including 0.333333333333333314829616256247390992939472198486328125 and 0.33333333333333337034076748750521801412105560302734375 and look at where ⅓ falls between them, you see it is similar to 1⅞ falling between 1 and 2. People just are not paying attention to the actual numbers and computations; they gloss over floating-point semantics. – Eric Postpischil Jan 03 '23 at 03:23
  • @EricPostpischil What are you talking about? An IEEE double can only be expressed in 17 significant digits maximum. That's actually part of the specification: any IEEE 754 double that is converted to text in 17 significant digits is convertible back to an identical IEEE 754 double format. Conversely, any IEEE approximation of a rational or an irrational number can only be expressed in 17 significant digits when converted to its base-10 text representation. The limitations ARE part of the specification, and of ALL applications that deal with floating point variables. – Michaël Roy Jan 03 '23 at 04:13
  • If engineering applications do not gloss over the issue, it is because they are only concerned with getting a result within a certain precision, which is most commonly computed in an integer amount of parts per million, or 6 significant digits or less. That's the reason why you'll often see rounding being used when comparing numbers and ranges in engineering software. That's not glossing over the issue, it's being practical. – Michaël Roy Jan 03 '23 at 04:19
  • @MichaëlRoy: Re “An IEEE double can only be expressed in 17 significant digits maximum”: You can convey what the represented value is in 17 significant digits, but that decimal numeral is not the actual value. For the binary64 format, commonly used for `double`, IEEE-2017 3.4 specifies that the value represented by the bits 0x3fd5555555555555 is (using binary for the significand) 1.0101010101010101010101010101010101010101010101010101•2^−2, which equals 0.333333333333333314829616256247390992939472198486328125… – Eric Postpischil Jan 03 '23 at 15:27
  • … If somebody shows you “0.33333333333333331”, you can figure out that the nearest representable value is 0.333333333333333314829616256247390992939472198486328125, but the value represented is 0.333333333333333314829616256247390992939472198486328125 (equivalently, 6,004,799,503,160,661•2^−54), not 0.33333333333333331, per the specification in IEEE-2017 3.4. – Eric Postpischil Jan 03 '23 at 15:28
  • 0.33333333333333331... is the value being stored, and 1/3 is the value it represents. Since these two values are NOT the same, this clearly shows that that the actual (aka **concrete**) value being stored is an approximation of what we wanted to store (the **logical**) value. No IEEE operation needed, since the logical value is a constant. The IEEE precission for 1/3 is +- Epsilon/3, and the error is a bit less than 2 EXP -17, which is more than acceptable for most tasks, yet should be taken into account when coding. Note that the precision is a finite number, which can be precisely – Michaël Roy Jan 03 '23 at 17:08
  • expressed in IEEE format, while the error is a rational value, which cannot be expressed precisely with an IEEE number. – Michaël Roy Jan 03 '23 at 17:10

1 Answer


The C++ standard allows floating-point expressions to be evaluated with extra range and precision because C++ 2020 draft N4849 7.1 [expr.pre] 6 says:

The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.

However, note 51 tells us:

The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.

The meaning of this is that an assignment or a cast must convert the value to the nominal type. So if extra range or precision has been used, that result must be converted to an actual double value when an assignment to a double is performed. (I expect, for this purpose, assignment includes initialization in a definition.)

So `1.0 - x * s` may use a fused multiply-add, but `const double tmp = x * s; const double r = 1.0 - tmp;` must compute a `double` result for `x * s`, then subtract that `double` result from `1.0`.
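To make this concrete, here is a small sketch (not part of the original answer) that reproduces the contracted evaluation by hand with `std::fma`, using `x = 5` as in the comments above; the `volatile` store guards the split version against contraction:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    int x = 5;
    const double s = 1.0 / x; // nearest double to 0.2, slightly above it
    // Contracted form: x * s is kept exact inside the FMA, so the tiny
    // excess over 1.0 survives the subtraction and the result is negative.
    double contracted = std::fma(-static_cast<double>(x), s, 1.0);
    // Split form: the volatile store forces x * s to be rounded to a
    // double (exactly 1.0 here) before the subtraction.
    volatile double tmp = x * s;
    double split = 1.0 - tmp;
    std::printf("contracted = %.17g\n", contracted); // -5.5511151231257827e-17
    std::printf("split      = %.17g\n", split);      // 0
    return 0;
}
```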

Note that this does not preclude `const double tmp = x * s;` from using extra precision to compute `x * s` and then rounding again to get a `double` result. In rare instances, this can produce a double-rounding error, where the result differs slightly from what one would get by rounding the real-number-arithmetic result of `x * s` directly to a `double`. That is unlikely to occur in practice; a C++ implementation would have no reason to compute `x * s` with extra precision and then round it to `double`.

Also note that C and C++ implementations do not necessarily conform to the C or C++ standard.

Eric Postpischil
  • Thank you. Just nitpicking, though: `double tmp = x * s;` isn't using a cast, and it's not an assignment -- it's "just" initialization. So in theory a compiler can store it in arbitrary precision, and when used in `1.0 - tmp`, still convert all of that into an FMA. This is what GCC ends up doing, for instance. https://gcc.godbolt.org/z/Y5Ecd7PxG – peppe Jan 02 '23 at 18:38
  • @peppe Interesting. For C, we have (e.g., C99, §6.7.8p11): "The initializer for a scalar shall be a single expression, optionally enclosed in braces. The initial value of the object is that of the expression (after conversion); the same type constraints and conversions as for simple assignment apply [...]", which links initialization semantics to assignment semantics. Is there anything equivalent in the C++ standard? – Mark Dickinson Jan 03 '23 at 18:52
  • Looks like the initialization / assignment distinction isn't relevant here: https://gcc.godbolt.org/z/YP8sMz78Y – Mark Dickinson Jan 03 '23 at 19:10
  • In GCC, it looks like the `-fexcess-precision` is related to this (or `-ffloat-store`). " By default, `-fexcess-precision=fast` is in effect; this means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and **it is unpredictable when rounding to the types specified in the source code takes place**." https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – peppe Jan 04 '23 at 00:13
  • But indeed, that's non-Standard. I'm coming to the conclusion that the Standards (C and C++) say to do something, and compilers (in the name of speed) just ignore it. – peppe Jan 04 '23 at 00:22