
The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":

int32_t  i = -5;
const double  sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1

Obviously, sqr_N_min_1 becomes 25., and in the 3rd line (-5 * -5) / 25 should become 1., so that the overall result of the 3rd line is exactly 0.. Indeed, this is true with the compiler options "gcc -O3 -march=skylake".

But with "-Ofast" the last line yields -2.081668e-17 instead of 0., and with values of i other than -5 (e.g. 6 or 7) it produces other very small, seemingly random positive or negative deviations from 0.. My question is: where exactly is the source of this imprecision?

To investigate this, I wrote a small test program in C:

#include <stdint.h>      /* int32_t */
#include <stdio.h>
#define MAX_SIZE 10

double W[MAX_SIZE];

int main( int argc, char *argv[] )
{
  volatile int32_t n = 6; /* try 6 7 or argv[1][0]-'0' */
  double           *w = W;
  int32_t          i = 1 - n;
  const int32_t    end = n - 1;
  const double     sqr_N_min_1 = (double)i * i;

  /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
  do {
    *w++ = 1. - ((double)i * i) / sqr_N_min_1;
  } while ( (i+=2) <= end );

  /* Then, show the results (only the 1st and last output line matters): */
  w = W;
  i = 1 - n;
  do {
    fprintf( stderr, "%e\n", *w++ );
  } while ( (i+=2) <= end );

  return( 0 );
}

Godbolt shows me the assembly produced by "x86-64 gcc 9.3" with the options "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please inspect the five columns of the site (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of the 1st assembly, 5. output of the 2nd assembly):

Godbolt site with five columns

As you can see, the differences between the assemblies are obvious, but I can't figure out where exactly the imprecision comes from. So the question is: which assembler instruction(s) are responsible for it?

A follow-up question is: is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C program?

Hartmut Pfitzinger
  • What imprecision? Is it any worse than 'normal' FP imprecision? Details in question please, not link. – Martin James May 10 '21 at 09:21
  • `-Ofast` isn't really stable and might deviate from standard C. It might cut some corners somewhere to drop accuracy for speed. – Lundin May 10 '21 at 09:28
  • Did you try other compilers like [Clang](http://clang.llvm.org/) or [CompCert](http://compcert.inria.fr)? You might also be interested in the [CADNA](https://www-pequan.lip6.fr/cadna/) software project, or in [FLUCTUAT](http://www.lix.polytechnique.fr/~putot/fluctuat.html) (contact me by email for more) – Basile Starynkevitch May 10 '21 at 09:33
  • That's why I added the assembly: you see the FP instructions and registers used. But it's still hard to find which line of the assembly is responsible. I haven't tried other compilers so far... – Hartmut Pfitzinger May 10 '21 at 09:33
  • From a quick glance, the `-Ofast` version is using `vfnmadd132sd` to compute `1 - i*i/sqr_N_min_1` as `1 - y*z`, where `y` is indeed `i*i` but `z` is `1 / sqr_N_min_1` (computed before the loop). The other version uses plain `vmulsd`/`vdivsd`/`vsubsd`. Taking the reciprocal affects the precision, along with the fact that an FMA has more precision than the equivalent sequence of three instructions. – Margaret Bloom May 10 '21 at 09:57
  • @Margaret indeed that sounds like a possible explanation. How to make sure? And the follow-up question? – Hartmut Pfitzinger May 10 '21 at 10:00
  • You know `-Ofast` is a synonym for `-O3 -ffast-math`, right? Part of `-ffast-math` is `-funsafe-math-optimizations`. This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the `-ffast-math` sub-options. – Peter Cordes May 10 '21 at 10:12
  • There's no way of formulating your source to avoid all possible problems. (Except possibly by using `volatile` tmp vars all over the place to defeat optimization, while also making your code slower than regular `-O3`.) – Peter Cordes May 10 '21 at 10:14
  • @HartmutPfitzinger You can remove the filters from the Godbolt output, copy both listings and assemble them (`as example.S -o example.o`), then link them (`gcc -static example.o -g -o example`). Since the first difference already appears in the first iteration, you can debug the program with `gdb` up to that point. [Which I did, taking note of how the value was computed](https://pastebin.com/JLXhUQzW). So my hypothesis was right: you actually get **increased** precision with the FMA (that's what they are for). The reciprocal of 25 is 0.04, which is represented exactly in double. You are... – Margaret Bloom May 10 '21 at 10:34
  • ... thus seeing an increase of precision by the use of `vfnmadd132sd`. As for the follow-up question, I don't know. Massaging the C source to get the assembly you want is usually dangerous and seems like an X-Y problem to me. Maybe there are other ways to achieve what you want. – Margaret Bloom May 10 '21 at 10:36
  • @MargaretBloom: “0.04 which is represented exactly” → “0.04 which is not represented exactly”. – Eric Postpischil May 10 '21 at 10:41
  • @EricPostpischil Honestly, I've only checked that `0x3fa47ae147ae147b` converts back to 0.04. Increasing or decreasing the mantissa won't convert back to 0.04. You may be right; I didn't check the conversion, since I'm doomed by the fact that I'd use the very same hardware for the checking calculation – Margaret Bloom May 10 '21 at 10:47
  • @HartmutPfitzinger: In your update, did you mean *this is true for compiler options `gcc -O3 -march=skylake`*? You actually wrote that `-Ofast -march=skylake` is producing exactly 0. – Peter Cordes May 11 '21 at 11:07
  • @PeterCordes Terrible typo. Corrected. – Hartmut Pfitzinger May 11 '21 at 11:16
  • I really have no idea how to improve the question further to get it reopened – Hartmut Pfitzinger May 12 '21 at 10:35

2 Answers


Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.

Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?

Not in general.

-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath

Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)

This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)


There's no way of formulating your source to avoid all possible problems.

Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.

GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.

Peter Cordes

From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:

  1. Convert i to double and square it.
  2. Divide that by sqr_N_min_1.
  3. Subtract that from 1.

and, for -Ofast, computes it:

  1. Prior to the loop, calculate the reciprocal of sqr_N_min_1.
  2. Convert i to double and square it.
  3. Compute the fused multiply-subtract of 1 minus the square times the reciprocal.

The latter improves speed because it performs the division only once, and multiplication is much faster than division on the target processor. On top of that, the fused operation is faster than a separate multiplication and subtraction.

The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.

Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.

Eric Postpischil