
I understand gcc's -ffast-math flag can greatly increase speed for floating-point operations, and that it goes outside of the IEEE standards, but I can't seem to find information on what is really happening when it's on. Can anyone please explain some of the details and maybe give a clear example of how something would change if the flag was on or off?

I did try digging through S.O. for similar questions but couldn't find anything explaining the workings of -ffast-math.

einpoklum
Ponml

2 Answers


-ffast-math does a lot more than just break strict IEEE compliance.

First of all, it does of course break strict IEEE compliance, allowing e.g. the reordering of operations into something that is mathematically the same (ideally) but not exactly the same in floating point.
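
As a hedged sketch of why this matters for performance (assuming the target has SIMD and the loop is otherwise vectorizable), a floating-point sum reduction can only be vectorized if the compiler is allowed to reassociate the additions:

float sum(const float *a, int n)
{
    float s = 0.0f;
    /* Strict IEEE order is ((s + a[0]) + a[1]) + ...; with -fassociative-math
       (implied by -ffast-math) the compiler may keep several partial sums in
       vector lanes and add them together at the end, which changes the
       rounding slightly. */
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}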

Second, it disables setting errno after single-instruction math functions, which means avoiding a write to a thread-local variable (this can make a 100% difference for those functions on some architectures).
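
To make the errno point concrete, here is a small sketch (the function name is made up): with -fno-math-errno, GCC can emit a bare hardware square-root instruction, so a caller that relies on errno may stop seeing domain errors:

#include <errno.h>
#include <math.h>
#include <stdio.h>

double checked_sqrt(double x)
{
    errno = 0;
    double r = sqrt(x);        /* may become a single sqrtsd with -fno-math-errno */
    if (errno == EDOM)         /* under -ffast-math this branch may never be taken */
        fprintf(stderr, "sqrt domain error\n");
    return r;
}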

Third, it makes the assumption that all math is finite, which means that no checks for NaN (or zero) are made in places where they would have detrimental effects. It is simply assumed that this isn't going to happen.
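
For example, a NaN check can be folded to a constant because NaN is assumed never to occur (a minimal sketch; the exact behavior depends on the GCC version):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double d = NAN;
    /* With -ffinite-math-only (part of -ffast-math), GCC may fold isnan(d)
       to 0 at compile time, even though d really is NaN at run time. */
    printf("%d\n", isnan(d));
    return 0;
}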

Fourth, it enables reciprocal approximations for division and reciprocal square root.
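
A sketch of the kind of rewrite -freciprocal-math (part of -ffast-math) permits; the actual instruction selection (e.g. approximate rcpps/rsqrtps on x86) depends on the target and other flags:

void scale(float *v, int n, float d)
{
    for (int i = 0; i < n; ++i)
        v[i] = v[i] / d;   /* may be computed as v[i] * (1.0f / d), with the
                              reciprocal hoisted out of the loop or approximated */
}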

Further, it disables signed zeros (code assumes signed zero does not exist, even if the target supports it) and rounding math, which enables, among other things, constant folding at compile time.
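
Two identities that become legal to fold only under these assumptions (a sketch; whether a given GCC version actually folds them may vary):

double add_zero(double x)
{
    /* Not foldable to x under strict IEEE: (-0.0) + 0.0 is +0.0, so the
       identity only holds once signed zeros are ignored (-fno-signed-zeros). */
    return x + 0.0;
}

double mul_zero(double x)
{
    /* Not foldable to 0.0 under strict IEEE: x could be NaN, Inf, or -0.0.
       With -ffinite-math-only and -fno-signed-zeros it can become a constant. */
    return x * 0.0;
}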

Last, it generates code that assumes that no hardware interrupts can happen due to signalling/trapping math (that is, if these cannot be disabled on the target architecture and consequently do happen, they will not be handled).
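
A sketch of what -fno-trapping-math allows: the compiler may speculate floating-point operations past the branch that guards them, because it assumes they can never raise a trap:

double guarded_div(double x, double y, int do_divide)
{
    /* With -fno-trapping-math (part of -ffast-math) the division may be
       computed unconditionally and selected afterwards; a division by zero
       that the branch was meant to prevent could then still be executed. */
    if (do_divide)
        return x / y;
    return 0.0;
}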

Damon
    Damon, thanks! Can you add some references? Like [gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html](http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) "`-ffast-math` *Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined.*" and something from glibc, like (`math.h` near math_errhandling) "*By default all functions support both errno and exception handling. In gcc's fast math mode and if inline functions are defined this might not be true.*" – osgx Mar 03 '14 at 00:39
  • @osgx: Well, you basically provided the reference yourself :-) The GCC page you linked to contains the quote ("Sets -fno-math-errno, -funsafe-math...") you posted plus descriptions of all the sub-options that `-ffast-math` enables (including associative math, which is in turn enabled by unsafe math). – Damon Mar 03 '14 at 00:51
  • So if I want fast math, should I use `-ffast-math`, or is it too dangerous? Can I take arbitrary applications, compile them with `-ffast-math`, and use them? – Oleg Vazhnev Nov 14 '14 at 11:37
  • @javapowered: Whether it is "dangerous" depends on what guarantees you need. `-ffast-math` allows the compiler to cut some corners and break some promises (as explained), which in general is not dangerous as such and not a problem for most people. For most people, it's the same, only faster. However, if your code _assumes and relies on_ these promises, then your code may behave differently than you expect. Usually, this means that the program will _seem_ to work fine, mostly, but some outcomes may be "unexpected" (say, in a physics simulation, two objects might not collide properly). – Damon Nov 14 '14 at 12:06
  • An example, suppose your processor has a 'max' instruction for floats, which will be a lot faster than a library call. Normally a call to the library function `fmaxf` can only be replaced by this operation if it has the right behavior with respect to NaNs etc. `fast-math` will allow the compiler to use the instruction, as long as it works for finite values. It's possible to instead write your own inline version of a `max` function which just does `(a>b)?a:b`, and you may find that the compiler will use the built-in max instruction for that operation regardless of `fast-math`. – greggo Mar 21 '17 at 17:26
  • Is `-ffast-math` included in `-O3` in GCC? I ask because it seems in my project ([Image Convolution](https://github.com/RoyiAvital/Projects/tree/master/ImageConvolution)) that MSVC 2015 generates much faster code than GCC 7.1. – Royi Aug 05 '17 at 10:03
  • @Royi: The two should be independent of each other. `-O2` generally enables "every" legal optimization, except those that trade size for speed. `-O3` also enables optimizations that trade size for speed. It still maintains 100% correctness. `-ffast-math` attempts to make mathematical operations faster by allowing "slightly incorrect" behavior which is usually not harmful, but would be considered incorrect by the wording of the standard. If your code is indeed **much** different in speed on two compilers (not just 1-2%) then check that your code is strictly standards compliant and ... – Damon Aug 05 '17 at 10:11
  • So `-O3` doesn't include `-ffast-math` and I should include both, right? – Royi Aug 05 '17 at 10:13
  • ... produces zero warnings. Also, make sure you do not get in the way of aliasing rules and things like auto-vectorization. In principle, GCC should perform at least as well as (usually better than, in my experience) MSVC. When that isn't the case, you've probably made a subtle mistake which MSVC just ignores but which causes GCC to disable an optimization. You should give both options if you want them both, yes. – Damon Aug 05 '17 at 10:15
  • But in my project [Image Convolution](https://github.com/RoyiAvital/Projects/tree/master/ImageConvolution) I don't need any vectorization from the compiler, as I wrote it by hand, and still GCC is much slower. It is really small code and I don't understand why. – Royi Aug 05 '17 at 10:18
  • @Royi: That code does not look really small and simple to me, not something one could analyse in depth in a few minutes (or even hours). Among other things, it involves a seemingly harmless `#pragma omp parallel for`, and within the loop body you are both reading from and writing to addresses pointed to by function arguments, and do a non-trivial amount of branching. As an uneducated guess, you might be thrashing caches from within your implementation-defined invocation of threads, and MSVC may incorrectly avoid intermediate stores which aliasing rules would mandate. Impossible to tell. – Damon Aug 05 '17 at 10:34
  • @Damon, Can I tell the compiler that the reading part is from a read-only array, so there is no problem there, and that the writes are guaranteed not to be aliased (namely, each thread is targeting different locations)? – Royi Aug 05 '17 at 10:39
  • @Royi: This is going a bit off-topic, but I think the larger-scale problem is that (if I read the OMP code correctly, which I'm not good at, I prefer to do my threading by hand) you do a somewhat unlucky parallelization by having one parallel task for each xmm-sized part of each kernel line. The approach that I would choose would be to subdivide the input image in tiles, and process each tile individually. Much easier code to follow, and almost certainly much faster, too (since threads never get in each other's way). – Damon Aug 05 '17 at 10:57
  • Detailed information regarding `-ffast-math` can be found in https://gcc.gnu.org/wiki/FloatingPointMath – phuclv Aug 29 '18 at 07:57
  • Does `-ffast-math` impact math done with external libraries such as GMP? Also if a program doesn't use any advanced math and only does things like increment a loop counter by 1, should `-ffast-math` be used or would it not make a difference? – northerner Jan 30 '19 at 13:17
  • @northerner: No to both. GMP or such could only be affected if you compiled GMP itself with `-ffast-math`, but GMP does arbitrary-precision math, so I would be surprised if it used anything but integer math. Things like loop counters, too, are (normally) integers, so `-ffast-math` won't change a thing. – Damon Jan 30 '19 at 14:09
  • Also `-ffast-math` breaks [std::signbit](https://en.cppreference.com/w/cpp/numeric/math/signbit) on GCC 11.1 ([demo](https://godbolt.org/z/b6rzzd8dM)) – Gabriel Devillers Jun 29 '21 at 13:06
  • As part of #1, `-ffast-math` disables denormals handling in hardware (which speeds up fp math). – rustyx Aug 26 '21 at 12:53

As you mentioned, it allows optimizations that do not preserve strict IEEE compliance.

An example is this:

x = x*x*x*x*x*x*x*x;

to

x *= x;
x *= x;
x *= x;

Because floating-point arithmetic is not associative, the ordering and factoring of the operations will affect results due to round-off. Therefore, this optimization is not done under strict FP behavior.
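
A quick way to see the non-associativity (a small sketch; the exact last digits assume typical IEEE 754 double rounding):

#include <stdio.h>

int main(void)
{
    double a = 0.1, b = 0.2, c = 0.3;
    /* The two groupings round differently and disagree in the last bits. */
    printf("%.17g\n", (a + b) + c);   /* typically 0.60000000000000009 */
    printf("%.17g\n", a + (b + c));   /* typically 0.59999999999999998 */
    return 0;
}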

I haven't checked whether GCC actually performs this particular optimization, but the idea is the same.

Rakete1111
Mysticial
  • How would this speed up execution? – Andrey Sep 14 '11 at 17:54
  • @Andrey: For this example, you go from 7 multiplies down to 3. – Mysticial Sep 14 '11 at 17:55
  • The associativity makes sense, thank you. Do you have any idea what order of magnitude these differences will really affect? Can the rounding happen anywhere, or will just numbers 10^-5 and below possibly be skewed? Do you think simply testing some equations such as your example will give me reasonable results for most cases? Thanks for the answer. – Ponml Sep 14 '11 at 17:58
  • @Andrey: Mathematically, it will be correct. But the result may differ slightly in the last few bits due to the different rounding. – Mysticial Sep 14 '11 at 17:58
  • In most cases, this slight difference won't matter (relatively on the order of 10^-16 for `double`, but varies depending on the application). One thing to note is that `-ffast-math` optimizations don't necessarily add "more" round-off. The only reason why it's not IEEE compliant is because the answer is different (albeit slightly) from what is written. – Mysticial Sep 14 '11 at 18:03
  • @user: The magnitude of the error depends on the input data. It should be small relative to the result. For example, if `x` is smaller than 10, the error in Mysticial's example will be down around 10^-10. But if `x = 10e20`, the error is likely to be many millions. – Ben Voigt Sep 14 '11 at 18:05
  • It also appears to remove a lot of special case checking but I haven't examined what exactly. At least the usual NaN/inf behavior (division by zero etc) seems to be correct with -ffast-math enabled so I have no idea why I would want to have those checks there in the first place. – Tronic Jan 02 '12 at 12:07
  • @Tronic I'm not sure there would be special case checking in the first place. NaN/Inf divide-by-zero is all handled in hardware according to the IEEE standard. The compiler doesn't have to generate any handling code for it. – Mysticial Jan 02 '12 at 12:28
  • A little late, but this is a good reference regarding floating point arithmetic and IEEE rules. It may help understand what changes with `ffast-math`: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Marc Claesen May 22 '13 at 10:20
  • The answer more accurately describes `-funsafe-math` not `-ffast-math`. – stefanct Feb 08 '18 at 19:19
  • @stefanct it's actually about `-fassociative-math`, which is included in `-funsafe-math-optimizations`, which in turn is enabled with `-ffast-math`. [Why doesn't GCC optimize `a*a*a*a*a*a` to `(a*a*a)*(a*a*a)`?](https://stackoverflow.com/q/6430448/995714) – phuclv Aug 29 '18 at 07:55
  • @Mysticial `double d = NAN; std::printf("%d\n", std::isnan(d));` [prints](https://gcc.godbolt.org/z/YpG_j5) `0` with `-ffast-math` as an example special case. – Aykhan Hagverdili Jan 30 '20 at 08:47
  • Is there a way to identify which parts of the code will be optimized by `-ffast-math`, so that I can manually change it to a more optimized version so it runs faster than it would even without `-ffast-math`? – Aaron Franke Nov 29 '20 at 23:23
  • A simpler example would be `x = 3*x/x`, which can be optimized away with fast-math but not without. – tommsch Apr 30 '21 at 08:41