
I have been working on performance optimization and, of course, running regression tests, when I noticed that g++ seems to alter results depending on the chosen optimization. So far I thought that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While old architectures up to ivybridge yield the same results as clang does for any architecture, I get different results with gcc for haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not seem to show this behavior.

Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.

Here is some example code:

#include <iostream>
#include <armadillo>

int main(){
    arma::arma_rng::set_seed(3);                                     // fixed seed for reproducibility
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20,20, 0.1);  // 20x20 sparse complex matrix, 10% density
    arma::sp_cx_mat B = A + A.t();                                   // .t() is the conjugate transpose, so B is Hermitian
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);                          // one eigenvalue of largest magnitude
    std::cout << "eigenvalue: " << eig << std::endl;
}

Compiled using:

g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo

gcc version: 6.2.1

clang version: 3.8.0

Compiled for 64 bit, executed on an Intel Skylake processor.

laolux
  • Do you compile for 32 or 64 bit? – geza Jul 27 '17 at 14:12
  • I remember having such issues with an Intel compiler, which also gave different results on Haswell. On the project I worked on, someone from Intel gave us a whole bunch of compiler flags, basically to not employ the full optimization the compiler offers. I don't know the flags anymore, but I remember it was at -O3. Note that Intel compilers (often) employ the same flags as the GCC compilers. – Ramon van der Werf Jul 27 '17 at 14:18
  • I have heard about differences with the Intel compiler before, and refrained from using -O3 for good reason. What surprises me here is that gcc gives a different result than gcc itself with a different -march... And especially that gcc -march=core2,sandybridge,ivybridge gives the same as clang -march=core2,sandybridge,ivybridge,haswell,broadwell. – laolux Jul 27 '17 at 15:29
  • I think this happens because of the fused-multiply-add instructions. Can you reproduce the differences if you use `-mno-fused-madd`? – geza Jul 27 '17 at 15:47
  • @geza: You are right. If I use `-mno-fused-madd` I get rid of the differences (but get a deprecation warning ;-) ). So does this mean that clang does not use fused-multiply-add, even when I am using `-O3` and `-march=haswell`? In any case I do not observe any real speed improvements by fused-multiply-add in my project. – laolux Jul 27 '17 at 16:01

1 Answer


This is because GCC uses the fused multiply-add (fma) instruction by default, if it is available. Clang, on the contrary, doesn't use it by default, even if it is available.

The result of a*b+c can differ depending on whether fma is used or not; that's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports fma).
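
A minimal sketch (not part of the original answer) of the extra rounding step: built with -ffp-contract=off, the plain expression rounds a*b to double before adding c, while std::fma rounds only once, so the two printed values can differ. Building the same file with -ffp-contract=fast and -march=haswell should make the plain line match the fused one, because the compiler is then allowed to contract a*b + c into a single fma.

#include <cstdio>
#include <cmath>

int main(){
    // volatile keeps the compiler from folding the expressions at compile time
    volatile double a = 1.0 / 3.0;
    volatile double b = 3.0;
    volatile double c = -1.0;

    double plain = a * b + c;            // two roundings without fma, one with fma
    double fused = std::fma(a, b, c);    // always a single rounding

    std::printf("plain: %.17g\n", plain);
    std::printf("fused: %.17g\n", fused);
}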

You can control whether this feature is used with -ffp-contract=XXX:

  • With -ffp-contract=off, you won't get fma instructions.
  • With -ffp-contract=on, you get fma instructions only where contraction is allowed by the language standard. In the current version of GCC this behaves like off, because that mode is not implemented yet.
  • With -ffp-contract=fast (the GCC default), you'll get fma instructions.
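
For regression checks like the one in the question, one option (a suggestion based on the flags above, not something stated in the original answer) is to disable contraction explicitly, so that -march no longer changes the result through fma:

g++ -march=haswell -std=c++14 -O2 -ffp-contract=off -o test example.cpp -larmadillo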
geza
  • Generally -O3 and above can reorder operations ("a+b+c+d" is by the language strictly "(((a+b)+c)+d)", but -O3 allows the compiler to rewrite this as "(a+b)+(c+d)", which executes quicker but can produce a numerically different result in finite-precision FP). FMA is slightly different in that a*b+c WITHOUT FMA first calculates a*b and rounds that to double precision before adding c, but FMA does NOT round the product before adding c. As such its use is not forbidden by the -O2 rule about not re-ordering operations, but it can produce different numerical results. – Tim Jul 27 '17 at 17:35
  • @Tim: is it really the case for -O3? Where can I read about this? At first sight, I would think that reordering additions is forbidden, no matter of optimization level. – geza Jul 27 '17 at 17:47
  • Ah in gcc these days it's -ffast-math which is enabled by -Ofast not -O3 (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html), but generally across years of different compilers, before the modern rule of gcc & clang etc, the folk rule was that -O2 would be the limit of optimisations that were safe and wouldn't change results. The similar behaviour in MSVC is /fp:fast. My comment was just to say that FMA is NOT quite the same as -ffast-math as it doesn't involve re-ordering operations but it's actually primarily driven by the need to maintain precision that is otherwise discarded. – Tim Jul 27 '17 at 18:02
  • Okay then, `-Ofast` is quite different from `-O3`, exactly because it includes `-ffast-math`. I think that the optimization levels -O[1-X] have never been allowed to reorder floating point operations. Yes, FMA is not `-ffast-math`, that's true :) But applying it does change a program's behavior. Actually, for me, it is strange that GCC's default is `-ffp-contract=fast`. It could break a program which was fine before. – geza Jul 27 '17 at 18:12
  • the compiler will still respect precedence (ie mult before add) order, but it can pretend FP maths is associative (actually the flag might be -fassociative-math) which, of course, it isn't when working in fixed precision. Note that the re-ordered version is not "less accurate" mathematically, depending on the values involved it *may* be more accurate, but while the re-ordering does break the strict rules of C++, many people find this a useful optimisation (you can simply add brackets to an expression yourself to get the same result and performance boost). – Tim Jul 27 '17 at 18:19
  • @Tim, yes, these are true. The only thing I'm saying is that `-O[1-X]` doesn't allow **any** reordering. `-fassociative-math` is not included in `-O[1-X]`. Only the newer (previously non-existent) `-Ofast` includes it. – geza Jul 27 '17 at 18:40
  • under modern gcc maybe not, I over generalised (or failed to be sufficiently specific), and for that I apologise... Historically under some compilers such optimisations could be invoked by different combinations of flags (but we live in more civilised times these days). My original intent was only to reinforce your deduction that it was use of FMA not such re-ordering (which is often blamed in such cases) that was triggering the effect – Tim Jul 27 '17 at 20:20
  • Thanks for all the explanations and discussions. So now I know what to use when regression checking and what to use for real computations. Now if I understand your comments correctly, then I can use `-fassociative-math` and still get good results, even though they are not standard compliant. Now is it also "safe" to use `-freciprocal-math`? I know some option of `-Ofast` is unsafe, because in my real program it leads to `nan`s, but I have not yet figured out which option causes that. – laolux Jul 27 '17 at 20:42
  • @Hannebambel: I can say my favorite sentence here, "It depends on the exact conditions". Neither `-fassociative-math` nor `-freciprocal-math` is safe in all computations. There are a lot of computations where these don't matter at all. But there are math routines where they matter a lot. For example, SVD (singular value decomposition) routines are usually very sensitive. These routines are carefully designed, and any of these options can ruin them. Not in general, but for specific inputs their precision will be ruined, and they will return totally bad values. – geza Jul 27 '17 at 21:03
  • @Hannebambel: so I recommend you check out what effect these switches have. Do tests with/without them, and you'll see which options can be used for your usage. (An example of mine: I ran a simulation which used SVD. Using `-ffast-math`, the simulation was mostly OK. But there were states where the simulation suddenly became invalid. Turning off `-ffast-math` solved the problem. But yes, **maybe** turning on just `-fassociative-math` doesn't cause a problem. It was a long time ago, though, when only `-ffast-math` existed and there were no separate options.) – geza Jul 27 '17 at 21:10
  • @geza: Yeah, I will try on sample systems and hope the results hold true when I scale up. Luckily I don't use too many sophisticated algorithms. Most of my time I spend doing sparse matrix-vector multiplications :-) Just interesting to hear that rounding differences (which I guess should be the main side effect of `-fassociative-math`) make math routines explode. – laolux Jul 27 '17 at 21:11
  • @Hannebambel: for that case, it is likely that it is safe to turn on `-fassociative-math`. If it is not the case, please share your results, I'm curious :) – geza Jul 27 '17 at 21:17
  • Ok, the option crashing my program (returning nans) is `-fcx-limited-range`. The rest of `-Ofast` works fine with gcc. Clang works with `-Ofast`, but only when disabling `-fopenmp`. Otherwise it won't even compile... In total clang generates about 30% faster code using `-O3` than gcc with `-Ofast -fno-cx-limited-range`. Results of course only applicable to my specific program. – laolux Jul 28 '17 at 11:37
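
As a footnote to the reordering discussion in the comments above, here is a minimal sketch (not part of the original thread) showing that double-precision addition is not associative, which is why -fassociative-math (and therefore -ffast-math / -Ofast) can change results even though it is a separate issue from fma contraction:

#include <cstdio>

int main(){
    double big = 1e16, small = 1.0;

    double left  = (big + small) + small;   // 1e16 + 1 rounds back to 1e16
    double right = big + (small + small);   // 1e16 + 2 is exactly representable

    std::printf("left : %.17g\n", left);    // 10000000000000000
    std::printf("right: %.17g\n", right);   // 10000000000000002
}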