
I like to run my code with floating point exceptions enabled. I do this under Linux using:

feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW );

So far so good.

The issue I am having is that sometimes the compiler (I use clang 8) decides to use SIMD instructions to do a scalar division. Fine: if that is faster, even for a single scalar, why not.

But the result is that an unused lane in the SIMD register can contain a zero.

And when the SIMD division is executed, that lane raises a floating point exception.

Does that mean that floating point exceptions cannot be used at all if you allow the compiler to use SSE/AVX extensions?

In my case, this line of C code:

float a0, min, a, d;
...
a0 = (min - a) / (d);

...is executed as:

divps  %xmm2,%xmm3

Which then raises:

Thread 1 "noisetuner" received signal SIGFPE, Arithmetic exception.
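When chasing a trap like this, it helps to know *which* exception fired. Below is a sketch, assuming Linux/glibc, of a SIGFPE handler that reports `si_code`; the helper names are hypothetical. For the padding-lane case, where the unused lanes compute 0.0/0.0, one would expect `FPE_FLTINV` (invalid operation) rather than `FPE_FLTDIV`, which separates it from a genuine division by zero in the lane you actually use.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map a SIGFPE si_code to a readable name (hypothetical helper). */
static const char *fpe_code_name(int code)
{
    switch (code) {
    case FPE_FLTDIV: return "floating-point divide by zero";
    case FPE_FLTINV: return "floating-point invalid operation";
    case FPE_FLTOVF: return "floating-point overflow";
    case FPE_FLTUND: return "floating-point underflow";
    case FPE_FLTRES: return "floating-point inexact result";
    case FPE_INTDIV: return "integer divide by zero";
    default:         return "other";
    }
}

/* Write a NUL-terminated string with write(2), which is async-signal-safe
 * (stdio functions are not). */
static void put_str(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    ssize_t r = write(STDERR_FILENO, s, n);
    (void)r;
}

static void on_sigfpe(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    put_str("SIGFPE: ");
    put_str(fpe_code_name(info->si_code));
    put_str("\n");
    /* Returning after a genuine FP trap is undefined behaviour, so exit. */
    _exit(1);
}

/* Install on_sigfpe for SIGFPE. Returns 0 on success, -1 on error. */
static int install_sigfpe_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}
```

Installing the handler before enabling the traps means the program prints the exception kind instead of just dying (a debugger such as gdb will still stop on the signal first).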
Peter Cordes
Bram
  • Does clang have an equivalent for GCC's `-ftrapping-math` to make FP exceptions a visible side-effect? (Note that GCC's version of that option is on by default, but is actually broken: it fails to stop GCC from doing some optimizations that change the number or type of FP exceptions, possibly including from 0 to non-zero IIRC.) – Peter Cordes Jul 28 '20 at 02:03
  • clang doesn't complain when I feed it `-ftrapping-math` but it doesn't fix it. To stop the FPE, I have to supply `-mno-mmx -mno-sse` arguments. – Bram Jul 28 '20 at 02:41
  • File a bugreport. – EOF Jul 28 '20 at 07:16
  • Are you sure it generates a `divps` and not a `divss`? Can you provide a minimal reproducible example? – chtz Jul 28 '20 at 07:26
  • @chtz Not OP but it's very easy to repro, see here: https://godbolt.org/z/Wd98eG – Soonts Jul 28 '20 at 20:23
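The linked Compiler Explorer example is not reproduced here. As a stand-in, here is a hypothetical function of the shape discussed in this thread: two independent float divisions whose operands are adjacent in memory, which a vectorizing compiler may pack into the low two lanes of one `divps`. Whether that actually happens depends on the compiler version and flags; compiling with `-O2 -S` and searching the output for `divps` versus `divss` is one way to check.

```c
/* Hypothetical minimal repro. An SLP vectorizer may load each pair with a
 * 64-bit load (leaving the upper two lanes zero) and divide all four lanes
 * at once, so the upper lanes compute 0.0f / 0.0f. */
typedef struct {
    float x, y;
} vec2;

vec2 div2(const vec2 *num, const vec2 *den)
{
    vec2 r;
    r.x = num->x / den->x; /* lane 0 */
    r.y = num->y / den->y; /* lane 1 */
    return r;
}
```

The function is deliberately non-`static` so the compiler keeps a standalone copy whose assembly can be inspected.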

1 Answer


I think you have found a bug in clang or maybe in llvm.

I have reproduced it: clang 10.0 emits the same code, i.e. it has that bug as well. Clearly, that vdivps instruction only has valid data in the first 2 lanes of the vectors; in the upper 2 lanes it computes 0.0 / 0.0, so you’ll get a runtime exception if you unmask these exceptions in the MXCSR register like you’re doing.

The Microsoft, Intel, and GCC compilers don’t emit divps for that code. If you can, switch to GCC and it should be fine.

Update: Clang 10 and later have an option that controls such optimizations, -ffp-exception-behavior=maytrap; take a look: https://godbolt.org/z/WG7bEE
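If only a few functions need this guarantee, Clang also accepts a per-scope pragma. The sketch below assumes `#pragma clang fp exceptions(maytrap)`, which as far as I know requires Clang 12 or later; older Clang versions need the command-line flag above instead, and the `#if` keeps other compilers from seeing the pragma. The function and type names are hypothetical.

```c
/* Sketch: ask Clang to preserve FP exception semantics in this function only.
 * Assumption: '#pragma clang fp exceptions' is available (Clang 12+).
 * With Clang 10/11, compile the whole file with:
 *     clang -O2 -ffp-exception-behavior=maytrap ...
 */
typedef struct {
    float x, y;
} vec2;

vec2 div2_maytrap(const vec2 *num, const vec2 *den)
{
#if defined(__clang__) && __clang_major__ >= 12
#pragma clang fp exceptions(maytrap) /* must come first in the block */
#endif
    vec2 r;
    r.x = num->x / den->x;
    r.y = num->y / den->y;
    return r;
}
```

The trade-off is the one discussed in the comments: keeping the exception behaviour exact can cost the vectorized division, which is why the pragma is scoped to one function rather than the whole file.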

Soonts
  • Note that gcc misses that optimization even with `-fno-trapping-math` and `#pragma STDC FENV_ACCESS OFF` https://godbolt.org/z/bGoe1n. So unfortunately even when you do want the optimization, you can't get it with GCC. (Even with `-ffast-math`, actually). Ironically, storing x and y to a `dst[0]` and `dst[1]` output array (making divps an even better optimization, no shuffle needed) defeats both clang and GCC's auto-vectorizer. – Peter Cordes Jul 28 '20 at 21:19
  • Looks like clang could easily avoid this by replacing both `movsd` with `movddup` (which at least on not too old architectures has the same port usage). – chtz Jul 28 '20 at 21:20
  • @chtz: oh good point, yes on Nehalem and later (and some AMD I think/hope), `movddup` is a pure load with the broadcast handled by the load port, no vector ALU. That would require special exception-safe vectorization support to look for, which may not succeed enough of the time to be worth looking for (given the cost in compile time). – Peter Cordes Jul 28 '20 at 21:23
  • The -ffp-exception-behavior=maytrap flag makes this issue go away on clang-10. Thanks. – Bram Oct 17 '20 at 02:41