
I am evaluating the use (clearing and querying) of floating-point exceptions in performance-critical ("hot") code. Looking at the generated binary, I noticed that neither GCC nor Clang expands the call into the inline sequence of instructions I would expect; instead, they emit a call to the runtime library. This is prohibitively expensive for my application.

Consider the following minimal example:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Reads MXCSR directly; covers SSE/AVX exception flags only. */
static inline int fetestexcept_inline(int e)
{
    unsigned int mxcsr;
    asm volatile ("vstmxcsr %0" : "=m" (mxcsr));
    return mxcsr & e & FE_ALL_EXCEPT;
}

double f1(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
    else return r;
}

double f2(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
    else return r;
}

And the output as produced by GCC: https://godbolt.org/z/jxjzYY

The compiler seems to know that it can use AVX instructions for the target (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it always produces the much more expensive function call to glibc rather than the assembly that (as far as I understand) should do what the corresponding glibc function does.

This is not intended as a complaint; I am OK with adding the inline assembly. I just wonder whether I am overlooking a subtle difference that would amount to a bug in the inline-assembly version.

Peter Cordes
Ikaros
  • 1
    I'd assume that it's partly a matter of keeping the compiler simple, and only treating a few functions that are heavily-used in lots of people's code as builtins (like `sqrt()`). `fetestexcept` is much more rarely used. Also, GCC is not particularly good at FP exception semantics; even though `-ftrapping-math` is enabled by default, it doesn't always stop GCC from doing transformations that eliminate or create FP exceptions vs. naive evaluation in the C++ abstract machine. – Peter Cordes Jan 25 '21 at 20:28
  • 2
    Hmm, on 2nd thought, fenv functions might need special handling anyway to not reorder them around FP math operations? So unless GCC always treats the FP environment as visible program state that functions can see, it would need something special for them anyway. – Peter Cordes Jan 25 '21 at 20:30
  • 1
    @PeterCordes Yes, as far as I can tell, they do. This is the motivation for the r == 0 || to incentivize the compiler to actually perform the computation before the exception test. In this example it doesn't really make as much sense as in my actual code. – Ikaros Jan 25 '21 at 20:35
  • 2
    Neither gcc nor clang handles fenv_access, so you are out of luck anyway. – Marc Glisse Jan 25 '21 at 20:41
  • 2
    You could use `_mm_getcsr` instead of the inline asm. – Marc Glisse Jan 25 '21 at 20:45
  • 1
    @MarcGlisse I think Clang 12 will, at least judging from the current development builds ( see https://godbolt.org/z/4qfhfa with clang-trunk ). – Ikaros Jan 25 '21 at 20:46
  • Ah, I knew they were working on it, but I thought they had given up. Good to know they made progress, I'll have to play with it (the generated code doesn't seem very optimal, but the priority is correctness). – Marc Glisse Jan 25 '21 at 20:52
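
The `_mm_getcsr` suggestion from the comments could look roughly like the sketch below. The function name `fetestexcept_sse` is made up for illustration. It assumes an x86 target where all floating-point math goes through SSE/AVX, and it relies on the x86 `FE_*` constants having the same bit positions as the MXCSR status flags. Like the inline-asm version, it does not by itself stop the compiler from moving floating-point operations across the read.

```c
#include <fenv.h>
#include <xmmintrin.h>  /* _mm_getcsr */

/* Hypothetical helper: tests exception flags in MXCSR only.
   Flags raised by x87 (long double) operations are not seen here. */
static inline int fetestexcept_sse(int excepts)
{
    return (int)(_mm_getcsr() & (unsigned int)excepts & FE_ALL_EXCEPT);
}
```

Used in place of `fetestexcept_inline`, this avoids hand-written assembly and lets the compiler pick the `stmxcsr` or `vstmxcsr` encoding itself.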

1 Answer


fetestexcept is required to support long double arithmetic. It needs to merge the SSE and x87 FPU states because long double operations update only the x87 status word, not the MXCSR register. Therefore, the benefit from inlining is somewhat reduced.

Florian Weimer
  • 1
    So it could still make sense for `-mlong-double-64` (or `-mlong-double-128`?) with `-mfpmath=sse` or if we detect that there are no long double operations between clearexcept and testexcept, but clearly that's not a priority, when gcc does not support fenv_access, and clang-12 generates the most unoptimized code it can. – Marc Glisse Jan 26 '21 at 12:44
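
To illustrate the merging described in the answer, here is a minimal sketch of an inline test that reads both the x87 status word and MXCSR. This is not glibc's actual source; the name `fetestexcept_merged` is hypothetical, and the code assumes GCC/Clang extended asm on x86-64, where the `FE_*` bits line up with the low flag bits of both registers.

```c
#include <fenv.h>

/* Hypothetical helper, not glibc's implementation: combines the x87 status
   word (updated by long double math) with MXCSR (updated by SSE/AVX math). */
static inline int fetestexcept_merged(int excepts)
{
    unsigned short x87_sw;
    unsigned int mxcsr;
    __asm__ volatile ("fnstsw %0" : "=m" (x87_sw));   /* x87 exception flags */
    __asm__ volatile ("stmxcsr %0" : "=m" (mxcsr));   /* SSE exception flags */
    return (int)((x87_sw | mxcsr) & (unsigned int)excepts & FE_ALL_EXCEPT);
}
```

The extra `fnstsw` is the cost of supporting `long double`; code that never uses x87 arithmetic between clearing and testing could skip it, which is the case the last comment describes.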