
I have implemented the same function once in C++ and once in Fortran:

#include <math.h>

void dbl_sqrt_c(double *x, double *y){
   *y = sqrt(*x - 1.0);
   return;
}

and

pure subroutine my_dbl_sqrt(x,y) bind(c, name="dbl_sqrt_fort")
   USE, INTRINSIC :: ISO_C_BINDING
   implicit none
   real(kind=c_double), intent(in)  :: x
   real(kind=c_double), intent(out) :: y

   y = sqrt(x - 1d0)
end subroutine my_dbl_sqrt

I compared them in the compiler explorer:

Fortran: https://godbolt.org/z/froz4rx97
C++: https://godbolt.org/z/45aex99Yz

The way I read the assembly, they do basically the same thing, except that C++ checks whether the argument of the sqrt is negative, which Fortran doesn't. I compared their performance using Google Benchmark, but they are pretty evenly matched:

--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_dbl_c/8          2.07 ns         2.07 ns    335965892
bm_dbl_fort/8       2.06 ns         2.06 ns    338643106
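
The benchmark harness looked roughly like this (a reconstructed sketch, not the exact harness; only the bm_dbl_c / bm_dbl_fort names and the /8 argument come from the output above, the rest is assumed):

#include <benchmark/benchmark.h>

void dbl_sqrt_c(double *x, double *y);                 // the C++ function above
extern "C" void dbl_sqrt_fort(double *x, double *y);   // bind(c) name of the Fortran routine

static void bm_dbl_c(benchmark::State &state) {
   double x = static_cast<double>(state.range(0)), y = 0.0;
   for (auto _ : state) {
      dbl_sqrt_c(&x, &y);
      benchmark::DoNotOptimize(y);   // keep the result live
   }
}

static void bm_dbl_fort(benchmark::State &state) {
   double x = static_cast<double>(state.range(0)), y = 0.0;
   for (auto _ : state) {
      dbl_sqrt_fort(&x, &y);
      benchmark::DoNotOptimize(y);
   }
}

BENCHMARK(bm_dbl_c)->Arg(8);
BENCHMARK(bm_dbl_fort)->Arg(8);
BENCHMARK_MAIN();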

Here is the interesting part. If I turn these into integer-based functions:

void int_sqrt_c(int *x, int *y){
   *y = floor(sqrt(*x - 1.0));
   return;
}

and

pure subroutine my_int_root(x,y) bind(c, name="int_sqrt_fort")
   USE, INTRINSIC :: ISO_C_BINDING
   implicit none
   integer(kind=c_int), intent(in)  :: x
   integer(kind=c_int), intent(out) :: y

   y = floor(sqrt(x - 1d0))
end subroutine my_int_root

Then this is where they start to diverge:

--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_int_c/8          3.05 ns         3.05 ns    229239198
bm_int_fort/8       2.13 ns         2.13 ns    328933185

The Fortran code does not seem significantly slower after this change, but the C++ code slowed down by about 50%, which seems quite large. These are the assemblies:

Fortran: https://godbolt.org/z/axqqrc5E1
C++: https://godbolt.org/z/h7K75oKbn

The Fortran assembly seems pretty straightforward: it just adds conversions between double and int and not much else. The C++ assembly, however, does a lot more, which I don't fully understand.

Why is the C++ assembly so much more complicated? How could I improve the C++ code to achieve matching performance?

  • You're hobbled by bad defaults and compatibility with obsolete machines: Bad defaults are gcc setting `errno` for floating-point computations (despite this not being required by the C language), and compatibility with x86 machines that don't have any better SSE instructions than SSE2. If you want decent code generation, add `-fno-math-errno -msse4` to the [compiler flags](https://godbolt.org/z/rhjsr6qsM). – EOF Apr 11 '21 at 15:38
  • This works almost perfectly: bm_int_c/8 2.08 ns; bm_int_fort/8 2.09 ns. If you write an answer, I'll accept it. – Stein Apr 11 '21 at 15:48

2 Answers


TL;DR: You're hobbled by bad defaults and compatibility with obsolete machines: Bad defaults are gcc setting errno for floating-point computations (despite this not being required by the C language), and compatibility with x86 machines that don't have any better SSE instructions than SSE2. If you want decent code generation, add -fno-math-errno -msse4 to the compiler flags.

Modern processor architectures that contain floating-point hardware typically offer square root calculation as a primitive operation (instruction), which sets an error flag in the floating-point environment if the operand of the square root instruction was out of range (negative). On the other hand, old instruction set architectures may not have had floating-point instructions, or may not have had hardware-accelerated square root instructions, so the C language permits an implementation to set errno on an out-of-range argument instead. But errno being a thread-local memory location practically prevents any sane architecture from setting errno directly from the square root instruction.

To get decent performance, gcc inlines the square root calculation as the hardware instruction (sqrtsd), but to set errno it separately checks the sign of the argument and calls the library function only if the argument was negative, so the library function can set errno. Yes, this is crazy, but that in turn is par for the course. You can avoid this braindamage that nobody ever needs or wants by setting -fno-math-errno in the compiler flags.
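
In source-level terms, the code gcc generates for sqrt() with math-errno enabled behaves roughly like the following sketch (illustrative only, not the literal codegen; __builtin_sqrt is the GCC builtin that maps directly to the hardware instruction):

#include <cmath>

// Rough sketch of the shape of gcc's math-errno code for sqrt():
// inline the hardware instruction, and make the out-of-line libm call
// only for negative inputs, purely so that libm can set errno = EDOM.
double sqrt_with_errno_sketch(double v) {
   double r = __builtin_sqrt(v);   // single sqrtsd instruction
   if (v < 0.0)                    // the extra sign check gcc inserts
      r = std::sqrt(v);            // library call whose only job here is errno
   return r;
}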

Reasonably recent x86-64 processors have more instructions than were present in the original x86-64 as first developed by AMD (which included only SSE2 vector/floating-point instructions). Among the added instructions are float/integer conversion instructions that allow controlled rounding/truncation, so this doesn't have to be implemented in software. You can get gcc to use these new instructions by specifying a target that supports these instructions, for example by using the -msse4 compiler flag. Note that this will cause the generated program to fault if it is run on a target that doesn't support these instructions, so the generated program will be less portable (though it doesn't reduce portability of the source code obviously).
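
For example, the SSE4.1 roundsd instruction (which is what ends up implementing floor() here) is also reachable directly through an intrinsic. A minimal sketch, just to show the single-instruction version explicitly (the function name is for illustration, and it needs -msse4.1 or higher to compile):

#include <smmintrin.h>   // SSE4.1 intrinsics

// floor() of a scalar double via a single roundsd instruction,
// i.e. what the compiler can emit on its own once -msse4 is enabled.
double floor_sse41(double v) {
   __m128d x = _mm_set_sd(v);
   return _mm_cvtsd_f64(_mm_floor_sd(x, x));
}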

EOF
  • Great answer, thank you. May I ask if `-msse4` is correctly set when using `-xHost` (and an appropriate architecture)? – mcocdawc Apr 11 '21 at 21:27
  • @mcocdawc: you mean with Intel compilers? Yes. With gcc/clang, `-march=native` (or `-march=znver2` or `-march=skylake` or whatever) also includes ISA extensions and tuning settings. – Peter Cordes Apr 11 '21 at 21:35
  • @EOF: gfortran manages to optimize `(int)floor()` without needing SSE4 roundsd; it's using truncating conversion from double->int, with an extra `-1` for negative results. (Which are impossible for the result of sqrt, so it really could have just used `cvttsd2si` because trunc = floor for non-negative results, and it's ill-defined for NaN.) `gcc` could have done the same thing as `gfortran` with just `-fno-math-errno`, but didn't :/ The `roundsd` with `-msse4` could ideally be optimized away. – Peter Cordes Apr 11 '21 at 21:39
  • Of course, writing the source in a smarter way, as just `(int)sqrt(...)`, leads to the optimization we want: https://godbolt.org/z/xhjcdrnPM. C double->int conversion truncates, which is the same as floor for a non-negative sqrt result. So we can work around GCC's failure to optimize away floor for a non-negative value-range. (With equivalent results for NaN, Inf, and out-of-range doubles, I think.) – Peter Cordes Apr 11 '21 at 21:44
  • @PeterCordes Yes, I consider it a bit sad that `gcc` doesn't seem to propagate known non-negativeness of operands. I found that it also doesn't omit the check for negative argument to `sqrt()` if the argument is `unsigned` either. However, answering a question "What do I need to change in my code to get optimal assembly?" with "Nothing. Just fix the compiler settings." appeals more to me than manually doing strength reduction that the compiler failed at. – EOF Apr 11 '21 at 22:22
  • Agreed, everyone should always use `-fno-math-errno` for this and other reasons. And `-march=native` or at least SSE4.1 is good if you can use it. Still, worth pointing out that just `sqrt()` with implicit truncation makes even better code than you can get with compiler options, for both C and Fortran. – Peter Cordes Apr 11 '21 at 22:34

As @EOF explained, always use -fno-math-errno, and use -march=native if you can, or at least -msse4 if you don't need the binary to run on old machines like first-gen Core 2 or AMD Phenom II. (Preferably also other good stuff like popcnt and SSE4.2, e.g. -march=nehalem -mtune=haswell.)

And preferably -O3 to enable SIMD auto-vectorization when possible, although that's sometimes not possible for floating point, e.g. reductions like dot product or sum of an array, because FP math isn't quite associative.

(-ffast-math will let compilers pretend it is, or you can use #pragma omp simd reduction(+:my_var_name) to give it license to rearrange order of operations in just one loop. But that's not safe in general, e.g. if your code does stuff like Kahan summation to compensate for FP rounding error, and enables other optimizations like treating denormals as 0.)
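
A minimal sketch of that per-loop reduction pragma on a plain sum (compile with -fopenmp or -fopenmp-simd for the pragma to take effect; the function name is just for illustration):

// Gives the compiler license to reorder the FP additions in this one loop,
// enabling a vectorized reduction without turning on -ffast-math globally.
double sum_array(const double *a, int n) {
   double total = 0.0;
   #pragma omp simd reduction(+:total)
   for (int i = 0; i < n; ++i)
      total += a[i];
   return total;
}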


You can get even better asm out of gfortran / gcc by omitting the floor(). Neither seems to realize that any result of sqrt will be non-negative or NaN, and thus floor(sqrt) = trunc(sqrt).

(Related: if you want to round a double to the nearest int, rather than floor or trunc, use lrint(x) or (int)nearbyint(x), which can inline to an instruction that uses the current FP rounding mode, which (unless you changed it) will be round-to-nearest (even as tiebreak). See this answer for how it compiles for x86-64 and AArch64.)
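
A minimal sketch of that alternative (the function name is just for illustration):

#include <cmath>

// Round to nearest (current FP rounding mode, normally nearest-even)
// instead of truncating; with -fno-math-errno this can inline to a
// single cvtsd2si.
int nearest_int(double v) {
   return static_cast<int>(std::lrint(v));
}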


GCC's back-end doesn't optimize away floor (with either front-end language); -msse4 just makes it cheap enough that the throughput bottleneck of sqrtsd hides its cost in your C benchmark. Specifically, we get

# gcc -O2 -fno-math-errno -msse4   with floor() still in the source.
        sqrtsd  xmm0, xmm0
        roundsd xmm0, xmm0, 9              # floor separately, result as a double
        cvttsd2si       eax, xmm0          # convert (with Truncation) to signed int

Even without SSE4, GFortran chooses to use a flooring conversion to int, which only has to work for the range of int, not for doubles outside that range (handling the full double range is what GCC's code was doing manually without -msse4):

# GFortran -O2 -msse4      # with floor()   # chooses not to use SSE4
        sqrtsd  xmm0, xmm0
        cvttsd2si       eax, xmm0          # eax = int trunc(sqrt result)
        cvtsi2sd        xmm1, eax          # convert back to double
        ucomisd xmm0, xmm1                 # compare, setting CF if truncation rounded *up*
        sbb     eax, 0                     # eax -= CF

Fun fact: this code avoids checking for "unordered" (PF=1) in the ucomisd FLAGS result. CF = 1 for that case, but apparently gfortran doesn't care that it will make the result 0x7FFFFFFF instead of 0x80000000.
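
In source terms, the cvttsd2si / cvtsi2sd / ucomisd / sbb sequence computes something like this sketch (ignoring the NaN corner case just mentioned):

// Flooring conversion the way gfortran does it: truncate toward zero,
// then subtract 1 if the truncation rounded up (i.e. the input was a
// negative non-integer).
int floor_to_int_sketch(double d) {
   int i = static_cast<int>(d);      // cvttsd2si
   if (static_cast<double>(i) > d)   // cvtsi2sd + ucomisd: did we round up?
      --i;                           // sbb eax, 0
   return i;
}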

gcc could have done this for C, since the behaviour is undefined if the double->int conversion result doesn't fit in an int. (Fun fact, negative double -> unsigned is also UB for that reason; the modular range-reduction rule is only for wide integral types->unsigned.) So this is a gcc missed optimization bug for C without SSE4.

When SSE4 is available, it's better to use roundsd before conversion (in the general case where floor can't just be optimized away). Round-trip conversion to integer and back is 12 cycle latency on Skylake (so maybe an extra 7 cycles vs. just one convert), plus the ucomisd + sbb latency, vs. 8 cycles for roundsd. And roundsd is fewer total uops (https://uops.info). So this is a missed optimization for gfortran -msse4, which continues to use its double->int->double compare / sbb trick for floor-conversion.


Optimizing the source: remove the floor

C (and Fortran) FP -> int conversion truncates (rounds towards 0.0) by default. For non-negative values, this is equivalent to floor, so the floor is redundant. Since compilers don't realize that and/or don't take advantage of the fact that a sqrt result is non-negative (or NaN), we can remove it ourselves:

#include <math.h>
int int_sqrt_c(int x){
   return sqrt(x - 1.0);
   //return floor(sqrt(x - 1.0));
}

I also simplified your C to take/return an int, instead of via pointers. https://godbolt.org/z/4zWeaej9T

int_sqrt_c(int):
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, edi             # int -> double
        subsd   xmm0, QWORD PTR .LC0[rip]
        sqrtsd  xmm0, xmm0
        cvttsd2si       eax, xmm0             # truncating double -> int
        ret

Exact same difference for the Fortran code, https://godbolt.org/z/EhbTjhGen - removing floor removes the useless instructions, leaving just cvttsd2si.

roundsd costs 2 uops and has 8 cycle latency on Skylake (like a chain of two additions) https://uops.info/, so avoiding it takes out a decent fraction of the total latency from x -> retval, and of the front-end throughput cost.

(Your throughput benchmark will bottleneck on the back-end throughput of sqrtsd, e.g. one per 4.5 cycles on Skylake, and OoO exec hides the latency, so your current test setup probably won't be able to measure the difference.)
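
If you wanted to measure the latency difference instead, one option is a serial benchmark where each result feeds the next input. A minimal sketch (the function body is copied from the question; the benchmark name and the feedback formula are made up for illustration):

#include <benchmark/benchmark.h>
#include <cmath>

static void int_sqrt_c(int *x, int *y) {      // copied from the question
   *y = std::floor(std::sqrt(*x - 1.0));
}

// Latency-bound variant: the next input depends on the previous result,
// so out-of-order execution can't overlap independent iterations.
static void bm_int_c_latency(benchmark::State &state) {
   int x = static_cast<int>(state.range(0));
   int y = 0;
   for (auto _ : state) {
      int_sqrt_c(&x, &y);
      x = y + static_cast<int>(state.range(0));   // serialize the dependency chain
      benchmark::DoNotOptimize(x);
   }
}
BENCHMARK(bm_int_c_latency)->Arg(8);
BENCHMARK_MAIN();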

Peter Cordes
  • When exactly is it safe to enable `-fno-math-errno`? When you don't intend to use `errno` and you have tested the program? – northerner Apr 13 '21 at 10:19
  • @northerner: No, it won't make your code break if you `sqrt(-1.0)`. It won't waste time setting errno but you'll still get a NaN, and set the sticky FP-exception bits which you can check with `fenv.h`, because the hardware `sqrtsd` instruction already does that. See [this answer](https://stackoverflow.com/questions/57673825/how-to-force-gcc-to-assume-that-a-floating-point-expression-is-non-negative/57674631#57674631) for more details about why everyone should use `-fno-math-errno`. – Peter Cordes Apr 13 '21 at 19:55
  • @northerner: The only time it would cause a problem is old legacy code that wants to actually read `errno` after using some math library functions like `sqrt` or `log`, instead of using `fenv`, the normal way to see if any FP exceptions happened. **Some C implementations don't bother to implement math-function errno-setting, so it's not something that portable code can rely on anyway.** (math-errno is basically a bad early idea that C dropped early on when fenv came along). You can still use errno normally to get error status from *other* functions like `open`. – Peter Cordes Apr 13 '21 at 19:57