Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)

Question

I'm trying to benchmark the fast inverse square root. The full code is here:

#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;
    
static void BM_FastInverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        // from wikipedia:
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;
        i  = 0x5f3759df - ( i >> 1 );
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );
        //  y  = y * ( threehalfs - ( x2 * y * y ) );
        
        float result = y;
        benchmark::DoNotOptimize(result);
    }
}


static void BM_InverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        float result = 1 / sqrt(number);
        benchmark::DoNotOptimize(result);
    } 
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);

and here is the code in quick-bench if you want to run it yourself.

Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.

Here is the relevant Assembly (taken from quick-bench).

With GCC:

               push   %rbp
               mov    %rdi,%rbp
               push   %rbx
               sub    $0x18,%rsp
               cmpb   $0x0,0x1a(%rdi)
               je     408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
               callq  40a770 <benchmark::State::StartKeepRunning()>
  408c84       add    $0x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
               jmpq   40aa20 <benchmark::State::FinishKeepRunning()>
               nopw   0x0(%rax,%rax,1)
  408c98       mov    0x10(%rdi),%rbx
               callq  40a770 <benchmark::State::StartKeepRunning()>
               test   %rbx,%rbx
               je     408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
               movss  0x1b386(%rip),%xmm4        # 424034 <_IO_stdin_used+0x34>
               movss  0x1b382(%rip),%xmm3        # 424038 <_IO_stdin_used+0x38>
               mov    $0x5f3759df,%edx
               nopl   0x0(%rax,%rax,1)
   408cc0      movss  0x237a8(%rip),%xmm0        # 42c470 <number>
               mov    %edx,%ecx
               movaps %xmm3,%xmm1
        2.91%  movss  %xmm0,0xc(%rsp)
               mulss  %xmm4,%xmm0
               mov    0xc(%rsp),%rax
        44.70% sar    %rax
        3.27%  sub    %eax,%ecx
        3.24%  movd   %ecx,%xmm2
        3.27%  mulss  %xmm2,%xmm0
        9.58%  mulss  %xmm2,%xmm0
        10.00% subss  %xmm0,%xmm1
        10.03% mulss  %xmm2,%xmm1
        9.64%  movss  %xmm1,0x8(%rsp)
        3.33%  sub    $0x1,%rbx
               jne    408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
               add    $0x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
  408d0a       jmpq   40aa20 <benchmark::State::FinishKeepRunning()>

With Clang:

           push   %rbp
           push   %r14
           push   %rbx
           sub    $0x10,%rsp
           mov    %rdi,%r14
           mov    0x1a(%rdi),%bpl
           mov    0x10(%rdi),%rbx
           call   213a80 <benchmark::State::StartKeepRunning()>
           test   %bpl,%bpl
           jne    212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           test   %rbx,%rbx
           je     212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           movss  -0xf12e(%rip),%xmm0        # 203cec <_IO_stdin_used+0x8>
           movss  -0xf13a(%rip),%xmm1        # 203ce8 <_IO_stdin_used+0x4>
           cs nopw 0x0(%rax,%rax,1)
           nopl   0x0(%rax)
 212e30 2.46%  movd   0x3c308(%rip),%xmm2        # 24f140 <number>
        4.83%  movd   %xmm2,%eax
        8.07%  mulss  %xmm0,%xmm2
        12.35% shr    %eax
        2.60%  mov    $0x5f3759df,%ecx
        5.15%  sub    %eax,%ecx
        8.02%  movd   %ecx,%xmm3
        11.53% mulss  %xmm3,%xmm2
        3.16%  mulss  %xmm3,%xmm2
        5.71%  addss  %xmm1,%xmm2
        8.19%  mulss  %xmm3,%xmm2
        16.44% movss  %xmm2,0xc(%rsp)
        11.50% add    $0xffffffffffffffff,%rbx
               jne    212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
 212e69        mov    %r14,%rdi
               call   213af0 <benchmark::State::FinishKeepRunning()>
               add    $0x10,%rsp
               pop    %rbx
               pop    %r14
               pop    %rbp
  212e79       ret

They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking about 45% of the samples? (But I think it's just mislabelled, and it's the mulss, mov, and sar that together take that share.) Anyway, I'm not familiar enough with Assembly to really tell what is causing such a huge performance difference.

Anyone know?

It looks like gcc is transferring `xmm0` to `eax` through memory, whereas clang does it via `movd`. It's hard to be sure because you didn't label the branch targets. — Raymond Chen, May 21 '22 at 18:31
@RaymondChen The full Assembly including branch targets can be seen in the Assembly tab in quick-bench. Sorry I wanted to copy the instruction addresses into my post too but they don't seem to be copyable from quick-bench. — bun9, May 21 '22 at 18:35
Questions should be self-contained, so copy anything necessary into the question. (This includes the original source code that produced the assembly.) There is a graveyard of questions that have become useless because they put important information in links to sites that no longer work. — Raymond Chen, May 21 '22 at 18:37
This invokes UB, so it is probably not reliable. [Replacing the aliasing with `memcpy` gives similar results in both compilers](https://quick-bench.com/q/vvGkoKA29KtA6Hgb2brocGx1LBI) — Lala5th, May 21 '22 at 18:38
Now that is worse. Don't put pictures in there. Just copy the relevant addresses that are jump targets in yourself. — Goswin von Brederlow, May 21 '22 at 18:41
You can't cast a float (32bit) to a long (64bit, unless you are on Windows, where it's 32bit) on x86_64. And those casts violate aliasing rules. So I cleaned up the code a bit: https://godbolt.org/z/6oM9MTsfM Looks to me the difference is that clang optimizes `x2 = number * 0.5F;` to `x2 = number * (-0.5F);` and that saves 2 opcodes and the use of xmm2. — Goswin von Brederlow, May 21 '22 at 19:03
@Lala5th The difference between your version and mine is that yours has an extra `mov 0x39b82(%rip),%eax`, mine has an extra `movss %xmm0,0xc(%rsp)` and `mov 0xc(%rsp),%rax`, and yours has `shr %eax` while mine is `sar %rax`. I'm surprised these can have such a big performance impact. — bun9, May 21 '22 at 19:15
@bun9 That is beside the point I am trying to make. You can't read the data of a `float` as a `long`, [even if they were the same size](https://eel.is/c++draft/basic.lval#11). As mentioned by Goswin, however, the fact that `sizeof(float)!=sizeof(long)` doesn't help either. See how the "fixed" (using `uint32_t`) code produces [similar output for both compilers](https://quick-bench.com/q/zECRXPU7H9tzAecMCQOhiPD0FBc). — Lala5th, May 21 '22 at 19:20
@Lala5th It was my understanding that UB basically means that the compiler can generate whatever assembly it wants and it may or may not be what you expected. So when comparing assembly, why does it matter that the original code had UB (I see now that it is)? It is already past that point. Hope what I said makes sense, I'm not sure if I phrased it well. — bun9, May 21 '22 at 19:25
@GoswinvonBrederlow Thanks. This is off-topic but is the line `fi.f = fi.f * ( threehalfs - ( x2 * fi.f * fi.f ) );` UB? I thought in C++, reading from a union member that was not last written to is UB. — bun9, May 21 '22 at 19:28
It used to be you couldn't read `fi.f` after writing `fi.i` without UB. But the standard now requires that in a union the `float` and `int` must occupy the same memory. So the bit pattern transfers between the 2. The meaning of the bit pattern when taken as `int` or `float` is implementation defined. The algorithm will only work on some architectures but it's no longer UB, just IDB. — Goswin von Brederlow, May 21 '22 at 19:41
Just FYI, [Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?](https://stackoverflow.com/q/71608126) - no, obsoleted by `rsqrtss` which you can use with or without a Newton iteration. — Peter Cordes, May 21 '22 at 21:10
@GoswinvonBrederlow: union type-punning is well-defined behaviour in ISO C99, but not ISO C++ last I heard. Did that change recently? It is well-defined in GNU C++ and MSVC, as extensions to the ISO C++ standard. `memcpy` is well-defined everywhere, and usually more readable. C++20 defines `std::bit_cast<int32_t>(myfloat)`, which is now the preferred way to type-pun in C++. But yeah, pointing a 64-bit `long` at a `float` is likely to be a showstopper for performance as well as correctness. — Peter Cordes, May 21 '22 at 21:16
@PeterCordes I thought it changed in c++ too. But maybe I remember that wrong. I just used it because I know then gcc/clang don't complain about bit casting. `std::bit_cast` would be the way to go with c++20. Isn't really relevant to the question, the difference in the asm output was. — Goswin von Brederlow, May 21 '22 at 21:32
Any update on how gcc/clang perform when you bit-cast the float to 32bit int? Is the remaining difference in the asm still showing a difference? — Goswin von Brederlow, May 21 '22 at 21:37
@GoswinvonBrederlow: Fixing the code on the quick-bench link in the question to use `int32_t` and `bit_cast` (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), yeah now BM_FastInverseSqrRoot is faster by a factor of 2 (than the sqrtss / divss throughput benchmark), with GCC, as expected. — Peter Cordes, May 21 '22 at 23:51

Peter Cordes · Accepted Answer · 2022-05-22T00:15:46.817

Just FYI, see "Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?": no, it is obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.

As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
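
As a minimal sketch of the memcpy-based fix mentioned above (not the exact code behind the quick-bench or Godbolt links), the punning can be written as follows. The function name fast_rsqrt is hypothetical, and the code assumes float is a 32-bit IEEE-754 type, as it is on x86-64.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: float is 32-bit IEEE-754 binary32, matching uint32_t in size.
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this trick needs a 32-bit float");

// Hypothetical name; same algorithm as in the question, minus the UB.
inline float fast_rsqrt(float number) {
    assert(number > 0.0f);  // the bit trick only makes sense for positive inputs

    const float x2 = number * 0.5f;

    std::uint32_t i;
    std::memcpy(&i, &number, sizeof i);  // well-defined type pun, same width as float
    i = 0x5f3759dfu - (i >> 1);          // magic-constant initial guess

    float y;
    std::memcpy(&y, &i, sizeof y);       // pun the bits back to float
    return y * (1.5f - x2 * y * y);      // one Newton-Raphson iteration
}
```

Because the integer is exactly as wide as the float, a compiler is free to keep the value in registers instead of bouncing it through the stack with mismatched store and load widths. With C++20 available, std::bit_cast<std::uint32_t>(number) should express the same thing.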

Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)

The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)

Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm0, %eax to just copy the register with an ALU uop, and the shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)

Bugfixed version: GCC and clang make similar asm

Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identically. For example, GCC loads number twice, once into an integer register and once into XMM0. Clang loads once and uses movd eax, xmm2 to get it.

On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), GCC's BM_FastInverseSqrRoot is now faster by a factor of 2 than the naive version, without -ffast-math.

And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >=0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check to maybe call the libm function. But that branch predicts perfectly so the loop should still easily just bottleneck on port 0 throughput (div/sqrt unit).
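
The overload resolution mentioned above can be checked at compile time. This is a small sketch using std::sqrt from <cmath> rather than the question's <math.h> spelling; the function name naive_rsqrt is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// std::sqrt has a float overload, so a float argument is not promoted to
// double and the compiler can use single-precision sqrtss.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) is expected to return float");

// Hypothetical name for the "naive" computation from the benchmark.
inline float naive_rsqrt(float x) {
    assert(x > 0.0f);
    return 1.0f / std::sqrt(x);  // sqrtss followed by divss without -ffast-math
}
```

In C, sqrt(number) would instead convert to double and back unless sqrtf is called explicitly.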

Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, which uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything like it. I guess one could use __attribute__((target("avx,fma"))).)

Quick-bench is now giving "Error or timeout" whether I use that or not, with a "Permission error mapping pages" message suggesting a smaller -m/--mmap_pages, so I can't test on that system.

rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than, or similar in speed to, Quake's fast invsqrt, but with about 23 bits of precision.
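
For illustration, here is a hand-written sketch of rsqrtss plus one Newton iteration using the SSE intrinsics from <xmmintrin.h>. It assumes an x86 target with SSE, the function name rsqrt_newton is hypothetical, and it is not necessarily the exact sequence a compiler emits at -Ofast.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE1: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32

// Hypothetical name. Assumes an x86 target with SSE1 (always present on x86-64).
inline float rsqrt_newton(float x) {
    assert(x > 0.0f);

    // Hardware approximation of 1/sqrt(x), good to roughly 12 bits.
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));

    // One Newton-Raphson step roughly doubles the number of correct bits.
    return y * (1.5f - 0.5f * x * y * y);
}
```

Dropping the Newton step leaves only the raw rsqrtss approximation, which may be enough when around 12 bits of precision is acceptable.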
