
I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance numbers that are puzzling me.

As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
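
For reference, the approximation evaluated below is the first four terms of the series: sin(x) ≈ x - x^3/3! + x^5/5! - x^7/7!, which is what the rf3/rf5/rf7 constants encode (1/3!, 1/5! and 1/7!).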

What I don't get is why a "Release" build, where the actual running assembly is identical (I've stepped through it in the debugger to double-check), consistently reports about 40-50 cycles more per call. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question, why is the very first iteration typically in the thousands!!

I get that this stuff is very complex and subtly impacted by numerous factors, but everything that pops into my head as a potential cause here just doesn't make sense.

I've checked the MXCSR flags in both runs, and these are identical across builds too (with the default value of 1F80h, which has all exceptions masked).

Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?

Assembly

_RDATA segment
    pi  real4 3.141592654
    rf3 real4 0.1666666667
    rf5 real4 0.008333333333
    rf7 real4 0.0001984126984
_RDATA ends


_TEXT segment

; float CalcSin(float rads, int* cycles)
CalcSin PROC
    ; "leaf" function - doesn't use the stack or any non-volatile registers
    mov r8, rdx                ; Save the 'cycles' pointer into R8
    rdtsc                      ; Get current CPU cycles in EDX:EAX
;    lfence                     ; Ensure timer is taken before executing the below
    mov ecx, eax               ; Save the low 32 bits of the timer into ECX

    movss xmm2, xmm0
    mulss xmm2, xmm2           ; x^2
    movss xmm3, xmm0

    mulss xmm3, xmm2           ; x^3
    movss xmm4, rf3            ; 1/3!
    mulss xmm4, xmm3           ; x^3 / 3!
    subss xmm0, xmm4           ; x - x^3 / 3!

    mulss xmm3, xmm2           ; x^5
    movss xmm4, rf5            ; 1/5!
    mulss xmm4, xmm3           ; x^5 / 5!
    addss xmm0, xmm4           ; x - x^3 / 3! + x^5 / 5!

    mulss xmm3, xmm2           ; x^7
    movss xmm4, rf7            ; 1/7!
    mulss xmm4, xmm3           ; x^7 / 7!
    subss xmm0, xmm4           ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!

;    lfence                     ; Ensure above completes before taking the timer again
    rdtsc                      ; Get the timer now
    sub eax, ecx               ; Get the difference in cycles
    mov dword ptr [r8], eax
    ret
CalcSin ENDP

_TEXT ends

END

C++

#include <stdio.h>
#include <math.h>
#include <vector>

const float PI = 3.141592654f;

extern "C" float CalcSin(float rads, int* cycles);

void DoCalcs(float rads) {
    int cycles;
    float result = CalcSin(rads, &cycles);
    printf("Sin(%.8f) = %.8f.  Took %d cycles\n", rads, result, cycles);
    printf("C library = %.8f\n", sin(rads));
}

int main(int argc, char* argv[]) {
    std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
    for (auto val : inputs) {
        DoCalcs(val);
    }
    return 0;
}

With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:

Sin(0.00314159) = 0.00314159.  Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086.  Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647.  Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000.  Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306.  Took 18 cycles
C library = 1.00000000

With the exact same code in a "Release" build, I typically see the below:

Sin(0.00314159) = 0.00314159.  Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086.  Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647.  Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000.  Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306.  Took 62 cycles
C library = 1.00000000

====UPDATE 1====

I changed the code to load the constants as immediates instead of referencing the .rdata segment, as Peter mentioned, and this got rid of the slow first iteration, i.e. I replaced the commented-out line with the two lines following it:

;    movss xmm4, rf5            ; 1/5!
    mov eax, 3C088889h         ; 1/5! float representation
    movd xmm4, eax
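
For reference, a quick way to get these hex bit patterns is a tiny standalone C++ program (a minimal sketch, separate from the test program) that type-puns each float through memcpy:

#include <cstdio>
#include <cstring>

int main() {
    // The three Taylor-series constants from the asm _RDATA segment
    const float constants[] = { 0.1666666667f, 0.008333333333f, 0.0001984126984f };
    for (float f : constants) {
        unsigned int bits;
        std::memcpy(&bits, &f, sizeof bits);  // bit-exact copy, no aliasing issues
        std::printf("%.13f -> %08Xh\n", f, bits);
    }
    return 0;
}

Running it confirms 1/5! encodes as 3C088889h, matching the immediate above.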

Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as Debug, while the rest were still slow. As printf isn't called until after the first calculation, I wondered if it had an impact. I changed the code to just store the results as they were produced and print them once complete, and now Release is just as fast. i.e.

Updated C++ code

extern "C" float CalcSin(float rads, int* cycles);

std::vector<float> values;
std::vector<int> rdtsc;

void DoCalcs(float rads) {
    int cycles;
    float result = CalcSin(rads, &cycles);
    values.push_back(result);
    rdtsc.push_back(cycles);
    // printf("Sin(%.8f) = %.8f.  Took %d cycles\n", rads, result, cycles);
    // printf("C library = %.8f\n", sin(rads));
}

int main(int argc, char* argv[]) {
    std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
    for (auto val : inputs) {
        DoCalcs(val);
    }

    auto cycle_iter = rdtsc.begin();
    auto value_iter = values.begin();
    for (auto& input : inputs) {
        printf("Sin(%.8f) = %.8f.  Took %d cycles\n", input, *value_iter++, *cycle_iter++);
        printf("C library = %.8f\n", sin(input));
    }
    return 0;
}

And now Release is pretty much identical to Debug, i.e. around 18-24 cycles consistently on each call.

I'm not sure what the printf call is doing in Release builds, or maybe it's the way things are linked/optimized with Release settings, but it's strange that it negatively impacted the identical, separate assembly calls the way it did.

Sin(0.00314159) = 0.00314159.  Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086.  Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647.  Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000.  Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306.  Took 24 cycles
C library = 1.00000000

====UPDATE 2====

To rule out CPU frequency ramp-up/down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the ASUS "AI Suite" app for the motherboard that the CPU is running at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)

After setting that... still no change.

The next thing that occurred to me is that with the printf I have a call out to the C runtime library on each loop iteration, and that's a different DLL between Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter whether I use the debug or release versions, or whether I compile with optimizations or not: I regularly see the 24 cycles per call, i.e.

ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj

>app.exe
Sin(0.00314159) = 0.00314159.  Took 24 cycles
Sin(1.56765473) = 0.99984086.  Took 24 cycles
Sin(0.78539819) = 0.70710647.  Took 24 cycles
Sin(0.00010000) = 0.00010000.  Took 24 cycles
Sin(1.57079637) = 0.99984306.  Took 24 cycles

So it's definitely something in calling out to the Release CRT DLLs causing the slow-down. I'm still puzzled as to why, especially as the Debug build in VS also uses the CRT via DLLs.

Bill Ticehurst
  • `printf` in a debug build might still be causing your clock speed to dip while your process sleeps waiting for the terminal to draw the output (if that's how Windows works?), but maybe it gives the clock time to jump back up before execution actually reaches a timed region in the next iteration? IDK how much slower debug libraries could be / how much more work they could plausibly do after a system call that would give the HW CPU governor time to jump back up. – Peter Cordes Dec 31 '19 at 03:10

1 Answer


You're timing in reference cycles with rdtsc, not core clock cycles. It's probably the same speed both times, in core clock cycles, but with the CPU running at different frequencies.
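
As a rough sanity check (an illustrative calculation, assuming the TSC on the asker's i9-9900K ticks at its 3.6GHz rated frequency, with an ~800MHz idle clock and ~5GHz single-core turbo): 18 reference cycles / 3.6GHz = 5ns, which is ~25 core cycles at 5GHz, while 70 reference cycles / 3.6GHz ≈ 19.4ns, which is ~16 core cycles at 800MHz. Either way the function itself takes a similar small number of core clock cycles.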

Probably a debug build gives the CPU time to ramp up to max turbo (more core cycles per reference cycle) before your function gets called, because the calling code compiles to slower asm. Especially with MSVC, a debug build adds extra stuff like poisoning the stack frame to catch use of uninitialized vars, plus overhead from incremental linking.

None of this slows down your hand-written function itself; it's just warm-up that you neglected to do manually in your microbenchmark.

See How to get the CPU cycle count in x86_64 from C++? for lots more details about RDTSC.

A factor of ~3 between the idle CPU clock and max turbo (or some intermediate clock) is very plausible for modern x86 CPUs. My i7-6700k idles at 0.8GHz, with a rated frequency of 4.0GHz and a max single-core turbo of 4.2GHz. But many laptop CPUs have a much lower non-turbo max (and might initially only ramp to the non-turbo clock, not max turbo right away, depending on the energy_performance_preference HW governor, or especially a software governor on older CPUs).
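
A minimal sketch of doing that warm-up manually (the volatile sink is just to stop the optimizer deleting the loop, per the comment discussion below; __rdtsc comes from MSVC's <intrin.h>):

#include <intrin.h>  // __rdtsc (MSVC)
#include <cstdio>

// Spin long enough for the hardware governor to ramp the core up to max
// turbo before entering any timed region.
static void WarmUpCpu() {
    volatile int sink = 0;  // volatile: the loop can't be optimized away
    for (int i = 0; i < 100000000; ++i)
        sink = sink + i;
}

int main() {
    WarmUpCpu();
    unsigned __int64 start = __rdtsc();
    // ... timed work goes here ...
    unsigned __int64 elapsed = __rdtsc() - start;  // still reference cycles!
    std::printf("%llu reference cycles\n", elapsed);
    return 0;
}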

As a bonus question, why is the very first iteration typically in the thousands!!

Probably dTLB miss and cache miss for loading rf3 from data memory. You could try loading those from C (by declaring extern volatile float rf3) to prime the TLB + cache for that block of constants, assuming they're all in the same cache line.
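
A minimal sketch of that priming idea (hypothetical: it assumes the labels are exported from the asm module, e.g. with a "PUBLIC rf3, rf5, rf7" directive, so the C++ side can reference them):

// Hypothetical: requires "PUBLIC rf3, rf5, rf7" in the asm file.
extern "C" volatile float rf3, rf5, rf7;

static void PrimeConstants() {
    // One read of each constant warms the dTLB entry and the cache line
    // holding them (assuming they share a line); volatile forces real loads.
    float sink = rf3 + rf5 + rf7;
    (void)sink;
}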

Possibly also an I-cache miss after the rdtsc, but the first load is probably before the end of an I-cache line so those could happen in parallel. (Putting the rdtsc inside your asm function means we probably aren't waiting for an iTLB miss or i-cache miss inside the timed region to even fetch the first byte of the function).


Code review:

Don't use movss between XMM registers unless you want to blend the low 4 bytes into the old value of the destination. Use movaps xmm2, xmm0 to copy the whole register; it's much more efficient.

movaps can be handled by register renaming without needing any back-end execution unit, vs. movss only running on one execution unit in Intel CPUs, port 5. https://agner.org/optimize/. Also, movaps avoids a false dependency on the old value of the register because it overwrites the full reg, allowing out-of-order exec to work properly.

movss xmm, [mem] is fine, though: as a load it zero-extends into the full register.

Peter Cordes
  • Thanks! The memory load makes sense. To test, I changed it to load the float as a constant into the XMM register via EAX, and the overhead in the first iteration disappeared. I was aware the rdtsc value was in wall-clock time, but hadn't considered the CPU speed variations. I added a million pointless calculations before calling my code to ensure it had warmed up, and the strangest thing happened. Debug build stayed identical at 18 to 20 cycles per call, and release stayed identical at 64 - 70 per call **except** the first call, which now matches the debug speed of 18 - 20 cycles. Weird – Bill Ticehurst Dec 29 '19 at 22:26
  • To clarify... the CPU warm-up doesn't matter above. If I remove that "million calculations" code I added, it's still now 18-20 cycles first iteration, then back to 62 - 70 cycles for the next 4 iterations in Release builds. – Bill Ticehurst Dec 29 '19 at 22:33
  • @BillTicehurst: are you sure the release build doesn't optimize away your empty loop? Perhaps increment a `volatile int` inside it. Or maybe your CPU clock speed is dipping when your program uses `printf` and has to wait for the I/O to complete. Maybe your OS sees that as a sleep and lets the CPU clock go down, or migrates it to another core? **What exact CPU and OS?** 1 million iterations could still only be a fraction of a millisecond if not optimized away, with the loop counter in a register so it can run 1 cycle per iteration. – Peter Cordes Dec 30 '19 at 03:57
  • See "Update 2" just posted above. FWIW: My loop was summing calculations via values fetched using "rand()" and printing to the console, so shouldn't be optimized away in any form. – Bill Ticehurst Dec 30 '19 at 21:33