I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance that is puzzling me.
As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
What I don't get is why the "Release" build, where the actual assembly being run is identical (I've stepped through it in the debugger to double-check), is consistently about 40 - 50 cycles slower. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question: why is the very first iteration typically in the thousands?!
I get that this stuff is very complex and subtly impacted by numerous factors, but everything that pops into my head as a potential cause here just doesn't make sense.
I've checked the MXCSR flags in both runs, and they are identical across builds too (the default value of 1F80h, which has all exceptions masked).
Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?
Assembly
_RDATA segment
pi real4 3.141592654
rf3 real4 0.1666666667
rf5 real4 0.008333333333
rf7 real4 0.0001984126984
_RDATA ends
_TEXT segment
; float CalcSin(float rads, int* cycles)
CalcSin PROC
; "leaf" function - doesn't use the stack or any non-volatile registers
mov r8, rdx ; Save the 'cycles' pointer into R8
rdtsc ; Get current CPU cycles in EDX:EAX
; lfence ; Ensure timer is taken before executing the below
mov ecx, eax ; Save the low 32 bits of the timer into ECX
movss xmm2, xmm0 ; Copy x
mulss xmm2, xmm2 ; x^2
movss xmm3, xmm0 ; Copy x
mulss xmm3, xmm2 ; x^3
movss xmm4, rf3 ; 1/3!
mulss xmm4, xmm3 ; x^3 / 3!
subss xmm0, xmm4 ; x - x^3 / 3!
mulss xmm3, xmm2 ; x^5
movss xmm4, rf5 ; 1/5!
mulss xmm4, xmm3 ; x^5 / 5!
addss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5!
mulss xmm3, xmm2 ; x^7
movss xmm4, rf7 ; 1/7!
mulss xmm4, xmm3 ; x^7 / 7!
subss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!
; lfence ; Ensure above completes before taking the timer again
rdtsc ; Get the timer now
sub eax, ecx ; Get the difference in cycles
mov dword ptr [r8], eax
ret
CalcSin ENDP
_TEXT ends
END
C++
#include <stdio.h>
#include <math.h>
#include <vector>
const float PI = 3.141592654f;
extern "C" float CalcSin(float rads, int* cycles);
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
return 0;
}
With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:
Sin(0.00314159) = 0.00314159. Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 18 cycles
C library = 1.00000000
With the exact same code in a "Release" build, I typically see the below:
Sin(0.00314159) = 0.00314159. Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 62 cycles
C library = 1.00000000
====UPDATE 1====
I changed the code to load the constants as immediates instead of referencing the .rdata segment, as Peter mentioned, and this got rid of the slow first iteration. I.e., I replaced the commented-out line with the two lines following it:
; movss xmm4, rf5 ; 1/5!
mov eax, 3C088889h ; 1/5! float representation
movd xmm4, eax
Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as Debug, while the rest were still slow. As printf isn't called until after the first calculation, I wondered if it had an impact. I changed the code to just store the results as the loop ran and print them once complete, and now Release is just as fast, i.e.:
Updated C++ code
extern "C" float CalcSin(float rads, int* cycles);
std::vector<float> values;
std::vector<int> rdtsc;
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
values.push_back(result);
rdtsc.push_back(cycles);
// printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
// printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
auto cycle_iter = rdtsc.begin();
auto value_iter = values.begin();
for (auto& input : inputs) {
printf("Sin(%.8f) = %.8f. Took %d cycles\n", input, *value_iter++, *cycle_iter++);
printf("C library = %.8f\n", sin(input));
}
return 0;
}
And now Release is pretty much identical to Debug, i.e. around 18 - 24 cycles consistently on each call.
I'm not sure what the printf call is doing in Release builds, or how it was linked/optimized with Release settings, but it's strange that it negatively impacted the identical, separately-assembled CalcSin calls the way it did.
Sin(0.00314159) = 0.00314159. Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 24 cycles
C library = 1.00000000
====UPDATE 2====
To rule out CPU frequency ramp-up/down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the "AI Suite" ASUS app for the motherboard that the CPU is at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)
After setting that... still no change.
The next thing that occurred to me is that with the printf I have a call out to the C runtime library between each loop iteration, and that CRT is a different DLL between Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter whether I use the debug or release versions, or whether I compile with optimizations or not: I consistently see around 24 cycles per call, i.e.
ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj
>app.exe
Sin(0.00314159) = 0.00314159. Took 24 cycles
Sin(1.56765473) = 0.99984086. Took 24 cycles
Sin(0.78539819) = 0.70710647. Took 24 cycles
Sin(0.00010000) = 0.00010000. Took 24 cycles
Sin(1.57079637) = 0.99984306. Took 24 cycles
So it's definitely something about calling out to the Release CRT DLLs that causes the slow-down. I'm still puzzled as to why, especially as the Debug build in VS also uses the CRT via DLLs.