Performance difference between Windows and Linux using Intel compiler: looking at the assembly

Question

I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) with the same options, and the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf which is resolved in the Intel math library for both cases (by default, it is linked dynamically on Windows and statically on Linux but using dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

#include <cmath>
#include <cstdio>

int main() {
  int n = 100000000;
  float sum = 1.0f;

  for (int k = 0; k < n; k++) {
    sum += std::erf(sum);
  }

  std::printf("%7.2f\n", sum);
}

When I profile this program using vTune, I find that the assembly is a bit different in between the Windows and the Linux version. Here is the call site (the loop) on Windows

Block 3:
"vmovaps xmm0, xmm6"
call 0x1400023e0 <erff>
Block 4:
inc ebx
"vaddss xmm6, xmm6, xmm0"
"cmp ebx, 0x5f5e100"
jl 0x14000103f <Block 3>

And the beginning of the erf function called on Windows

Block 1:
push rbp
"sub rsp, 0x40"
"lea rbp, ptr [rsp+0x20]"
"lea rcx, ptr [rip-0xa6c81]"
"movd edx, xmm0"
"movups xmmword ptr [rbp+0x10], xmm6"
"movss dword ptr [rbp+0x30], xmm0"
"mov eax, edx"
"and edx, 0x7fffffff"
"and eax, 0x80000000"
"add eax, 0x3f800000"
"mov dword ptr [rbp], eax"
"movss xmm6, dword ptr [rbp]"
"cmp edx, 0x7f800000"
...

On Linux, the code is a bit different. The call site is:

Block 3
"vmovaps %xmm1, %xmm0"
"vmovssl  %xmm1, (%rsp)"
callq  0x400bc0 <erff>
Block 4
inc %r12d
"vmovssl  (%rsp), %xmm1"
"vaddss %xmm0, %xmm1, %xmm1"   <-------- hotspot here
"cmp $0x5f5e100, %r12d"
jl 0x400b6b <Block 3>

and the beginning of the called function (erf) is:

"movd %xmm0, %edx"
"movssl  %xmm0, -0x10(%rsp)"   <-------- hotspot here
"mov %edx, %eax"
"and $0x7fffffff, %edx"
"and $0x80000000, %eax"
"add $0x3f800000, %eax"
"movl  %eax, -0x18(%rsp)"
"movssl  -0x18(%rsp), %xmm0"
"cmp $0x7f800000, %edx"
jnl 0x400dac <Block 8>
...

I have shown the 2 points where the time is lost on Linux.

Does anyone understand assembly enough to explain me the difference of the 2 codes and why the Linux version is 3 times slower?

Yes, same hardware. I have tested this case on a core i7 Haswell for both Windows and Linux, and on a Xeon Broadwell for both Windows and Linux. Same result. On the core i7 I have also tested it on macOS, and the speed is the same as on the Windows version. — InsideLoop, Nov 10 '16 at 08:30
A virtual machine on the core i7, and a dual boot on the Xeon. It makes no difference. — InsideLoop, Nov 10 '16 at 08:39
Are the results numerically identical? It might be that the Intel implementation is more accurate. Of course, determining that is non-trivial. — MSalters, Nov 10 '16 at 09:06
The Linux version is saving and later restoring xmm1 to / from ram in block 3 and block 4, but the Windows version is saving (and I assume later restoring, but it's not shown above) xmm6 to / from ram. — rcgldr, Nov 10 '16 at 09:11
I'm sure I'm not the only one who initially doubted that the bottleneck was the function call overhead instead of the computation of erf() itself (which is a very expensive function). The error function converges extremely quickly to `1.0` for large positive inputs. The `sum` variable gets large at a steady rate. So after a few iterations, the erf() function is "early-out'ing" and returning `1.0` when it sees a large input. So it's not doing any computation at all. — Mysticial, Nov 10 '16 at 17:24

score 42 · Accepted Answer · edited Nov 13 '16 at 06:44

42

In both cases the arguments and results are passed only in registers, as per the respective calling conventions on Windows and GNU/Linux.

In the GNU/Linux variant, the xmm1 is used for accumulating the sum. Since it's a call-clobbered register (a.k.a caller-saved) it's stored (and restored) in the stack frame of the caller on each call.

In the Windows variant, the xmm6 is used for accumulating the sum. This register is callee-saved in the Windows calling convention (but not in the GNU/Linux one).

So, in summary, the GNU/Linux version saves/restores both xmm0 (in the callee[1]) and xmm1 (in the caller), whereas the Windows version saves/restores only xmm6 (in the callee).

[1] need to look at std::errf to figure out why.

edited Nov 13 '16 at 06:44

Leon

31,443
4
72
97

answered Nov 10 '16 at 10:02

chill

16,470
2
40
44

Is the fact that the register is callee-saved something that is always followed on Windows and never on Linux? – InsideLoop Nov 10 '16 at 13:03
2

The compilers *always* respect the ABI, just different ABIs define the sets of caller- and callee- saved registers in different ways. – chill Nov 10 '16 at 14:08
10

Actually the ABI need only be respected for external calls where the compiler cannot see the definition. Otherwise (when it can see the definition of the callee) it can perform any transformation it likes that doesn't change the results of well-defined code, including inlining or use of a custom calling convention. – R.. GitHub STOP HELPING ICE Nov 10 '16 at 16:47
@R., indeed, for "non-exported" functions and when all call sites are known. – chill Nov 10 '16 at 18:19
6

@chill: It's not necessary that all call sites are known. The compiler can (and gcc does) emit multiple versions of a function when it's both externally reachable (not all call sites known) and used locally in a way that could benefit from a different calling convention (or inter-procedural constant propagation, etc.). – R.. GitHub STOP HELPING ICE Nov 10 '16 at 19:19
TL:DR: The x86 (32 and 64-bit) System V ABI doesn't have any call-preserved FP registers, but the Windows ABI does have some (but only XMM, not the YMM or ZMM upper halves). So on Linux, non-inlined function calls suck a lot for FP vars. – Peter Cordes Nov 11 '16 at 03:38
@R.. do you have an example for the case that gcc emits multiple versions of a function? – phuclv Feb 28 '17 at 14:45

rcgldr · Answer 2 · 2016-11-13T15:57:39.217

Using Visual Studio 2015, Win 7 64 bit mode, I find the following code for some of the paths used in erf() (not all paths shown). Each path involves up to 8 (maybe more for other paths) constants read from memory, so a single store / load to save a register seems unlikely to result in a 3x speed differential between Linux and Windows. As far for save / restores, this example saves and restores xmm6 and xmm7. As for the time, the program in the original post takes about 0.86 seconds on an Intel 3770K (3.5ghz cpu) (VS2015 / Win 7 64 bit). Update - I later determined the overhead for a save and restore of a xmm register is about 0.03 seconds in the case of the programs 10^8 loops (about 3 nanoseconds per loop).

000007FEEE25CF90  mov         rax,rsp  
000007FEEE25CF93  movss       dword ptr [rax+8],xmm0  
000007FEEE25CF98  sub         rsp,48h  
000007FEEE25CF9C  movaps      xmmword ptr [rax-18h],xmm6  
000007FEEE25CFA0  lea         rcx,[rax+8]  
000007FEEE25CFA4  movaps      xmmword ptr [rax-28h],xmm7  
000007FEEE25CFA8  movaps      xmm6,xmm0  
000007FEEE25CFAB  call        000007FEEE266370  
000007FEEE25CFB0  movsx       ecx,ax  
000007FEEE25CFB3  test        ecx,ecx  
000007FEEE25CFB5  je          000007FEEE25D0AF  
000007FEEE25CFBB  sub         ecx,1  
000007FEEE25CFBE  je          000007FEEE25D08F  
000007FEEE25CFC4  cmp         ecx,1  
000007FEEE25CFC7  je          000007FEEE25D0AF  
000007FEEE25CFCD  xorps       xmm7,xmm7  
000007FEEE25CFD0  movaps      xmm2,xmm6  
000007FEEE25CFD3  comiss      xmm7,xmm6  
000007FEEE25CFD6  jbe         000007FEEE25CFDF  
000007FEEE25CFD8  xorps       xmm2,xmmword ptr [7FEEE2991E0h]  
000007FEEE25CFDF  movss       xmm0,dword ptr [7FEEE298E50h]  
000007FEEE25CFE7  comiss      xmm0,xmm2  
000007FEEE25CFEA  jbe         000007FEEE25D053  
000007FEEE25CFEC  movaps      xmm2,xmm6  
000007FEEE25CFEF  mulss       xmm2,xmm6  
000007FEEE25CFF3  movaps      xmm0,xmm2  
000007FEEE25CFF6  movaps      xmm1,xmm2  
000007FEEE25CFF9  mulss       xmm0,dword ptr [7FEEE298B34h]  
000007FEEE25D001  mulss       xmm1,dword ptr [7FEEE298B5Ch]  
000007FEEE25D009  addss       xmm0,dword ptr [7FEEE298B8Ch]  
000007FEEE25D011  addss       xmm1,dword ptr [7FEEE298B9Ch]  
000007FEEE25D019  mulss       xmm0,xmm2  
000007FEEE25D01D  mulss       xmm1,xmm2  
000007FEEE25D021  addss       xmm0,dword ptr [7FEEE298BB8h]  
000007FEEE25D029  addss       xmm1,dword ptr [7FEEE298C88h]  
000007FEEE25D031  mulss       xmm0,xmm2  
000007FEEE25D035  mulss       xmm1,xmm2  
000007FEEE25D039  addss       xmm0,dword ptr [7FEEE298DC8h]  
000007FEEE25D041  addss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D049  divss       xmm0,xmm1  
000007FEEE25D04D  mulss       xmm0,xmm6  
000007FEEE25D051  jmp         000007FEEE25D0B2  
000007FEEE25D053  movss       xmm1,dword ptr [7FEEE299028h]  
000007FEEE25D05B  comiss      xmm1,xmm2  
000007FEEE25D05E  jbe         000007FEEE25D076  
000007FEEE25D060  movaps      xmm0,xmm2  
000007FEEE25D063  call        000007FEEE25CF04  
000007FEEE25D068  movss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D070  subss       xmm1,xmm0  
000007FEEE25D074  jmp         000007FEEE25D07E  
000007FEEE25D076  movss       xmm1,dword ptr [7FEEE298D8Ch]  
000007FEEE25D07E  comiss      xmm7,xmm6  
000007FEEE25D081  jbe         000007FEEE25D08A  
000007FEEE25D083  xorps       xmm1,xmmword ptr [7FEEE2991E0h]  
000007FEEE25D08A  movaps      xmm0,xmm1  
000007FEEE25D08D  jmp         000007FEEE25D0B2  
000007FEEE25D08F  mov         eax,8000h  
000007FEEE25D094  test        word ptr [rsp+52h],ax  
000007FEEE25D099  je          000007FEEE25D0A5  
000007FEEE25D09B  movss       xmm0,dword ptr [7FEEE2990DCh]  
000007FEEE25D0A3  jmp         000007FEEE25D0B2  
000007FEEE25D0A5  movss       xmm0,dword ptr [7FEEE298D8Ch]  
000007FEEE25D0AD  jmp         000007FEEE25D0B2  
000007FEEE25D0AF  movaps      xmm0,xmm6  
000007FEEE25D0B2  movaps      xmm6,xmmword ptr [rsp+30h]  
000007FEEE25D0B7  movaps      xmm7,xmmword ptr [rsp+20h]  
000007FEEE25D0BC  add         rsp,48h  
000007FEEE25D0C0  ret

*Each path involves up to 8 (maybe more for other paths) constants read from memory,* That only takes 4 cycles of throughput on modern CPUs (Intel SnB-family, or AMD k8 and later), and as for latency: out-of-order execution can overlap it with anything since the addresses are known way ahead of time. i.e. they can be done and ready by the time the register input to the instruction is ready, so they don't necessarily lengthen the dependency chain. I'd be much more worried about the mulss/addss chain! — Peter Cordes, Nov 11 '16 at 03:44
You're right that it looks weird. From the C, the OP's test function should just bottleneck on the latency of `erf()`, plus 3c for FP add (or 4 on SKL), and optionally + another 5 or 6 cycles for XMM spill/reload. I didn't read the asm carefully. Maybe the store/reload makes something else less efficient. — Peter Cordes, Nov 11 '16 at 03:49
@PeterCordes - follow up, I replaced erf with an assembly routine that just returns and one that stores / loads xmm0 and returns. The store / load of xmm0 overhead is 0.03 seconds with 10^8 loops, == 3 nano-seconds per store / load pair of instructions. Compare the .03 second store / load overhead to the 0.86 second total time using erf() (again 10^8 loops). — rcgldr, Nov 11 '16 at 05:48

Performance difference between Windows and Linux using Intel compiler: looking at the assembly

2 Answers2