I noticed that Clang (10.0) and MSVC (16.7) generate assembly with dramatically different performance (~3.3ns for Clang, ~8ns for MSVC) from the same piece of C++ code with some AVX2 intrinsics.
The code computes the maximum exponent of 10 that can be factored out of a given integer (passed in rcx), along with the quotient after factoring out that maximum power of 10.
To clarify what the code does: it counts the (base-10) trailing zeros of the given integer and also computes the number obtained by removing those trailing zeros. For example, when the input is 63700, the output is {637, 2}, and when the input is 1230000, the output is {123, 4}.
Here is a godbolt link that reproduces the issue: https://godbolt.org/z/5xoK3G
In line 231, you can pick either the version that uses a lambda (which generates a vmovups that reloads the two scalar stores and copies them back) or the one that doesn't use a lambda (which does not generate those vmovups instructions).
To figure out why, I ran several tests. It turned out the difference comes from the fact that the MSVC version uses two vmovups instructions (one load, one store) to write the result back, while Clang uses two mov stores instead. To confirm this hypothesis, I tried several ways of forcing MSVC to generate mov instead of vmovups, and the resulting code performed almost the same as the code generated by Clang.
The question is: why does the use of vmovups (or other similar vector move instructions) degrade performance so much in this case?
I attach the two assembly outputs generated by MSVC, one with vmovups (thus running at ~8ns) and one without vmovups (thus running at ~3.3ns):
With vmovups: (this version also adjusts the stack pointer with sub/add; is that necessary?)
00007FF7227114B0 sub rsp,18h
00007FF7227114B4 mov r8,rcx
00007FF7227114B7 lea rcx,[divtest_table_holder<unsigned int,5,9>::table (07FF7227133B0h)]
00007FF7227114BE vmovd xmm0,r8d
00007FF7227114C3 vpbroadcastd ymm0,xmm0
00007FF7227114C8 vpmulld ymm1,ymm0,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+4h (07FF7227133B4h)]
00007FF7227114D1 vpminud ymm0,ymm1,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+28h (07FF7227133D8h)]
00007FF7227114DA vpcmpeqd ymm1,ymm0,ymm1
00007FF7227114DE vpmovmskb eax,ymm1
00007FF7227114E2 popcnt edx,eax
00007FF7227114E6 tzcnt eax,r8d
00007FF7227114EB shr edx,2
00007FF7227114EE cmp eax,edx
00007FF7227114F0 cmovl edx,eax
00007FF7227114F3 movsxd rax,edx
00007FF7227114F6 mov dword ptr [rsp+8],edx
00007FF7227114FA mov ecx,dword ptr [rcx+rax*4]
00007FF7227114FD imul rcx,r8
00007FF722711501 mov eax,edx
00007FF722711503 shrx rcx,rcx,rax
00007FF722711508 mov qword ptr [rsp],rcx
00007FF72271150C vmovups xmm0,xmmword ptr [rsp]
00007FF722711511 vmovups xmmword ptr [rsp],xmm0
00007FF722711516 vzeroupper
00007FF722711519 add rsp,18h
00007FF72271151D ret
Without vmovups:
00007FF7475414B0 mov r8,rcx
00007FF7475414B3 lea rcx,[divtest_table_holder<unsigned int,5,9>::table (07FF7475433B0h)]
00007FF7475414BA vmovd xmm0,r8d
00007FF7475414BF vpbroadcastd ymm0,xmm0
00007FF7475414C4 vpmulld ymm2,ymm0,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+4h (07FF7475433B4h)]
00007FF7475414CD vpminud ymm1,ymm2,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+28h (07FF7475433D8h)]
00007FF7475414D6 vpcmpeqd ymm2,ymm1,ymm2
00007FF7475414DA vpmovmskb eax,ymm2
00007FF7475414DE popcnt edx,eax
00007FF7475414E2 tzcnt eax,r8d
00007FF7475414E7 shr edx,2
00007FF7475414EA cmp eax,edx
00007FF7475414EC cmovl edx,eax
00007FF7475414EF movsxd rax,edx
00007FF7475414F2 mov ecx,dword ptr [rcx+rax*4]
00007FF7475414F5 imul rcx,r8
00007FF7475414F9 mov eax,edx
00007FF7475414FB shrx rcx,rcx,rax
00007FF747541500 mov qword ptr [rsp+8],rcx
00007FF747541505 mov dword ptr [rsp+8],edx
00007FF747541509 vzeroupper
00007FF74754150C ret
Also, since this is the first SIMD code I've ever written, I would appreciate any advice if there is anything I'm doing wrong.