Dramatic slowdown of a function with AVX2

Question

I noticed that Clang (10.0) and MSVC (16.7) generate assembly with dramatically different performance (~3.3 ns for Clang, ~8 ns for MSVC) from the same piece of C++ code that uses AVX2 intrinsics.

The code computes the largest power of 10 that divides a given integer (passed in rcx), along with the quotient left after dividing by that power.

In other words, it counts the trailing zeros of the given integer in base 10 and also computes the number you obtain by removing those trailing zeros. For example, when the input is 63700, the output is {637, 2}, and when the input is 1230000, the output is {123, 4}.
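As a plain reference for the behavior described above, here is a minimal scalar sketch. It is not the code from the question: the names `RemovedZeros` and `remove_trailing_zeros_naive` are made up for illustration, and returning {0, 0} for an input of zero is a choice made only for this sketch.

```cpp
#include <cstdint>

// Hypothetical result type: the quotient and the number of removed zeros.
struct RemovedZeros {
    std::uint32_t quotient;
    int exponent;
};

// Naive reference: keep dividing by 10 while the remainder is zero.
// Zero is returned unchanged with exponent 0 (an assumption of this sketch).
inline RemovedZeros remove_trailing_zeros_naive(std::uint32_t n) {
    RemovedZeros r{n, 0};
    while (r.quotient != 0 && r.quotient % 10 == 0) {
        r.quotient /= 10;
        ++r.exponent;
    }
    return r;
}
```

Calling `remove_trailing_zeros_naive(63700)` gives quotient 637 and exponent 2, matching the example above. A loop like this is handy as a correctness baseline when testing a faster version.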

Here is a godbolt link that can reproduce the code: https://godbolt.org/z/5xoK3G

On line 231, you can choose between the version that uses a lambda (which generates a vmovups that reloads two scalar stores and then copies the result) and the version that does not use a lambda (which does not generate those vmovups).

To figure out why, I ran several tests. It turned out that the difference comes from how the result is written back: the MSVC version uses two vmovups (one load, one store), while Clang uses two mov instructions (both stores).

To confirm this hypothesis, I tried several things to force MSVC to generate mov instead of vmovups, and the resulting code performed almost the same as the code generated by Clang.

The question is: why does using vmovups (or other similar vector-move instructions) degrade the performance so much in this case?

I attach two assembly outputs generated by MSVC, one with vmovups (thus performing ~8ns) and one without vmovups (thus performing ~3.3ns):

With vmovups (this version also adjusts the stack pointer with sub/add; is that necessary?):

00007FF7227114B0  sub         rsp,18h  
00007FF7227114B4  mov         r8,rcx  
00007FF7227114B7  lea         rcx,[divtest_table_holder<unsigned int,5,9>::table (07FF7227133B0h)]  
00007FF7227114BE  vmovd       xmm0,r8d  
00007FF7227114C3  vpbroadcastd ymm0,xmm0  
00007FF7227114C8  vpmulld     ymm1,ymm0,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+4h (07FF7227133B4h)]  
00007FF7227114D1  vpminud     ymm0,ymm1,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+28h (07FF7227133D8h)]  
00007FF7227114DA  vpcmpeqd    ymm1,ymm0,ymm1  
00007FF7227114DE  vpmovmskb   eax,ymm1  
00007FF7227114E2  popcnt      edx,eax  
00007FF7227114E6  tzcnt       eax,r8d  
00007FF7227114EB  shr         edx,2  
00007FF7227114EE  cmp         eax,edx  
00007FF7227114F0  cmovl       edx,eax  
00007FF7227114F3  movsxd      rax,edx  
00007FF7227114F6  mov         dword ptr [rsp+8],edx  
00007FF7227114FA  mov         ecx,dword ptr [rcx+rax*4]  
00007FF7227114FD  imul        rcx,r8  
00007FF722711501  mov         eax,edx  
00007FF722711503  shrx        rcx,rcx,rax  
00007FF722711508  mov         qword ptr [rsp],rcx  
00007FF72271150C  vmovups     xmm0,xmmword ptr [rsp]  
00007FF722711511  vmovups     xmmword ptr [rsp],xmm0  
00007FF722711516  vzeroupper  
00007FF722711519  add         rsp,18h  
00007FF72271151D  ret

Without vmovups:

00007FF7475414B0  mov         r8,rcx  
00007FF7475414B3  lea         rcx,[divtest_table_holder<unsigned int,5,9>::table (07FF7475433B0h)]  
00007FF7475414BA  vmovd       xmm0,r8d  
00007FF7475414BF  vpbroadcastd ymm0,xmm0  
00007FF7475414C4  vpmulld     ymm2,ymm0,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+4h (07FF7475433B4h)]  
00007FF7475414CD  vpminud     ymm1,ymm2,ymmword ptr [divtest_table_holder<unsigned int,5,9>::table+28h (07FF7475433D8h)]  
00007FF7475414D6  vpcmpeqd    ymm2,ymm1,ymm2  
00007FF7475414DA  vpmovmskb   eax,ymm2  
00007FF7475414DE  popcnt      edx,eax  
00007FF7475414E2  tzcnt       eax,r8d  
00007FF7475414E7  shr         edx,2  
00007FF7475414EA  cmp         eax,edx  
00007FF7475414EC  cmovl       edx,eax  
00007FF7475414EF  movsxd      rax,edx  
00007FF7475414F2  mov         ecx,dword ptr [rcx+rax*4]  
00007FF7475414F5  imul        rcx,r8  
00007FF7475414F9  mov         eax,edx  
00007FF7475414FB  shrx        rcx,rcx,rax  
00007FF747541500  mov         qword ptr [rsp+8],rcx  
00007FF747541505  mov         dword ptr [rsp+8],edx  
00007FF747541509  vzeroupper  
00007FF74754150C  ret
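Reading both listings, the vector part (vpmulld, vpminud, vpcmpeqd, then popcnt and shr) looks like a multiply-and-compare divisibility test against powers of 5, combined with tzcnt for the powers of 2. The table contents are not shown here, so that reading is an assumption. The scalar sketch below illustrates the general technique only; the names `QuotientAndExponent`, `inverse_mod_2_32`, and `remove_trailing_zeros_mulcmp` are hypothetical, and the inverses are computed on the fly rather than read from a table.

```cpp
#include <algorithm>
#include <cstdint>

struct QuotientAndExponent {
    std::uint32_t quotient;
    int exponent;
};

// Inverse of an odd d modulo 2^32 by Newton's iteration.
// Unsigned wraparound is intended throughout.
inline std::uint32_t inverse_mod_2_32(std::uint32_t d) {
    std::uint32_t x = d;  // d*d == 1 (mod 8), so x starts correct to 3 bits
    for (int i = 0; i < 4; ++i) {
        x *= 2u - d * x;  // each step doubles the number of correct low bits
    }
    return x;
}

inline QuotientAndExponent remove_trailing_zeros_mulcmp(std::uint32_t n) {
    if (n == 0) return {0, 0};  // choice made for this sketch

    // Number of factors of 2 (portable stand-in for tzcnt).
    int twos = 0;
    while (((n >> twos) & 1u) == 0) ++twos;

    // Number of factors of 5: for odd d, n is divisible by d exactly when
    // (n * inverse(d)) mod 2^32 <= (2^32 - 1) / d.
    int fives = 0;
    std::uint32_t pow5 = 1;
    for (int k = 1; k <= 13; ++k) {  // 5^13 is the largest power of 5 below 2^32
        pow5 *= 5;
        if (n * inverse_mod_2_32(pow5) > UINT32_MAX / pow5) break;
        fives = k;
    }

    // The power of 10 is limited by whichever prime factor runs out first.
    int k = std::min(twos, fives);
    std::uint32_t pow5k = 1;
    for (int i = 0; i < k; ++i) pow5k *= 5;

    // Multiplying by the inverse divides exactly by 5^k; the shift divides by 2^k.
    std::uint32_t quotient = (n * inverse_mod_2_32(pow5k)) >> k;
    return {quotient, k};
}
```

The vectorized code appears to run all the power-of-5 tests at once and count the passing lanes, which works because divisibility by 5^k implies divisibility by every smaller power; the loop with an early break above is the scalar equivalent.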

Also, as this is the first SIMD code I've ever written, I would appreciate any advice if there is anything I'm doing wrong.

Post the source so we can play with it on https://godbolt.org/ and maybe find a way to work around the MSVC missed optimization. You even tagged this [C++] but only posted asm. Also, what hardware are you testing on? — Peter Cordes, Aug 19 '20 at 01:44
Yeah, I'm having a hard time figuring out how to isolate the code from my project, but I think I can do it. I'll edit the question accordingly. The CPU I'm using is an Intel i7-7700HQ (Kaby Lake). — Junekey Jeon, Aug 19 '20 at 01:51
Also, can't you avoid `vpmulld` by factoring that into the table? Or better, `lzcnt rax, rcx` and then look up a value to compare against: every possible `floor(log2(x))` has at most 2 possible `floor(log10(x))` depending on x. So you just need a `pow10[lzc] + (x >= cutoff[lzc])` or something like that, just `lzcnt` => `mov`-load, memory-source `cmp` from the table, `setcc`, and `add`. See [performance of log10 function returning an int](https://stackoverflow.com/q/25892665) for an implementation. — Peter Cordes, Aug 19 '20 at 01:51
But anyway yes, looks like MSVC is causing a store-forwarding stall. `mov qword ptr [rsp],rcx` store reloaded by a wider `vmovups xmm0,xmmword ptr [rsp]` is just dumb; IDK why MSVC is doing that. It doesn't seem to make much sense; that stack memory is immediately deallocated at the end of the function. (And clang is just using the shadow space). So neither of these put the result anywhere the caller can get it. Did you abuse `volatile` to create a benchmark that doesn't optimize away or something? If so, how? — Peter Cordes, Aug 19 '20 at 01:56
Note that what I'm doing is to count trailing zeros in decimal and find the number you obtain by removing those zeros, e.g., obtaining {673,2} from 67300. I think you are talking about integer log10 implementation. — Junekey Jeon, Aug 19 '20 at 04:06
As for the stall, I think you are right that it's the cause of the slowdown. I did abuse `volatile` a bit, but that is not why the code using `vmovups` was generated. After doing some work on isolating the code, I figured out that `vmovups` is never generated if I don't enclose a certain block of code inside a lambda. Perhaps a bug related to how lambdas are processed, I guess? — Junekey Jeon, Aug 19 '20 at 04:10
Oh I see, that example made it clear what you meant by "factor out". I wasn't sure before and guessed wrong. You should edit your question to include that, and put it near the top so other readers wondering about the point of this code don't also have to go digging. That's something I want to know *before* I read the asm; it's easier to see what the asm is doing if you know the overall purpose. — Peter Cordes, Aug 19 '20 at 04:13
As for MSVC, I'd expect that the missed-optimization is still related to volatile, like a lambda might be fine without the volatile. IDK, MSVC is not the best compiler; I don't recommend it. Especially for intrinsics, unless you really really want a mostly literal mapping from intrinsics to asm instructions, without trying to optimize them. clang has a very good shuffle optimizer that can often spot easier ways to do things. — Peter Cordes, Aug 19 '20 at 04:15
Yeah, I mean, `volatile` is certainly part of the reason for the stupid code gen, but `vmovups` was generated even without it. So I guess it's more that MSVC is just not as good at intrinsics as clang. — Junekey Jeon, Aug 19 '20 at 04:26
@PeterCordes I edited the question to reflect your advice (thanks!) and also attached a link to a working code that you can play with. — Junekey Jeon, Aug 19 '20 at 04:58
I fixed your question for you to actually put the critical info about what it does at the *top*, where I suggested, not at the bottom in a section marked "edit". (Don't signal edits in the text, that's what the edit history / changelog message is for.) — Peter Cordes, Aug 19 '20 at 05:20
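
The comments above attribute the slowdown to a store-forwarding stall: narrow scalar stores followed right away by a wider reload of the same memory. The sketch below only names the two access patterns at the source level. The `Result` layout and both function names are assumptions made for illustration, and an optimizing compiler is free to turn both functions into the same machine code, so this is not a reproduction of the MSVC output.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte result layout: 8-byte quotient, 4-byte exponent, padding.
struct Result {
    std::uint64_t quotient;
    std::int32_t exponent;
};
static_assert(sizeof(Result) == 16, "sketch assumes a 16-byte result object");

// Resembles the slow listing: two narrow stores into a temporary,
// then one 16-byte copy of the whole temporary (a wide load plus a wide store).
inline void write_via_temporary(Result* out, std::uint64_t q, std::int32_t e) {
    Result tmp{};
    tmp.quotient = q;
    tmp.exponent = e;
    std::memcpy(out, &tmp, sizeof tmp);
}

// Resembles the fast listing: each field is stored directly, with no reload.
inline void write_fields_directly(Result* out, std::uint64_t q, std::int32_t e) {
    out->quotient = q;
    out->exponent = e;
}
```

Both functions leave the same values in `*out`; the difference is only in how the bytes get there. In the first pattern, the 16-byte load spans two separate pending stores, which is the case the comments describe as unable to be forwarded quickly.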
