This is one of those n00b questions where I'm doing something wrong but don't yet fully understand what.
The xxhash32 algorithm has a nice 16-byte inner loop that can be made faster with SIMD, so, as an exercise to myself, that is what I'm trying to do.
The body of the loop looks like this (numBytes is some multiple of 16):
// C# that gets auto-vectorized. uint4 is a vector of 4 elements
uint4 state = new uint4(Prime1 + Prime2, Prime2, 0, (uint)-Prime1) + seed;
int count = numBytes >> 4;
for (int i = 0; i < count; ++i) {
    state += *p++ * Prime2;
    state = (state << 13) | (state >> 19);
    state *= Prime1;
}
hash = rol(state.x, 1) + rol(state.y, 7) + rol(state.z, 12) + rol(state.w, 18);
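For reference, the per-lane operation this vectorizes is the standard xxHash32 round; a scalar C++ sketch of it (rotl32 and round32 are names I'm using just for this question, the constants are the published xxHash32 primes) would be:

#include <cstdint>

static const uint32_t kPrime1 = 2654435761u; // xxHash32 PRIME32_1
static const uint32_t kPrime2 = 2246822519u; // xxHash32 PRIME32_2

static inline uint32_t rotl32(uint32_t v, int n) {
    return (v << n) | (v >> (32 - n));
}

// One xxHash32 round: mix 4 bytes of input into one accumulator.
static inline uint32_t round32(uint32_t acc, uint32_t input) {
    acc += input * kPrime2;
    acc = rotl32(acc, 13);
    acc *= kPrime1;
    return acc;
}

The uint4/SSE versions just apply this round to four accumulators at once.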
I've translated this into the following SSE2/SSE4.1 intrinsics:
auto prime1 = _mm_set1_epi32(kPrime1);
auto prime2 = _mm_set1_epi32(kPrime2);
auto state = _mm_set_epi32(seed - kPrime1, seed, seed + kPrime2, seed + kPrime1 + kPrime2); // _mm_set_epi32 takes the highest lane first, so lane 0 is seed + kPrime1 + kPrime2
int32_t count = size >> 4; // size / 16
for (int32_t i = 0; i < count; i++) {
    state = _mm_add_epi32(state, _mm_mullo_epi32(_mm_loadu_si128(p128++), prime2));
    state = _mm_or_si128(_mm_sll_epi32(state, _mm_cvtsi32_si128(13)), _mm_srl_epi32(state, _mm_cvtsi32_si128(19)));
    state = _mm_mullo_epi32(state, prime1);
}
uint32_t temp[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(temp), state);
hash = _lrotl(temp[0], 1) + _lrotl(temp[1], 7) + _lrotl(temp[2], 12) + _lrotl(temp[3], 18);
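In case a complete snippet is useful, here is the whole thing as one self-contained function (the primes are the published xxHash32 constants; HashStripes and the scaffolding around the loop are names I made up just for this question):

#include <cstdint>
#include <stdlib.h>     // _lrotl (MSVC)
#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32

static const uint32_t kPrime1 = 2654435761u; // xxHash32 PRIME32_1
static const uint32_t kPrime2 = 2246822519u; // xxHash32 PRIME32_2

// Hashes `size` bytes (a multiple of 16) starting at `data`.
uint32_t HashStripes(const uint8_t* data, int32_t size, uint32_t seed) {
    auto prime1 = _mm_set1_epi32(kPrime1);
    auto prime2 = _mm_set1_epi32(kPrime2);
    // _mm_set_epi32 takes the highest lane first; lane 0 holds seed + kPrime1 + kPrime2.
    auto state = _mm_set_epi32(seed - kPrime1, seed, seed + kPrime2, seed + kPrime1 + kPrime2);
    auto p128 = reinterpret_cast<const __m128i*>(data);

    int32_t count = size >> 4; // size / 16
    for (int32_t i = 0; i < count; i++) {
        state = _mm_add_epi32(state, _mm_mullo_epi32(_mm_loadu_si128(p128++), prime2));
        state = _mm_or_si128(_mm_sll_epi32(state, _mm_cvtsi32_si128(13)), _mm_srl_epi32(state, _mm_cvtsi32_si128(19)));
        state = _mm_mullo_epi32(state, prime1);
    }

    uint32_t temp[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(temp), state);
    return _lrotl(temp[0], 1) + _lrotl(temp[1], 7) + _lrotl(temp[2], 12) + _lrotl(temp[3], 18);
}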
Here's the disassembly of the inner loop body:
mov rax,qword ptr [p128]
mov qword ptr [rsp+88h],rax
mov rax,qword ptr [rsp+88h]
movdqu xmm0,xmmword ptr [rax]
movdqa xmmword ptr [rsp+90h],xmm0
movdqa xmm0,xmmword ptr [rsp+90h]
movdqa xmmword ptr [rsp+120h],xmm0
mov rax,qword ptr [p128]
add rax,10h
mov qword ptr [p128],rax
movdqa xmm0,xmmword ptr [prime2]
movdqa xmmword ptr [rsp+140h],xmm0
movdqa xmm0,xmmword ptr [rsp+120h]
movdqa xmmword ptr [rsp+130h],xmm0
movdqa xmm0,xmmword ptr [rsp+130h]
pmulld xmm0,xmmword ptr [rsp+140h]
movdqa xmmword ptr [rsp+150h],xmm0
movdqa xmm0,xmmword ptr [rsp+150h]
movdqa xmmword ptr [rsp+160h],xmm0
movdqa xmm0,xmmword ptr [rsp+160h]
movdqa xmmword ptr [rsp+170h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+100h],xmm0
movdqa xmm0,xmmword ptr [rsp+100h]
paddd xmm0,xmmword ptr [rsp+170h]
movdqa xmmword ptr [rsp+180h],xmm0
movdqa xmm0,xmmword ptr [rsp+180h]
movdqa xmmword ptr [rsp+190h],xmm0
movdqa xmm0,xmmword ptr [rsp+190h]
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+1A0h],xmm0
mov eax,13h
movd xmm0,eax
movdqa xmmword ptr [rsp+1B0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1A0h]
psrld xmm0,xmmword ptr [rsp+1B0h]
movdqa xmmword ptr [rsp+1C0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1C0h]
movdqa xmmword ptr [rsp+200h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+1D0h],xmm0
mov eax,0Dh
movd xmm0,eax
movdqa xmmword ptr [rsp+1E0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1D0h]
pslld xmm0,xmmword ptr [rsp+1E0h]
movdqa xmmword ptr [rsp+1F0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1F0h]
movdqa xmmword ptr [rsp+210h],xmm0
movdqa xmm0,xmmword ptr [rsp+200h]
movdqa xmmword ptr [rsp+230h],xmm0
movdqa xmm0,xmmword ptr [rsp+210h]
movdqa xmmword ptr [rsp+220h],xmm0
movdqa xmm0,xmmword ptr [rsp+220h]
por xmm0,xmmword ptr [rsp+230h]
movdqa xmmword ptr [rsp+240h],xmm0
movdqa xmm0,xmmword ptr [rsp+240h]
movdqa xmmword ptr [rsp+250h],xmm0
movdqa xmm0,xmmword ptr [rsp+250h]
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmm0,xmmword ptr [prime1]
movdqa xmmword ptr [rsp+280h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+270h],xmm0
movdqa xmm0,xmmword ptr [rsp+270h]
pmulld xmm0,xmmword ptr [rsp+280h]
movdqa xmmword ptr [rsp+290h],xmm0
movdqa xmm0,xmmword ptr [rsp+290h]
movdqa xmmword ptr [rsp+2A0h],xmm0
movdqa xmm0,xmmword ptr [rsp+2A0h]
movdqa xmmword ptr [rsp+20h],xmm0
Some questions about the disassembly:
- Why so many movdqa instructions? I thought the point of intrinsics was that they map to specific hardware instructions.
- Why is only xmm0 used? It looks to me like data is being shuffled between memory and the vector pipeline at every step (I was expecting more of the xmmN registers to be used).
This is compiled with Visual C++ 2017; I haven't enabled any additional optimizations.
When I run these two snippets over a block of 64 MiB, many times over, the scalar code is about 3 times faster. This is not what I expected to happen; what have I missed?
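In case the measurement method matters, this is roughly how I'm timing the SSE version (simplified; HashStripes is the wrapper name from the snippet above, and the iteration count here is arbitrary):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

uint32_t HashStripes(const uint8_t* data, int32_t size, uint32_t seed); // the SSE loop above

int main() {
    const size_t kSize = 64u * 1024u * 1024u;   // 64 MiB, a multiple of 16
    std::vector<uint8_t> buffer(kSize, 0xAB);   // contents don't matter for the timing

    uint32_t sink = 0;                          // keep the result live
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i)
        sink ^= HashStripes(buffer.data(), (int32_t)kSize, 0);
    auto stop = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    printf("hash %08x, %lld ms\n", sink, (long long)ms);
    return 0;
}

The scalar version is timed the same way over the same buffer.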