
This is yet another "SSE is slower than normal code, why?" type of question.
I know that there are a bunch of similar questions but they don't seem to match my situation.

I am trying to implement the Miller-Rabin primality test with Montgomery modular multiplication for fast modulo operations.
I implemented it in both a scalar and a SIMD way, and it turns out the SIMD version is around 10% slower.
In case anyone is wondering, [esp+16] (in the scalar code) and [esp+12] (in the SIMD code) point to the modular inverse of N.

I am really puzzled by the fact that psrldq, supposedly a 1-cycle-latency, 1c-throughput, 1-uop instruction, appears to take longer than pmuludq, which has 3-cycle latency, 0.5c throughput and 1 uop.

Below is the code and the run-time analysis from Visual Studio, run on a Ryzen 5 3600.

Any idea on how to improve the SIMD code, and/or why it is slower than the scalar code, is appreciated.

P.S. The run-time analysis seems to be off by one instruction for some reason.

EDIT 1: the comment in the image was wrong; a fixed version is attached below:

    ;------------------------------------------
    ; Calculate base^d mod x
    ;
    ; eax = 1
    ; esi = x
    ; edi = bases[eax]
    ; ebp = d
    ; while d do
    ;     if d & 1 then eax = (eax * edi) mod x
    ;     edi = (edi*edi) mod x
    ;     d >>= 1
    ; end
    ;------------------------------------------
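
For clarity, here is a minimal C sketch of that loop, assuming a hypothetical `montmul()` helper (the name is mine, not from the code) that performs one Montgomery multiplication:

    #include <stdint.h>

    /* Hypothetical helper: one Montgomery multiplication (a * b) mod x,
       where xinv is the modular inverse used by the reduction. */
    uint32_t montmul(uint32_t a, uint32_t b, uint32_t x, uint32_t xinv);

    /* base^d mod x, mirroring the commented loop: eax/edi/ebp in the asm
       correspond to result/base/d here. */
    uint32_t modexp(uint32_t base, uint32_t d, uint32_t x, uint32_t xinv)
    {
        uint32_t result = 1;   /* in the real code this is 1 in Montgomery form */
        while (d) {
            if (d & 1)
                result = montmul(result, base, x, xinv);  /* eax = (eax*edi) mod x */
            base = montmul(base, base, x, xinv);          /* edi = (edi*edi) mod x */
            d >>= 1;
        }
        return result;
    }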

Scalar code:

LOOP_MODEXP:
    push eax

    test ebp, 1                      ; current bit of d clear? then skip the multiply
    jz @F

    mul edi                          ; edx:eax = eax * edi              (T)
    mov ecx, edx                     ; ecx = hi(T)
    imul eax, DWORD PTR [esp+16]     ; eax = lo(T) * N'                 (m)
    mul esi                          ; edx:eax = m * N
    xor ebx, ebx
    sub ecx, edx                     ; hi(T) - hi(m*N)
    cmovs ebx, esi                   ; went negative? prepare +N
    add ecx, ebx
    mov DWORD PTR [esp], ecx         ; new eax, picked up by the pop below
@@:
    mov edx, edi
    mulx ecx, edx, edi               ; ecx:edx = edi * edi              (T)
    imul edx, DWORD PTR [esp+16]     ; edx = lo(T) * N'                 (m)
    mulx eax, ebx, esi               ; eax:ebx = m * N  (low half unused)
    xor ebx, ebx
    sub ecx, eax                     ; hi(T) - hi(m*N)
    cmovs ebx, esi                   ; went negative? prepare +N
    add ecx, ebx
    mov edi, ecx                     ; edi = (edi*edi) mod x

    pop eax

    shr ebp, 1
    jnz LOOP_MODEXP

Scalar run time analysis
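
Each of the mul/imul/mul/sub/cmovs/add sequences above performs one Montgomery reduction. As a reference point, this is my C reading of that step (a sketch, assuming [esp+16] holds N^-1 mod 2^32; it is not code from the original program):

    #include <stdint.h>

    /* (T - N*((T*Ninv) mod 2^32)) / 2^32, plus N if that went negative. */
    static uint32_t redc(uint64_t T, uint32_t N, uint32_t Ninv)
    {
        uint32_t m  = (uint32_t)T * Ninv;                         /* imul eax, [esp+16] */
        uint64_t mN = (uint64_t)m * N;                            /* mul esi            */
        uint32_t t  = (uint32_t)(T >> 32) - (uint32_t)(mN >> 32); /* sub ecx, edx       */
        if ((int32_t)t < 0)                                       /* cmovs ebx, esi     */
            t += N;                                               /* add ecx, ebx       */
        return t;
    }

    /* The loop body is then:
         if (d & 1) result = redc((uint64_t)result * base, N, Ninv);
         base = redc((uint64_t)base * base, N, Ninv);             */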

SIMD code:

    movd xmm2, DWORD PTR [esp+12]   ; N'
    movd xmm3, esi                  ; N
    pshufd xmm2, xmm2, 0            ; broadcast N' to all lanes
    pshufd xmm3, xmm3, 0            ; broadcast N  to all lanes

    movd xmm1, edi                  ; base

    pshufd xmm1, xmm1, 0            ; broadcast base
    movdqa xmm0, xmm1

    pinsrd xmm0, eax, 2             ; lane 0 = base, lane 2 = result (eax)

LOOP_MODEXP:
    movdqa xmm4, xmm0               ; save the old result lane
    pmuludq xmm0, xmm1              ; {base*base, result*base}        (T)
    movdqa xmm1, xmm0
    pmuludq xmm0, xmm2              ; lo(T) * N'                      (m)
    pmuludq xmm0, xmm3              ; m * N
    psubd xmm1, xmm0                ; T - m*N  (low dwords cancel)

    psrldq xmm1, 4                  ; /2^32: move the high dwords down
    pxor xmm0, xmm0
    pcmpgtd xmm0, xmm1              ; lanes where the result went negative
    blendvps xmm0, xmm3, xmm0       ; N where negative, 0 elsewhere
    paddd xmm0, xmm1                ; conditional +N

    movddup xmm1, xmm0              ; next multiplier: new base in lanes 0 and 2

    test ebp, 1
    jnz @F
    blendps xmm0, xmm4, 4           ; bit was 0: restore the old result lane

@@:
    shr ebp, 1
    jnz LOOP_MODEXP

    pextrd eax, xmm0, 2             ; result

SIMD run time analysis
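
For anyone who finds the shuffles hard to follow, here is my intrinsics rendering of that loop (a sketch, not the original source; it needs SSE4.1 and treats lane 0 as the running square of the base and lane 2 as the running result):

    #include <immintrin.h>
    #include <stdint.h>

    static uint32_t modexp_simd(uint32_t base, uint32_t result,
                                uint32_t d, uint32_t N, uint32_t Ninv)
    {
        __m128i ninv = _mm_set1_epi32((int)Ninv);            /* xmm2 */
        __m128i n    = _mm_set1_epi32((int)N);               /* xmm3 */
        __m128i b    = _mm_set1_epi32((int)base);            /* xmm1 */
        __m128i acc  = _mm_insert_epi32(b, (int)result, 2);  /* xmm0 */

        while (d) {
            __m128i saved = acc;                       /* movdqa xmm4, xmm0        */
            __m128i T  = _mm_mul_epu32(acc, b);        /* {base*base, result*base} */
            __m128i m  = _mm_mul_epu32(T, ninv);       /* lo(T) * Ninv             */
            __m128i mN = _mm_mul_epu32(m, n);          /* m * N                    */
            __m128i t  = _mm_sub_epi32(T, mN);         /* psubd: low dwords cancel */
            t = _mm_srli_si128(t, 4);                  /* /2^32: high dwords move down */
            __m128i neg = _mm_cmpgt_epi32(_mm_setzero_si128(), t);
            __m128i fix = _mm_castps_si128(_mm_blendv_ps(   /* N where negative, else 0 */
                              _mm_castsi128_ps(_mm_setzero_si128()),
                              _mm_castsi128_ps(n),
                              _mm_castsi128_ps(neg)));
            acc = _mm_add_epi32(fix, t);               /* conditional +N           */
            b = _mm_castpd_si128(                      /* movddup: new base in lanes 0 and 2 */
                    _mm_movedup_pd(_mm_castsi128_pd(acc)));
            if (!(d & 1))                              /* bit is 0: keep the old result lane */
                acc = _mm_castps_si128(_mm_blend_ps(
                          _mm_castsi128_ps(acc),
                          _mm_castsi128_ps(saved), 4));
            d >>= 1;
        }
        return (uint32_t)_mm_extract_epi32(acc, 2);
    }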

  • I have to be honest: I don't understand how the SIMD code is an implementation of the given algorithm. I don't see how the right shift and the subtraction come into play. I also don't understand what you are parallelizing and what is in `eax`. It also seems to me that sometimes you treat a register as holding doubles and sometimes floats, but it may be me not being used to reading SIMD code. Surely it would help to have that code commented, along with an explanation of how you are exploiting the parallelism. Wait, is that right shift there to compute 4 values at a time? – Margaret Bloom May 30 '21 at 14:55
  • it's the REDC algorithm explained in the Wikipedia page I linked about Montgomery modular multiplication. It's essentially (x - N * ((x * N') mod 2^32)) / 2^32, where x is eax*edi (at the top of the scalar code) or edi*edi (at the bottom). The SIMD code calculates both multiplications in one shot, and the shift is for the / 2^32 part. As the high 32-bit portion is stored in edx in the scalar code, there's no need to shift things around, but it is necessary with SIMD. Also, I treat xmm registers as 2 QWORD or 4 DWORD registers, not doubles/floats. – quaver May 30 '21 at 15:36
  • Actually, all I'm wondering is: why do single-uop instructions like paddd/psubd/psrldq/movddup take so long to execute? The actual algorithm doesn't matter at all, but FYI, they both do precisely the same thing. – quaver May 30 '21 at 15:40
  • 2
    `why do single-uop instructions like paddd/psubd/psrldq/movddup take so long to execute` - a high sample count on an instruction does not mean that particular instruction is the slow one. Your CPU is out of order and superscalar; there are many instructions in progress at the same time, so the samples will not be very accurate and will need further analysis to understand what a high percentage might mean at instruction-level granularity. – stepan May 30 '21 at 16:05
  • 1
    Without digging into the algorithm, your SIMD code seems to carry a dependency through all the `pmuludq` and the following instructions, which means the code is latency-bound. This is not the case in the scalar code, where `mulx`/`imul` can execute in parallel. The rest of the instructions are less expensive, so the sequence of multiplications makes a major contribution to the loop performance. – Andrey Semashev May 30 '21 at 16:09
  • 1
    Oh right, `psrldq` uses byte units for the shift, I see what you did. I was confused by the algorithm posted at the top of the question. I think that using SIMD just to perform two computations in parallel won't give much of a benefit, since the scalar code already uses two dep chains (as Andrey pointed out) and doesn't need all that shuffling around. I don't know how VS is profiling the code, but `pxor xmm0, xmm0` is shown in red with a high count (of what?), which is really odd. – Margaret Bloom May 30 '21 at 16:42
  • You'd probably get a bigger speedup from using 64-bit code. `64 * 64` (amount of work done by `mulx r64`) is greater than 2x `32 * 32` (amount of work done by `pmuludq`), and you don't have to add the pieces, so it's getting more work done than even AVX2 `vpmuludq ymm`. If you're restricted to 32-bit code, then maybe this is worth considering, though. – Peter Cordes May 30 '21 at 19:26 (a rough C sketch of such a 64-bit reduction follows these comments)
  • @stepan: high sample counts on an instruction usually mean that it *had to wait for its inputs*, e.g. a cheap add after a cache-miss load. Or an instruction that had to wait for `pmuludq`. – Peter Cordes May 30 '21 at 19:28
  • @PeterCordes If I use 64-bit multiplication, I need an extra shift to carry out the / 2^32, whereas it's free with 32-bit multiplication. – quaver May 30 '21 at 20:31
  • @PeterCordes thanks, btw do you know why `pxor` has such high sample count? – stepan May 30 '21 at 20:44
  • 1
    @stepan I think it's off by one instruction. There's simply no way `pxor` costs that much. If you think of it that way, it makes some sense: `psrldq` is the culprit instead of `pxor`. – quaver May 30 '21 at 20:47
  • @MargaretBloom I think you're right. I tried to squeeze every bit of performance out of this algorithm but the scalar version seems pretty optimal to me. – quaver May 30 '21 at 20:54
  • 1
    @stepan: Agreed with quaver, those pxor-zeroing counts very likely belong to `psrldq`. There's also a "skew" effect, where a later instruction gets blamed. e.g. an instruction is waiting for a slow result when the counter overflows, and the interrupt is handled *after* it retires, blaming whatever instruction is next. pxor-zeroing is dependency-breaking on Zen, although it does need an execution unit to write the zero (unlike Sandybridge-family where it's literally as cheap as a nop), but still there's no way it has significant cost unless there's some funky front-end effect. (Unlikely). – Peter Cordes May 30 '21 at 21:21
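
To illustrate the 64-bit suggestion from the comments: a Montgomery reduction on 64-bit limbs would look roughly like this in C (my sketch, assuming a compiler with unsigned __int128 and Ninv = N^-1 mod 2^64):

    #include <stdint.h>

    /* (T - N*((T*Ninv) mod 2^64)) / 2^64, plus N if that went negative. */
    static uint64_t redc64(unsigned __int128 T, uint64_t N, uint64_t Ninv)
    {
        uint64_t m    = (uint64_t)T * Ninv;            /* m = lo(T) * Ninv mod 2^64 */
        unsigned __int128 mN = (unsigned __int128)m * N;
        uint64_t hiT  = (uint64_t)(T >> 64);
        uint64_t hiMN = (uint64_t)(mN >> 64);
        uint64_t t    = hiT - hiMN;                    /* low halves cancel exactly */
        if (hiT < hiMN)                                /* result went negative      */
            t += N;
        return t;
    }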

1 Answer

  1. Your SIMD code wastes time mispredicting that test ebp, 1 / jnz branch. There’s no conditional move instruction in SSE, but you can still optimize away that test + branch with a few more instructions like this:
mov      ebx, ebp
and      ebx, 1
sub      ebx, 1
pxor     xmm5, xmm5
pinsrd   xmm5, ebx, 2
blendvps xmm0, xmm4, xmm5

instead of your

test    ebp, 1
jnz @F
blendps xmm0, xmm4, 4

The above code computes ebx = ( ebp & 1 ) ? 0 : -1;, inserts that integer into the third lane (index 2) of a zero vector, and uses that vector as the selector for the blendvps instruction.
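
In intrinsics terms, the same selector trick looks roughly like this (my sketch, not part of the original answer):

    #include <immintrin.h>
    #include <stdint.h>

    /* Build a mask that is -1 in lane 2 when the current bit of d is 0,
       then blend the previously saved result lane back in, branch-free. */
    static inline __m128i keep_or_restore(__m128i acc, __m128i saved, uint32_t d)
    {
        int sel = (int)(d & 1) - 1;   /* (d & 1) ? 0 : -1 */
        __m128i mask = _mm_insert_epi32(_mm_setzero_si128(), sel, 2);
        return _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(acc),
                                              _mm_castsi128_ps(saved),
                                              _mm_castsi128_ps(mask)));
    }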

  2. This instruction is not needed: pcmpgtd xmm0, xmm1. Along with the previous and the next one, it computes this:
xmm0 = _mm_cmplt_epi32( xmm1, _mm_setzero_si128() );
xmm0 = _mm_blendv_ps( xmm0, xmm3, xmm0 );

Here’s an equivalent:

xmm0 = _mm_blendv_ps( _mm_setzero_si128(), xmm3, xmm1 );

That comparison instruction tests the int32 lanes for xmm1 < 0, which just reproduces the sign bit of those integers. The _mm_blendv_ps instruction only tests the high bit of each 32-bit lane, so you don't really need to compare for xmm1 < 0 before it.

  3. Unless you need to support CPUs without AVX, you should use the VEX encoding of the instructions, even for code dealing with 16-byte vectors. Your SIMD code uses the legacy encoding, where most instructions take 2 arguments and write the result into the first one. Most VEX-encoded instructions take 3 arguments and can write the result into another register. This should get rid of a couple of redundant move instructions like movdqa xmm4, xmm0.
Soonts
  • @stepan Good catch. Updated. – Soonts May 30 '21 at 16:13
  • For some reason, the code runs even slower with its AVX counterpart. Also, `blendvps` requires xmm0 to be the third operand, so I had to add an extra movaps. – quaver May 30 '21 at 20:37
  • @quaver: AVX will allow some optimizations, but you have to actually take advantage of it manually in hand-written asm; without seeing your change, IDK why it'd be slower. (I forget if Zen2 has SSE/AVX transition penalties, but shouldn't matter with only 128-bit AVX instructions. Of course, Zen2 has 256-bit vector ALUs, so use them if you can also assume AVX2.) See also [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) for tricks using double-precision mantissas for 52-bit integer FMA. – Peter Cordes May 30 '21 at 21:25
  • Note that AVX `VBLENDVPS xmm1, xmm2, xmm3/m128, xmm4` has the blend control register number as an immediate operand, so it isn't stuck using xmm0/ymm0. (https://www.felixcloutier.com/x86/blendvps). (Fun fact: on Intel Skylake, `blendvps xmm` is single-uop, but `vblendvps` is 2 uops. Zen always runs it as a single uop, though. https://uops.info/table.html. Zen2's integer `vpblendvb` is also single-uop, but can't run on as many ports) – Peter Cordes May 30 '21 at 21:28