
I have this AVX code that runs much slower than the SSE4 version and I'm trying to figure out why.

Here is the smallish loop in SSE4:

(asm by gcc 13.1)

.L6:
        movaps  xmm1, XMMWORD PTR [rbx+rsi]
        movaps  xmm3, XMMWORD PTR [rbp+0+rsi]
        movaps  xmm2, xmm4
        lea     eax, [0+rcx*4]
        movd    xmm0, eax
        add     rcx, 1
        add     rsi, 16
        addps   xmm3, xmm1
        cmpleps xmm1, xmm5
        pshufd  xmm0, xmm0, 0
        paddd   xmm0, xmm6
        cmpleps xmm2, xmm3
        pand    xmm1, xmm2
        movmskps edi, xmm1
        mov     rax, rdi
        sal     rdi, 4
        pshufb  xmm0, XMMWORD PTR shufmasks.0[rdi]
        popcnt  eax, eax
        movups  XMMWORD PTR [r8], xmm0
        lea     r8, [r8+rax*4]
        cmp     rcx, rdx
        jne     .L6
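
For context, here is roughly what the intrinsics for that loop look like. This is a simplified sketch reconstructed from the asm, not the exact source: the names, the shufmasks layout (a 16-entry table of pshufb controls that left-packs the selected dword lanes), and the precise filter condition are assumptions.

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch only: filter two float arrays, writing the indices of elements
       that pass the range test to out, left-packed via a pshufb LUT. */
    static int32_t *filter_sse4(const float *in_a, const float *in_b,
                                float lo, float hi,
                                int32_t *out, size_t n,
                                const uint8_t shufmasks[16][16])
    {
        const __m128 vlo = _mm_set1_ps(lo);
        const __m128 vhi = _mm_set1_ps(hi);
        const __m128i base = _mm_setr_epi32(0, 1, 2, 3);

        for (size_t i = 0; i < n; i += 4) {
            __m128 a   = _mm_load_ps(in_a + i);
            __m128 sum = _mm_add_ps(a, _mm_load_ps(in_b + i));
            __m128 m   = _mm_and_ps(_mm_cmple_ps(a, vhi),     /* a <= hi     */
                                    _mm_cmple_ps(vlo, sum));  /* lo <= a + b */
            int mask = _mm_movemask_ps(m);

            /* Lane indices i+0..i+3, left-packed by the LUT entry for this mask. */
            __m128i idx = _mm_add_epi32(_mm_set1_epi32((int)i), base);
            idx = _mm_shuffle_epi8(idx,
                      _mm_load_si128((const __m128i *)shufmasks[mask]));
            _mm_storeu_si128((__m128i *)out, idx);
            out += _mm_popcnt_u32((unsigned)mask);
        }
        return out;
    }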

Because AVX (without AVX2) doesn't have 8-wide integer instructions, I use 8-wide registers so that the float math can be done 8-wide; for the integer work, I split the registers in place, compute on the two 4-wide halves, and put them back together:

(asm by gcc 13.1)

.L6:
        lea     eax, [0+rcx*8]
        vmovaps ymm7, YMMWORD PTR [rbx+rsi]
        xor     r10d, r10d
        add     rcx, 1
        vmovd   xmm1, eax
        xor     eax, eax
        vpshufd xmm0, xmm1, 0
        vcmpleps ymm6, ymm7, ymm3
        vmovdqa xmm1, xmm0
        vpaddd  xmm0, xmm0, xmm5
        vpaddd  xmm1, xmm1, xmm2
        vinsertf128 ymm0, ymm0, xmm1, 0x1
        vaddps  ymm1, ymm7, YMMWORD PTR [r12+rsi]
        add     rsi, 32
        vcmpleps ymm1, ymm4, ymm1
        vandps  ymm1, ymm1, ymm6
        vmovaps xmm6, xmm0
        vextractf128 xmm0, ymm0, 0x1
        vmovaps xmm7, xmm1
        vextractf128 xmm1, ymm1, 0x1
        vmovmskps edx, xmm7
        popcnt  eax, edx
        sal     rdx, 4
        vpshufb xmm6, xmm6, XMMWORD PTR shufmasks.0[rdx]
        vmovmskps edx, xmm1
        popcnt  r10d, edx
        sal     rdx, 4
        vmovdqa XMMWORD PTR [rsp+32], xmm6
        vpshufb xmm0, xmm0, XMMWORD PTR shufmasks.0[rdx]
        movsx   rdx, eax
        add     eax, r10d
        vmovups XMMWORD PTR [rsp+32+rdx*4], xmm0
        vmovdqa ymm6, YMMWORD PTR [rsp+32]
        cdqe
        vmovdqu YMMWORD PTR [rdi], ymm6
        lea     rdi, [rdi+rax*4]
        cmp     rcx, r8
        jne     .L6

(asm by gcc 12.2.0)

.L4:
    vmovaps ymm11, YMMWORD PTR [rbx+rsi]
    vaddps  ymm12, ymm11, YMMWORD PTR [r12+rsi]
    xor eax, eax
    add rsi, 32
    lea r11d, 0[0+rcx*8]
    add rcx, 1
    vcmpleps    ymm14, ymm11, ymm3
    vmovd   xmm1, r11d
    xor r11d, r11d
    vcmpleps    ymm13, ymm4, ymm12
    vpshufd xmm0, xmm1, 0
    vpaddd  xmm7, xmm0, xmm5
    vpaddd  xmm9, xmm0, xmm8
    vinsertf128 ymm10, ymm7, xmm9, 0x1
    vandps  ymm15, ymm13, ymm14
    vextractf128    xmm0, ymm10, 0x1
    vextractf128    xmm1, ymm15, 0x1
    vmovmskps   r13d, xmm15
    vmovmskps   r14d, xmm1
    popcnt  eax, r13d
    sal r13, 4
    movsx   rdx, eax
    popcnt  r11d, r14d
    sal r14, 4
    vpshufb xmm7, xmm10, XMMWORD PTR [r8+r13]
    add eax, r11d
    vpshufb xmm9, xmm0, XMMWORD PTR [r8+r14]
    vmovdqa XMMWORD PTR 32[rsp], xmm7
    cdqe
    vmovups XMMWORD PTR 32[rsp+rdx*4], xmm9
    vmovdqa ymm10, YMMWORD PTR 32[rsp]
    vmovdqu YMMWORD PTR [rdi], ymm10
    lea rdi, [rdi+rax*4]
    cmp rcx, r9
    jne .L4
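
Again for context, a sketch of the AVX1 intrinsics (reconstructed, not the exact source; same assumed names as the SSE4 sketch above): the float math is 8-wide, the index math is done on two 128-bit halves and re-joined with vinsertf128, and the two left-packed halves are recombined through a 32-byte stack buffer (visible as [rsp+32] in the asm) for a single 256-bit store.

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* Sketch only: AVX1 version of the same filter, integer work on 128-bit halves. */
    static int32_t *filter_avx1(const float *in_a, const float *in_b,
                                float lo, float hi,
                                int32_t *out, size_t n,
                                const uint8_t shufmasks[16][16])
    {
        const __m256 vlo = _mm256_set1_ps(lo);
        const __m256 vhi = _mm256_set1_ps(hi);
        const __m128i base_lo = _mm_setr_epi32(0, 1, 2, 3);
        const __m128i base_hi = _mm_setr_epi32(4, 5, 6, 7);

        for (size_t i = 0; i < n; i += 8) {
            __m256 a   = _mm256_load_ps(in_a + i);
            __m256 sum = _mm256_add_ps(a, _mm256_load_ps(in_b + i));
            __m256 m   = _mm256_and_ps(_mm256_cmp_ps(a, vhi, _CMP_LE_OQ),
                                       _mm256_cmp_ps(vlo, sum, _CMP_LE_OQ));

            /* No 256-bit paddd with AVX1: build the index vector in two halves. */
            __m128i ibase = _mm_set1_epi32((int)i);
            __m256i idx   = _mm256_insertf128_si256(
                                _mm256_castsi128_si256(_mm_add_epi32(ibase, base_lo)),
                                _mm_add_epi32(ibase, base_hi), 1);

            /* Split mask and indices again for the two 4-wide left-packs. */
            int mask_lo = _mm_movemask_ps(_mm256_castps256_ps128(m));
            int mask_hi = _mm_movemask_ps(_mm256_extractf128_ps(m, 1));
            __m128i pk_lo = _mm_shuffle_epi8(_mm256_castsi256_si128(idx),
                                _mm_load_si128((const __m128i *)shufmasks[mask_lo]));
            __m128i pk_hi = _mm_shuffle_epi8(_mm256_extractf128_si256(idx, 1),
                                _mm_load_si128((const __m128i *)shufmasks[mask_hi]));

            unsigned cnt_lo = (unsigned)_mm_popcnt_u32((unsigned)mask_lo);
            unsigned cnt_hi = (unsigned)_mm_popcnt_u32((unsigned)mask_hi);

            /* Recombine through a stack buffer, then do one 256-bit store. */
            alignas(32) int32_t buf[8];
            _mm_store_si128((__m128i *)buf, pk_lo);
            _mm_storeu_si128((__m128i *)(buf + cnt_lo), pk_hi);
            _mm256_storeu_si256((__m256i *)out,
                                _mm256_load_si256((const __m256i *)buf));
            out += cnt_lo + cnt_hi;
        }
        return out;
    }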

The AVX loop body is roughly twice as long, but it also does twice as much work per iteration, so that should even out. Yet when I measure this code, the AVX version runs as slowly as the scalar version. Does anything stand out as being particularly slow in the AVX loop? Is mixing 4-wide and 8-wide instructions in the same loop known to hurt performance that badly? Or is it something else? Is there something I can fix to make the AVX version at least catch up to the SSE4 version?

aganm
  • What CPU are you running on; does it only have AVX1 and not AVX2 for 256-bit `vpaddd ymm`? There's no inherent penalty for mixing 128-bit and 256-bit vector widths, as long as both use VEX encodings to avoid SSE/AVX transition penalties. – Peter Cordes Jul 24 '23 at 01:09
  • https://uica.uops.info/ predicts that both loops will run about the same bytes/cycle on Skylake, with the SSE version running 5.5c/iter and the AVX version running 10.05c/iter, mostly limited by a bottleneck on front-end issue bandwidth. So it would help to save front-end uops, like increment a vector by 4 instead of `lea eax, [rcx*4]` / `add rcx, 1` / `movd xmm0, eax` / `pshufd xmm0, xmm0, 0` (broadcast). Also, what exactly is this doing? A SIMD compare and using its popcount to look up a shuffle control for a scaled counter? – Peter Cordes Jul 24 '23 at 01:12
  • Also, the AVX1 version has a `cdqe` that you could probably eliminate by using `unsigned` when appropriate to avoid sign-extension from `int` to pointer width. And `vinsertf128` / `vextractf128` aren't free. – Peter Cordes Jul 24 '23 at 01:15
  • @PeterCordes I'm running this on a i5 11600k, so it's not the hardware it's meant to run on. The code is meant for the few gens of intel CPUs that do not have avx2 but does have avx1, like Sandy Bridge. (and AMD alike) – aganm Jul 24 '23 at 01:18
  • @PeterCordes Oh and what it does, it's a filter, it checks whether float values are within a range of values and stores the indices of those values in an output buffer. – aganm Jul 24 '23 at 01:20
  • @PeterCordes Thank you for pointing out the sign-extension, that's something I should definitely take care of. – aganm Jul 24 '23 at 01:22
  • Have you considered making 2 versions of the code, an SSE version and an AVX2 version, and skipping AVX1-without-AVX2? That's only Sandy/Ivy Bridge and Bulldozer family (and low-power Jaguar), all long obsolete from 10 years ago and more. And as you found, the 128-bit version using SSE isn't faster. Of course, left-packing with AVX2 is also a problem to do efficiently, especially for AMD Zen1/2 where `pext` is slow: [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) . – Peter Cordes Jul 24 '23 at 01:41
  • Your Rocket Lake CPU could do it *much* faster with AVX-512 `vpcompressd` to shuffle 256-bit or 512-bit vectors, and `kmov` / `popcnt` for the pointer increment. Same for Zen 4. But AVX2 is kind of stuck in the middle, with too many bits to index a big table of shuffle vectors, but without left-packing as a hardware-supported primitive operation (except for bits via BMI2 `pext`.) So 128-bit shuffle lookups are probably the way to go if you want an AVX2 version that doesn't face-plant on Zen 1 / Zen 2. – Peter Cordes Jul 24 '23 at 01:41
  • When you say "much slower", what speed ratio? https://uica.uops.info/ predicts 4.52c / iter (128 bit = 8 elements) vs. 7.96c / iter (256-bit = 8 elements), so it predicts a speedup in terms of bytes/cycle for your AVX version vs. the SSE version. You do have a fairly decent amount of SIMD FP work to do before splitting into 128-bit halves for left-packing. (Or with AVX2, possibly `vmovdqu` / `vinserti128` loads from the LUT to set up for one `vpshufb ymm` and `vmovdqu` store / `vextractf128` store of the halves of the index vector after a single `vpaddd ymm`.) – Peter Cordes Jul 24 '23 at 01:44
  • @PeterCordes On a small benchmark with 10k random float values, the AVX version is 4x slower than the SSE4 version. The scalar version is also 4x slower than the SSE4 version. I have considered dropping the AVX version altogether but I would like to figure out why the AVX is sooo slow. I will try to put in practice what you have told me so far and see if I can fix it. – aganm Jul 24 '23 at 01:54
  • I think we need a [mcve] of your whole benchmark, since your AVX version should be faster on your CPU. If you're timing both in the same program, it could be warm-up effects if you're timing the AVX version first for only one pass over the array. ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)). I don't think 4x is explainable by anything in the actual loop itself, unless uiCA missed a false dependency or something. (But popcnt doesn't have one). – Peter Cordes Jul 24 '23 at 02:20
  • Or it could perhaps be soft-throttling of 256-bit vector throughput, if your CPU gets up to max turbo for scalar / 128-bit vectors and then soft-throttles 256-bit `vaddps` / `vcmpps` if the voltage is below or frequency is above L1 "license" levels. ([SIMD instructions lowering CPU frequency](https://stackoverflow.com/a/56861355)). This is the effect Agner Fog guessed was due to waiting for the high halves of 256-bit execution units to "power on" in Haswell / Skylake, but the actual cause is soft throttling to avoid current spikes when the CPU voltage is minimal. – Peter Cordes Jul 24 '23 at 02:20
  • @PeterCordes Progress: I wrote a minimal example of mixing 4-wide and 8-wide instructions together, and it ran fine, so the problem must be with something else. – aganm Jul 24 '23 at 08:45
  • @PeterCordes Big progress: I ran it on my native windows machine, and it ran fine, slower still, but not by a ridiculous margin like before. I ran it on my native arch machine, and it ran fine, slower still, but not by a ridiculous margin like before. It's either of these two things: debian's fault: I don't have a native debian installed right now so I can't be sure, or the VM's fault: I've never had any performance issue with it, except for this. I place my bet that it's the VM's fault, maybe it's a bug in it? Many thanks for your help. I'm not sure what to do next other than ignore it. – aganm Jul 24 '23 at 08:56
  • A VM can make page faults and TLB misses even slower, but if that's what your benchmark is bottlenecking on, you're benchmarking wrong. i5 11600k doesn't have any weird alignment-sensitive behaviour AFAIK, not like the Skylake JCC erratum. Maybe your CPU frequency "governor" isn't ramping up to max turbo quickly, if your benchmark is badly written without warm-up runs? On i5 11600k. the AVX version should be *faster* for the same sized array, if you use compiler options that make the same asm you showed in the question for the inner loops. – Peter Cordes Jul 24 '23 at 16:18
  • @PeterCordes Okay.. so I tried to compile with clang in the VM instead of gcc and it's running good. So now I suspect a bug with gcc, but specifically the gcc that is shipped by debian 12, because I compiled it with gcc on my native arch machine and it ran fine. So the bad performance only happens under these conditions: when compiled inside the debian 12 VM with gcc (I tried both gcc 10 and gcc 12, both run slow). I haven't tested it with a native debian machine though, so I'm still not 100% sure if it's not something weird happening between debian gcc and the VM or it's only debian's fault. – aganm Jul 24 '23 at 23:22
  • The Debian VM is on your Rocket Lake machine? (Or are all of these different distros + Windows on the same CPU?) And the inner loop's asm matches what you posted in the question? Different GCC versions can make different asm; GCC sometimes has regressions. Also, different toolchain versions can end up with different alignments for instructions. This inner loop should run from the LSD (loop buffer) on Rocket Lake and not be affected, though. – Peter Cordes Jul 24 '23 at 23:58
  • If any of this testing involves a Skylake-family CPU, definitely see [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) – Peter Cordes Jul 24 '23 at 23:58
  • @PeterCordes Yes, my Debian VM is on my Rocket lake machine. My Windows native is on the Rocket Lake too. And my native Arch is on a Kaby Lake. Also it's true that the asm was slightly different from compiler version to compiler version, I added the exact asm from gcc 12 in the question alongside the old one which was taken from gcc 13. – aganm Jul 25 '23 at 00:27
  • https://uica.uops.info/ predicts only slightly slower performance on Rocket Lake for the GCC12 asm vs. the original code in the question. (8.0 c / iter limited by the LSD). It also predicts that Kaby Lake will hit the JCC erratum if the top of the loop is at a 32-byte alignment boundary. But that shouldn't affect Rocket Lake. Might be a good idea to disassemble the actual executable. Maybe also to profile it and make sure that almost all its time was actually spent in the loop you're asking about rather than elsewhere. `perf record` probably doesn't work in a VM, for HW events at least – Peter Cordes Jul 25 '23 at 00:33
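
For reference, a sketch of what the suggestions from the comments would look like in intrinsics (assumed names matching the sketches in the question; not benchmarked code): carry the lane-index vector across iterations and bump it with a single vpaddd instead of rebuilding it from the scalar counter (lea/movd/pshufd), and on AVX-512 CPUs (Rocket Lake, Zen 4) do the left-pack with vpcompressd.

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch only: AVX-512 (F+VL) version of the filter using vpcompressd. */
    static int32_t *filter_avx512(const float *in_a, const float *in_b,
                                  float lo, float hi, int32_t *out, size_t n)
    {
        const __m256 vlo = _mm256_set1_ps(lo);
        const __m256 vhi = _mm256_set1_ps(hi);
        __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i step = _mm256_set1_epi32(8);

        for (size_t i = 0; i < n; i += 8) {
            __m256 a   = _mm256_load_ps(in_a + i);
            __m256 sum = _mm256_add_ps(a, _mm256_load_ps(in_b + i));
            __mmask8 k = _mm256_cmp_ps_mask(a, vhi, _CMP_LE_OQ)
                       & _mm256_cmp_ps_mask(vlo, sum, _CMP_LE_OQ);
            _mm256_mask_compressstoreu_epi32(out, k, idx);  /* vpcompressd store */
            out += _mm_popcnt_u32(k);                       /* kmov + popcnt     */
            idx = _mm256_add_epi32(idx, step);              /* one vpaddd/iter   */
        }
        return out;
    }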

0 Answers