
Basically I am trying to understand why both gcc and clang use xmm registers for their __builtin_memset, even when the memory destination and size are both divisible by the width of a ymm (or zmm, for that matter) register and the CPU supports AVX2 / AVX512.

And why GCC implements __builtin_memset for medium sizes without any SIMD at all (again, assuming the CPU supports SIMD).

For example:

    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
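
In full, the function I'm compiling is roughly this (the function name is mine, built with something like -O3 -march=skylake-avx512; the complete examples are in the godbolt link below):

    void fill64(void *ptr)
    {
        /* destination is 64-byte aligned, fill 64 bytes with 0xFF */
        __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
    }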

Will compile to:

        vpcmpeqd        %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, (%rdi)
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm0, 32(%rdi)
        vmovdqa %xmm0, 48(%rdi)

I am trying to understand why this is chosen as opposed to something like

        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vmovdqa %ymm0, (%rdi)
        vmovdqa %ymm0, 32(%rdi)
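
With intrinsics, the ymm version I have in mind is roughly this (my own sketch, not something either compiler emits; needs -mavx2):

    #include <immintrin.h>

    void fill64_ymm(void *ptr)
    {
        char *p = __builtin_assume_aligned(ptr, 64);
        __m256i ones = _mm256_set1_epi8(-1);           /* all bytes 0xFF */
        _mm256_store_si256((__m256i *)p, ones);        /* bytes 0..31  */
        _mm256_store_si256((__m256i *)(p + 32), ones); /* bytes 32..63 */
    }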

If you mix the __builtin_memset with AVX2 instructions they still use xmm, so it's definitely not to save the vzeroupper.

Second, GCC implements __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512) as:

        movq    $-1, %rdx
        xorl    %eax, %eax
.L8:
        movl    %eax, %ecx
        addl    $32, %eax
        movq    %rdx, (%rdi,%rcx)
        movq    %rdx, 8(%rdi,%rcx)
        movq    %rdx, 16(%rdi,%rcx)
        movq    %rdx, 24(%rdi,%rcx)
        cmpl    $512, %eax
        jb      .L8
        ret

Why would gcc choose this over a loop with xmm (or ymm / zmm) registers?
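
For comparison, this is the kind of SIMD loop I would have expected (again just my own sketch with AVX2 intrinsics, assuming the 64-byte alignment really holds):

    #include <immintrin.h>
    #include <stddef.h>

    void fill512_ymm(void *ptr)
    {
        char *p = __builtin_assume_aligned(ptr, 64);
        __m256i ones = _mm256_set1_epi8(-1);
        for (size_t i = 0; i < 512; i += 64) {
            /* two aligned 32-byte stores per iteration = 64 bytes */
            _mm256_store_si256((__m256i *)(p + i), ones);
            _mm256_store_si256((__m256i *)(p + i + 32), ones);
        }
    }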

Here is a godbolt link with the examples (and a few others).

Thank you.

Edit: clang uses ymm (but not zmm).

Noah
  • Some of https://stackoverflow.com/questions/52523349/avx-512-vs-avx2-performance-for-simple-array-processing-loops/52523647#52523647 might be relevant. – Nate Eldredge Jan 02 '21 at 02:07
  • Thanks! That seems mostly about the pros / cons of AVX512 vs AVX2 whereas GCC is still using SSE. – Noah Jan 02 '21 at 20:21
  • I would have used AVX2 `_mm256_xor_si256` in your ifdef part, instead of AVX512 `_mm256_xor_epi32` (`vpxord`). That lets it compile with older `-march` options. But yeah, looks like especially GCC could use some tuning work; even with `-march=knl` (Xeon Phi Knight's Landing, where there's zero advantage to avoiding ZMM), it still only uses XMM. – Peter Cordes Jan 04 '21 at 04:51
  • GCC has a silly missed-optimization with saving the incoming integer args across function calls: it aligns the stack so it can save *two vectors* https://godbolt.org/z/7eqEGK Clang does a scalar XOR into a call-preserved integer reg, then broadcasts from that into a vector! However, none of your VZERO stuff will let the non-inlined memcpy calls avoid `vzeroupper` themselves, which might encourage wider vectors given the `-mprefer-vector-width=256` default for CPUs like Skylake and Ice Lake. (Possibly should be 512 for ICL; Intel's manual says frequency penalties are lower.) – Peter Cordes Jan 04 '21 at 04:55
  • Yeah, you should probably just report this as a GCC missed-optimization bug on https://gcc.gnu.org/bugzilla/. Along with https://godbolt.org/z/f7o9sc - if you let the calls inline (so the largest one makes the earlier ones "dead"), GCC uses a scalar loop unrolled by 4, instead of SIMD stores!! (`-mno-vzeroupper` has no effect either). – Peter Cordes Jan 04 '21 at 04:59
  • Looks like GCC doesn't know how to use `vmovdqa` in a loop for inline expansion of memset. For up to 256 bytes, it will fully unroll with `vmovdqa xmm`, but for more bytes it will use 4x `movq %r64, m64` in a loop with a silly indexed addressing mode. (Or for `-march=sandybridge` where that would un-laminate, uses `rep stosq` https://godbolt.org/z/P145b6). – Peter Cordes Jan 04 '21 at 05:12
  • @PeterCordes the scalar loop seems pretty insane. Is there some rationale for it (i.e. some issue with an `xmm` / `zmm` loop ([glibc doesn't think so](https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S#L203))?). Is there any world where scalar loop > `rep stosq` > SIMD loop? (The only case I can imagine is if the SIMD units are in low-power mode and it's a small copy, but given that GCC emitted SIMD instructions before the scalar loop, I can't imagine that was part of the optimization plan.) – Noah Jan 04 '21 at 18:10
  • @Noah: `rep stosq` has a lot of startup overhead. A scalar loop handles misalignment even on ancient CPUs like Core2 where `movdqu` is expensive and is not bad for small counts. But it's never going to be better than half as many `movdqa` in a loop for aligned data, probably not even if it decodes to multiple uops on Pentium-M (two 64-bit halves). GCC's memset expansion tuning probably hasn't been looked at for a while, maybe not since then? And doesn't take advantage of alignment, and is pretty primitive. – Peter Cordes Jan 04 '21 at 18:28
  • @PeterCordes GCC might be doing the right thing given [frequency, IPC, and voltage transitions from using AVX2 and AVX512](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html). Depending on the length, at least, I could see the case for sticking with `xmm`. – Noah Feb 20 '21 at 23:22
  • @Noah: `movdqa xmm` never has that problem. It's always dumb to be using twice as many scalar stores. Avoiding YMM sometimes makes sense (if surrounding code never uses them either), but the `movq %rdx, (%rdi,%rcx)` ... loop never makes any sense. (Unless tuning for Core2 or earlier without known alignment, neither of which is true here.) – Peter Cordes Feb 20 '21 at 23:33
  • @PeterCordes yeah, I was referring to the choice to use `xmm` instead of `ymm`. – Noah Feb 20 '21 at 23:37

0 Answers