Why is vzeroupper being inserted at the end of this code?

Question

I noticed something strange when compiling this code with MSVC on Godbolt:

#include <intrin.h>
#include <cstdint>

void test(unsigned char*& pSrc) {
    __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pSrc));

    int32_t mask = _mm256_movemask_epi8(data);
    if (!mask) {
        ++pSrc;
    }
    else {
        unsigned long v;
        _BitScanForward(&v, mask);
        pSrc += v;
    }
}

I get this resulting assembly:

pSrc$ = 8
void test(unsigned char * &) PROC                                ; test, COMDAT
        mov     rdx, QWORD PTR [rcx]
        vmovdqu ymm0, YMMWORD PTR [rdx]
        vpmovmskb eax, ymm0
        test    eax, eax
        jne     SHORT $LN2@test
        mov     eax, 1
        add     rax, rdx
        mov     QWORD PTR [rcx], rax
        vzeroupper                                               ; Why is this being inserted?
        ret     0
$LN2@test:
        bsf     eax, eax
        add     rax, rdx
        mov     QWORD PTR [rcx], rax
        vzeroupper                                               ; Why is this being inserted?
        ret     0
void test(unsigned char * &) ENDP                                ; test

Why is `vzeroupper` being inserted before every `ret`? I heard that it is needed when switching between SSE and AVX code, but I'm not doing that here: this function uses only AVX instructions.

I was wondering, does this pose a performance problem?

You're returning from a YMM-using function to a caller that potentially isn't AVX-aware and will use SSE instructions. [Do I need to use \_mm256\_zeroupper in 2021?](https://stackoverflow.com/a/68738289) describes when and why compilers use `vzeroupper` automatically. (Not quite a duplicate because that's only part of the answer to that question.) — Peter Cordes, Aug 19 '21 at 04:57
And BTW, `++pSrc;` only increments by 1 byte, leaving 31 other bytes to be re-checked if you call this in a loop. Normally you want to distinguish between found somewhere vs. not found at all to enable a search loop to work properly. — Peter Cordes, Aug 19 '21 at 13:24
If the question is how to avoid `vzeroupper` in MSVC, then apparently it is not possible. The exception is when the `test` function is inlined into an outer loop: `vzeroupper` then disappears from the inlined code, because it is emitted at the exits of the outer function instead. — Alex Guteniev, Aug 29 '21 at 20:32
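
To make the two suggestions in the comments concrete, here is a minimal sketch of a search loop that advances 32 bytes per step and tells "found" apart from "not found" by returning `end`. The names `find_high_bit`, `find_high_bit_avx2`, `find_high_bit_scalar` and the `SKETCH_*` macros are made up for illustration and are not from the question. The preprocessor guards are only there so the sketch also builds with GCC/Clang without extra flags. Because the loop lives inside a single function, a compiler should only need `vzeroupper` where that function returns or calls out, not once per 32-byte step.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_X86 1
#endif
#if defined(_MSC_VER)
#include <intrin.h>   // _BitScanForward
#endif
#if defined(__GNUC__) || defined(__clang__)
// Enables AVX2 for one function only, so no -mavx2 flag is required.
#define SKETCH_AVX2_FN __attribute__((target("avx2")))
#else
#define SKETCH_AVX2_FN
#endif

// Scalar search: used for the tail and as a fallback without AVX2.
// Returns the first byte in [p, end) with its top bit set, or end.
static const unsigned char* find_high_bit_scalar(const unsigned char* p,
                                                 const unsigned char* end) {
    while (p != end && !(*p & 0x80)) ++p;
    return p;
}

#ifdef SKETCH_X86
// Same contract as the scalar version, but checks 32 bytes per iteration.
SKETCH_AVX2_FN
static const unsigned char* find_high_bit_avx2(const unsigned char* p,
                                               const unsigned char* end) {
    while (end - p >= 32) {
        __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        std::uint32_t mask = static_cast<std::uint32_t>(_mm256_movemask_epi8(data));
        if (mask != 0) {
#if defined(_MSC_VER)
            unsigned long idx;
            _BitScanForward(&idx, mask);
#else
            unsigned idx = static_cast<unsigned>(__builtin_ctz(mask));
#endif
            return p + idx;   // found: position of the lowest set mask bit
        }
        p += 32;              // nothing in this block: skip all 32 bytes
    }
    return find_high_bit_scalar(p, end);   // fewer than 32 bytes left
}
#endif

// Hypothetical entry point. Returns end when no byte matches.
const unsigned char* find_high_bit(const unsigned char* p,
                                   const unsigned char* end) {
    assert(p <= end);
#if defined(SKETCH_X86) && !defined(_MSC_VER)
    if (__builtin_cpu_supports("avx2")) return find_high_bit_avx2(p, end);
    return find_high_bit_scalar(p, end);
#elif defined(SKETCH_X86)
    return find_high_bit_avx2(p, end);   // assumption: AVX2 hardware, as in the question
#else
    return find_high_bit_scalar(p, end);
#endif
}
```

A caller would compare the result against `end` to see whether anything was found, e.g. `if (find_high_bit(buf, buf + len) != buf + len) { ... }`. Unlike the `++pSrc` branch in the question, a block with no match is never re-examined.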
