
This is related to, but distinct from, this question:

How to clear the upper 128 bits of __m256 value?

Let me start with what I believe to be the "correct" intrinsics code.

__m256i mask()
{
    return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
}

This code sets the low 128 bits of the __m256i value to -1 (all-ones) and the high 128 bits to 0.
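As a quick sanity check, here is a minimal harness (my own sketch, not part of any compiler output discussed below; compile with `-mavx`) that stores the result and prints the four 64-bit lanes:

    #include <immintrin.h>
    #include <stdio.h>

    __m256i mask()
    {
        return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    }

    int main(void)
    {
        long long v[4];
        _mm256_storeu_si256((__m256i *)v, mask());
        /* Expected output: -1 -1 0 0 (all-ones low lane, zeroed high lane) */
        printf("%lld %lld %lld %lld\n", v[0], v[1], v[2], v[3]);
        return 0;
    }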

Here is the assembly I want to see:

vpcmpeqd %xmm0,%xmm0,%xmm0

At least, this is what I think I want to see: vpcmpeqd of a register against itself produces all-ones, and any VEX-encoded 128-bit instruction implicitly zeroes bits 255:128 of the full YMM register, so this single instruction should yield exactly the mask above. I believe it to be (a) correct and (b) optimal. Please correct me if I am wrong.

Now, never mind that GCC does not have _mm256_zextsi128_si256 prior to GCC 10. I have found no way to convince any of the compilers I have tried (Clang trunk, GCC trunk, Intel Compiler 19) to generate this simple one-insn output. Try for yourself on godbolt. Clang in particular does pretty poorly, since it "figures out" the constant and loads it from memory. And don't get me started on MSVC...
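(For pre-10 GCC, the closest fallback I can think of is a sketch like the following, which leans on the fact that GCC happens to zero the upper bits here even though _mm256_castsi128_si256 formally leaves them undefined; `mask_compat` is just an illustrative name:)

    #include <immintrin.h>

    __m256i mask_compat()
    {
    #if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 10
        /* No _mm256_zextsi128_si256 before GCC 10. The cast officially
           leaves bits 255:128 undefined, but GCC zeroes them in practice. */
        return _mm256_castsi128_si256(_mm_set1_epi8(-1));
    #else
        return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    #endif
    }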

The GCC and IC19 outputs for the intrinsics version are not too bad; they just have one extra vmov... from %xmm0 to itself. It still bothers me, although maybe that move is basically free and it shouldn't (?)

The only way I have found to generate this single insn is like so:

__m256i mask()
{
    __m256i result;
    __asm__ ("vpcmpeqd %%xmm0,%%xmm0,%%xmm0" : "=Yz" (result));
    return result;
}

This does what I want on GCC and IC19, but of course it does not work on MSVC. And it gives a compilation error on Clang (godbolt again). Aside: Should I report this as a Clang bug?

It seems to me this is a specific case of a more general problem, which is obtaining optimal code when I actually want to zero out the high part of a YMM register. The intrinsics support in the major compilers does not quite seem up to the task, and there is no inline asm constraint meaning "YMM register, but named as its XMM counterpart".
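The closest thing I am aware of is GCC's `x` operand modifier, which prints a vector operand under its XMM name. Something like the sketch below (`mask2` is just an illustrative name) lets the compiler pick the register instead of hard-coding xmm0, though it is still a template modifier rather than a constraint, and I have not verified it beyond GCC:

    __m256i mask2()
    {
        __m256i result;
        /* "=x" allows any SSE/AVX register; %x0 prints the chosen register
           under its xmm name, so the VEX-encoded 128-bit vpcmpeqd
           implicitly zeroes bits 255:128 of the full ymm register. */
        __asm__ ("vpcmpeqd %x0, %x0, %x0" : "=x" (result));
        return result;
    }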

Am I missing something?

(Update)

I have filed bugs 45806 and 45808 against Clang and 94962 against GCC.

  • *Although maybe that is basically free and it shouldn't (?)* `mov` / `vmovdqa` is never free, even if it's eliminated in the back-end. [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342) It still takes a front-end uop, and space in the I-cache. And space in the ROB, costing space in the out-of-order execution window. At least it avoids needing a vector ALU port on CPUs after IvyBridge / Bulldozer, but yes, it seems to be a compiler missed-optimization bug. And yes, a single `vpcmpeqd xmm` is what you want here. – Peter Cordes May 05 '20 at 02:59
  • https://godbolt.org/z/B2Lkfl - `_mm256_set_m128i(_mm_setzero_si128(), _mm_set1_epi8(-1));` is even worse. GCC wastes multiple instructions including `vinserti128`. Clang does constant-propagation and uses a 32-byte load. – Peter Cordes May 05 '20 at 03:46
  • @PeterCordes: Yes, I tried that too. Actually, Clang seems to emit the same code for everything I try; it's pretty good at "figuring out" the constant and emitting what it wants. I am not sure how much worse the load is than `vpcmpeqd xmm` for my application anyway, tbh. Thanks for your interest, by the way. You are one of the people I was hoping would take a look. – Nemo May 05 '20 at 04:03
  • Generally loading constants from memory is something compilers assume is cheap, or cheaper than spending code-size to construct them from mov-immediate or whatever. In practice that's true when you call a function frequently so it stays hot in cache. And if not, hopefully a miss is amortized over many loop iterations that use the constant. It's good for code-size, if not overall size. `vpcmpeqd` is almost certainly better than loading from memory as setup for a loop. And yes, clang has a good shuffle optimizer even for cases where the data isn't constant. – Peter Cordes May 05 '20 at 04:36
  • @PeterCordes: Clang does even weirder things with this [in context](https://gcc.godbolt.org/z/GPhJ6s). – Nemo May 05 '20 at 19:45

0 Answers