This is related to, but distinct from, this question:
How to clear the upper 128 bits of __m256 value?
Let me start with what I believe to be the "correct" intrinsics code.
__m256i mask()
{
    return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
}
This code sets the low 128 bits of the __m256i value to -1 (all-ones) and the high 128 bits to 0.
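To make the intended value concrete, here is a minimal sketch that builds the same bit pattern and checks it byte by byte. It only verifies the value, not the generated code, and it makes no claim about which instructions a compiler emits for it. The helper names `store_mask` and `mask_is_correct` are hypothetical. I use `_mm256_insertf128_si256` instead of `_mm256_zextsi128_si256` so that it also builds on compilers that lack the latter, and I assume the GCC/Clang `target` attribute so that it compiles without `-mavx`.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: builds the mask and stores it to out[32].
   Assumption: GCC/Clang-style target attribute, so no -mavx is needed. */
__attribute__((target("avx")))
static void store_mask(uint8_t out[32])
{
    /* Start from an all-zero 256-bit value and insert all-ones into the
       low 128-bit lane; the high lane stays zero. */
    __m256i m = _mm256_insertf128_si256(_mm256_setzero_si256(),
                                        _mm_set1_epi8(-1), 0);
    _mm256_storeu_si256((__m256i *)out, m);
}

/* Returns 1 if bytes 0..15 are 0xFF and bytes 16..31 are 0x00, else 0. */
static int mask_is_correct(void)
{
    uint8_t bytes[32];
    store_mask(bytes);
    for (int i = 0; i < 32; i++) {
        uint8_t expected = (i < 16) ? 0xFF : 0x00;
        if (bytes[i] != expected)
            return 0;
    }
    return 1;
}
```

Calling `mask_is_correct()` on a CPU with AVX should return 1; on a CPU without AVX the call must be skipped, for example by checking `__builtin_cpu_supports("avx")` first.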
Here is the assembly I want to see:
vpcmpeqd %xmm0,%xmm0,%xmm0
At least, this is what I think I want to see, in that I believe it to be (a) correct and (b) optimal. Please correct me if I am wrong.
Now, never mind that GCC does not have _mm256_zextsi128_si256 prior to GCC 10. I have found no way to convince any of the compilers I have tried (Clang trunk, GCC trunk, Intel Compiler 19) to generate this simple one-instruction output. Try for yourself on godbolt. Clang in particular does pretty poorly, since it "figures out" the constant and loads it from memory. And don't get me started on MSVC...
The GCC and IC19 outputs are not too bad; they just have one extra vmov... from %xmm0 to itself. But it still bothers me, although maybe that move is basically free and it shouldn't (?)
The only way I have found to generate this single instruction is like so:
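__m256i mask()
{
    __m256i result;
    /* "Yz" pins the output to the first SSE register (%xmm0), which the asm
       string names explicitly; the VEX-encoded write zeroes the upper bits. */
    __asm__ ("vpcmpeqd %%xmm0,%%xmm0,%%xmm0" : "=Yz" (result));
    return result;
}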
This does what I want on GCC and IC19, but of course it does not work on MSVC. And it gives a compilation error on Clang (godbolt again). Aside: Should I report this as a Clang bug?
It seems to me this is a specific case of a more general problem, which is obtaining optimal code when I actually want to zero out the high part of a YMM register. The intrinsics support in the major compilers does not quite seem up to the task, and there is no inline asm constraint meaning "YMM register, but named as its XMM counterpart".
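For reference, here is a minimal sketch of the general operation at the value level: taking an arbitrary 256-bit value and clearing its high 128 bits. It shows the semantics I am after, not a way to get optimal code. The helper names `clear_upper_half` and `upper_half_is_cleared` are hypothetical, and as before I assume the GCC/Clang `target` attribute and use `_mm256_insertf128_si256` as a stand-in that is available on older compilers.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: copies 32 bytes from in to out with the upper
   16 bytes cleared. Assumption: GCC/Clang-style target attribute. */
__attribute__((target("avx")))
static void clear_upper_half(const uint8_t in[32], uint8_t out[32])
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)in);
    __m128i lo = _mm256_castsi256_si128(v);   /* keep the low 128 bits */
    /* Insert the low half into an all-zero value, so the high half is 0. */
    __m256i r  = _mm256_insertf128_si256(_mm256_setzero_si256(), lo, 0);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Returns 1 if the low 16 bytes are preserved and the high 16 are zero. */
static int upper_half_is_cleared(void)
{
    uint8_t in[32], out[32];
    for (int i = 0; i < 32; i++)
        in[i] = (uint8_t)(i + 1);             /* nonzero test pattern */
    clear_upper_half(in, out);
    for (int i = 0; i < 32; i++) {
        uint8_t expected = (i < 16) ? in[i] : 0;
        if (out[i] != expected)
            return 0;
    }
    return 1;
}
```

Where `_mm256_zextsi128_si256` is available, `_mm256_zextsi128_si256(_mm256_castsi256_si128(v))` expresses the same thing more directly; whether either form compiles to a single instruction is exactly what I am asking about.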
Am I missing something?
(Update)
I have filed bugs 45806 and 45808 against Clang and 94962 against GCC.