Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

Question

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:

  int _mm_movemask_epi8 (__m128i a);

This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.

According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In either case, only the lower 16 bits of the result can be nonzero, i.e. the result is always within the range [0; 65535].

As for the intrinsic function _mm_movemask_epi8, its return value is of type int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function that returns a 64-bit integer in x64 mode. As a result:

The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register when the register is later used as a 64-bit value (e.g. as an array index).

An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.

The questions are:

Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).

Regarding question #2, you can always drop down to inline assembly, although my memory is telling me that perhaps MSVC doesn't support that. — Jason R, Mar 15 '16 at 17:03
@JasonR: Yes, MSVC x64 has no inline assembly. Also, in MSVC x32 inline assembly inhibits optimization completely. But what is important in general case, using compiler instead of assembly allows composing efficient code from inlined functions. It allows even making highly configurable and readable code using C++ templates. Not to mention that writing in intrinsics is order of magnitude easier and faster for programmer and much more portable. — stgatilov, Mar 15 '16 at 17:19
I agree; intrinsics are certainly preferable. Regarding question #3, have you tried profiling it in your application? Unless this is in an inner loop, I doubt the extra instruction would have a meaningful impact, especially if you have memory accesses in there as well. — Jason R, Mar 15 '16 at 17:39
cdqe - extends 32-bit to 64-bit, executes with a 1 clock latency. If you are not touching the upper half in the register, I've seen gcc clear once (say outside a loop) and skip this. — ChipK, Mar 15 '16 at 17:56

Peter Cordes · Answer 1 · 2019-11-21T08:56:25.843

Why is this happening?

gcc's internal instruction definitions that tell it what pmovmskb does must be failing to inform it that the upper 32 bits of rax will always be zero. My guess is that it's treated like a function call return value, where the ABI allows a function returning a 32-bit int to leave garbage in the upper 32 bits of rax.

GCC does know about 32-bit operations in general zero-extending for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.

There's also the issue of gcc (not) knowing that the actual result has set bits only in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
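
To illustrate that last point, here is a minimal sketch, assuming an x86-64 target (where SSE2 is baseline). The helper names are made up for this example. Since the 128-bit mask never has any bit above bit 15 set, widening it to 64 bits by sign extension or by zero extension gives the same value:

```cpp
#include <emmintrin.h>  // SSE2: _mm_movemask_epi8, _mm_set1_epi8
#include <cassert>
#include <cstdint>

// Hypothetical helper: widen the int result directly (C++ sign-extends it).
int64_t widen_signed(__m128i v) {
    return static_cast<int64_t>(_mm_movemask_epi8(v));
}

// Hypothetical helper: go through uint32_t first (zero extension).
int64_t widen_unsigned(__m128i v) {
    return static_cast<int64_t>(static_cast<uint32_t>(_mm_movemask_epi8(v)));
}

// Both widenings agree even in the extreme case where all 16 sign bits are set.
void check_extension_equivalence() {
    const __m128i all_set = _mm_set1_epi8(-1);  // every byte is 0xFF
    assert(_mm_movemask_epi8(all_set) == 0xFFFF);
    assert(widen_signed(all_set) == widen_unsigned(all_set));
}
```

This only shows that the two forms are equivalent at the source level; whether a given compiler exploits that is a separate matter.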

Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];

No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?

clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.

gcc always either zero- or sign-extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination).

Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, but also returning it results in a mov rax, rdx.

With older GCC, the cast at least gets it to use mov instead of movsxd or cdqe.
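
A rough sketch of the unsigned cast applied to the array-index case from the question. The function name and the table are hypothetical, and the code actually generated depends on the compiler and its version, so check the assembly for your own toolchain:

```cpp
#include <emmintrin.h>  // SSE2: _mm_movemask_epi8
#include <cassert>
#include <cstdint>

// Hypothetical lookup: `table` is assumed to hold at least 65536 entries,
// one for each possible 16-bit mask.
uint8_t lookup_by_mask(const uint8_t* table, __m128i v) {
    // Cast to unsigned so the index is zero-extended rather than sign-extended.
    const unsigned idx = static_cast<unsigned>(_mm_movemask_epi8(v));
    assert(idx <= 0xFFFFu);  // documents the range; compiled out under NDEBUG
    return table[idx];
}
```

The cast does not change the value, because the mask is never negative. It only changes which kind of extension the compiler has to perform when forming the 64-bit address.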

What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).

mov same,same is never eliminated on SnB-family microarchitectures or AMD Zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.

Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.

It also costs an extra 1c of latency in the dep chain.

(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)

Paul R · Answer 2 · 2016-03-15T17:23:49.790

gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.

clang seems to do the best with this, e.g.

#include <emmintrin.h>  // SSE2 header that declares _mm_movemask_epi8
#include <cstdint>

int32_t test32(const __m128i v) {
  int32_t mask = _mm_movemask_epi8(v);
  return mask;
}

int64_t test64(const __m128i v) {
  int64_t mask = _mm_movemask_epi8(v);
  return mask;
}

generates:

test32(long long __vector(2)):                         # @test32(long long __vector(2))
        vpmovmskb       eax, xmm0
        ret

test64(long long __vector(2)):                         # @test64(long long __vector(2))
        vpmovmskb       eax, xmm0
        ret

Whereas gcc generates an extra cdqe instruction in the 64-bit case:

test32(long long __vector(2)):
        vpmovmskb       eax, xmm0
        ret
test64(long long __vector(2)):
        vpmovmskb       eax, xmm0
        cdqe
        ret
