The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general-purpose register in x64 mode. In either case, only the lower 16 bits of the result can be nonzero, i.e. the result is guaranteed to lie in the range [0; 65535].
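To make the range claim concrete, here is a minimal sketch that broadcasts one byte value to all 16 lanes and collects the mask. The helper name mask_of_broadcast is hypothetical and only used for illustration; it assumes an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: fill all 16 bytes of a vector with `byte`
   and return the mask built from the top bit of each byte. */
static int mask_of_broadcast(signed char byte) {
    __m128i v = _mm_set1_epi8((char)byte);
    return _mm_movemask_epi8(v);
}
```

With every top bit clear the mask is 0, and with every top bit set it is 65535, so no bit above bit 15 is ever set.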
As for the intrinsic function _mm_movemask_epi8, its return type is int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function that returns a 64-bit integer in x64 mode. As a result:
- The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
- The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
- The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register when the register is later used as a 64-bit value (e.g. as an array index).
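The pattern described above can be reduced to a small test case for inspecting compiler output. This is only a sketch: the table lut and both function names are hypothetical. The second variant merely casts the result to unsigned before indexing so the two can be compared side by side. Whether either variant produces the extra instruction depends on the compiler and its version and has to be checked in the generated assembly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical lookup table with one entry per possible 16-bit mask. */
static uint8_t lut[65536];

/* Variant A: use the int result directly as the array index. */
static uint8_t lookup_direct(__m128i v) {
    return lut[_mm_movemask_epi8(v)];
}

/* Variant B: cast the result to unsigned before indexing.
   This is a candidate to compare against, not a guaranteed fix. */
static uint8_t lookup_unsigned(__m128i v) {
    return lut[(unsigned)_mm_movemask_epi8(v)];
}
```

Both variants return the same value for every input; any difference would show up only in the instructions emitted around pmovmskb.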
An example of code and the generated assembly exhibiting this problem can be seen in this answer; the comments on that answer also contain some related discussion. I regularly run into this problem with the MSVC2013 compiler, but it seems to be present in GCC as well.
The questions are:
- Why is this happening?
- Is there any way to reliably avoid the generation of unnecessary instructions on popular compilers? In particular, when the result is used as an index, i.e. in
x = array[_mm_movemask_epi8(xmmValue)];
- What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that the CPU eliminates these instructions internally, so that they do not actually occupy execution units? (Agner Fog's instruction tables document mentions such a possibility.)