I am inputting the address of an index into a table into an extended inline assembly operation, but GCC is producing an extra lea
instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer
or -Os -f...
. GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64
(internally vpmovsxbq
) with a memory operand from a 8-byte table with the bits as index.
When I use the intrinsic, GCC produces the exactly same code as using the extended inline assembly.
I can directly embed the memory operation into the ASM template, but that would force RIP-relative addressing always (and I don't like forcing myself into workarounds).
typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
xmm2q mask2b(uint64_t mask) {
assert(mask < 4);
#ifdef USE_ASM
xmm2q result;
asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
return result;
#else
// bad cast (UB?), but input should be `uint16_t*` anyways
return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S
(with USE_ASM
and without):
__Z6mask2by: ## @_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq