
I am passing the address of an indexed table element into an extended inline assembly operand, but GCC produces an extra lea instruction when it isn't necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.

I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from an 8-byte table, using the bits as the index.

When I use the intrinsic, GCC produces exactly the same code as with the extended inline assembly.

I could embed the memory operand directly into the asm template, but that would always force RIP-relative addressing (and I don't like forcing myself into workarounds).

#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));

// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

xmm2q mask2b(uint64_t mask) {
    assert(mask < 4);
    #ifdef USE_ASM
        xmm2q result;
        asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
        return result;
    #else
        // bad cast (UB?), but input should be `uint16_t*` anyways
        return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
    #endif
}

Output assembly with -S (identical with and without USE_ASM):

__Z6mask2by:                            ## @_Z6mask2by
        .cfi_startproc
## %bb.0:
        leaq    __ZL10MASK_TABLE(%rip), %rax
        vpmovsxbq       (%rax,%rdi,2), %xmm0
        retq
        .cfi_endproc

What I was expecting (I've removed all the extra stuff):

__Z6mask2by:
        vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
        retq
Arav K.
  • What GCC version do you use? Check [this](https://gcc.godbolt.org/z/TBCgF4) – Victor Gubin Jul 11 '19 at 09:13
  • @VictorGubin: Godbolt GCC is configured with non-PIE as the default. The OP is clearly using a GCC configured with `--enable-default-pie`. See [32-bit absolute addresses no longer allowed in x86-64 Linux?](//stackoverflow.com/q/43367427) – Peter Cordes Jul 11 '19 at 09:16
  • In any case, I'm on a Macbook Pro with GCC version `Apple LLVM version 10.0.1 (clang-1001.0.46.4)`. Oh crap, it's just Clang in disguise. – Arav K. Jul 11 '19 at 09:25

1 Answer


The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.

(In machine code, 32-bit mode had 2 redundant ways to encode [disp32]; x86-64 repurposes the shorter no-SIB form as RIP-relative and keeps the longer SIB form for [sign_extended_disp32].)


If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.

In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.


Intrinsics

Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.

*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load, but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq, which has a 2-byte memory operand and therefore doesn't require alignment.

In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.

The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something like that, and then hope the compiler optimizes away the zero-extension from 2 to 4 bytes and folds the movd into a load when optimization is enabled. Or maybe try _mm_loadu_si32 for a 32-bit load, even though you really only want 16 bits. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw, for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
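
To make that concrete, here's a minimal sketch of that pattern (the function name and standalone includes are mine, not from the question):

#include <immintrin.h>
#include <stdint.h>

// Same 4-entry table as in the question.
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

__m128i mask2b_safe(uint64_t mask) {
    // Read the 16-bit entry as a plain integer (no aliasing cast), let it widen
    // to 32 bits for _mm_cvtsi32_si128, then sign-extend the low two bytes to
    // two quadwords. With optimization enabled, the hope is that the compiler
    // folds the movd into a memory source operand for vpmovsxbq (ICC19 does;
    // GCC/clang may leave a separate vmovd, as noted above).
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128(MASK_TABLE[mask]));
}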

I've written about this before.


Your integer -> vector strategy

Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.

A lookup table does work here with only a small amount of static data needed. If your mask range were any bigger, you'd maybe want to look into *is there an inverse instruction to the movemask instruction in intel avx2?* for alternative strategies like broadcast + AND + (shift or compare).
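
For reference, a minimal sketch of the broadcast + AND + compare idea for this 2-bit case (the function name and lane ordering are my assumptions; needs SSE4.1 for pcmpeqq):

#include <immintrin.h>
#include <stdint.h>

__m128i mask2b_broadcast(uint64_t mask) {
    __m128i v    = _mm_set1_epi64x((long long)mask);  // broadcast the integer mask
    __m128i bits = _mm_set_epi64x(2, 1);              // bit 1 selects the high lane, bit 0 the low lane
    __m128i sel  = _mm_and_si128(v, bits);            // isolate each lane's own bit
    return _mm_cmpeq_epi64(sel, bits);                // all-ones quadword where the bit was set
}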

If you do this often, using a whole cache line of 4x 16-byte vector constants might be best, so you don't need a pmovzx instruction at all: just index into an aligned table of xmm2q or __m128i vectors, which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
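
A sketch of that whole-cache-line table, assuming (per the comment exchange below) that the goal is quadword blend/sign masks and that bit 0 selects the low quadword; array and function names are mine:

#include <immintrin.h>
#include <stdint.h>

// All four 16-byte constants fit in one 64-byte-aligned cache line.
alignas(64) static const uint64_t MASKS[4][2] = {
    {     0,     0 },   // 0b00
    { ~0ULL,     0 },   // 0b01
    {     0, ~0ULL },   // 0b10
    { ~0ULL, ~0ULL },   // 0b11
};

static inline __m128i mask2b_table(uint64_t mask) {
    // One aligned 16-byte load; no pmovsx/pmovzx needed, and the constant can
    // also be used directly as a memory source operand of another instruction.
    return _mm_load_si128((const __m128i *)MASKS[mask]);
}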

You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).
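
A sketch of the pdep route (compile with -mbmi2; Intel-oriented advice since pdep is slow on AMD). Since the OP actually wants sign masks (see the comments), this variant deposits each mask bit into the sign bit of its own byte and uses pmovsx rather than pmovzx:

#include <immintrin.h>
#include <stdint.h>

__m128i mask2b_pdep(uint64_t mask) {
    // bit 0 -> bit 7 (sign bit of byte 0), bit 1 -> bit 15 (sign bit of byte 1)
    uint32_t bytes = (uint32_t)_pdep_u64(mask, 0x8080);
    // movd into an XMM register, then sign-extend the two low bytes to quadwords.
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128((int)bytes));
}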

Peter Cordes
  • I need `pmovsx` so that the sign bits are set for both quadwords so that I can perform AVX2 blends (`vblendvpd`) later. When I said masks, I meant _sign_ masks. Sorry about the confusion. – Arav K. Jul 11 '19 at 09:18
  • @AravK.: The example in your question is sign-extending `0x01` to `0x0000000000000001`. Did you mean to use `0xFF` -> `0xFFFFFF...`? – Peter Cordes Jul 11 '19 at 09:20
  • Thanks for explaining the memory addressing! I had believed that _any_ register could be used in the `offset(base,index,scale)` form. – Arav K. Jul 11 '19 at 09:21
  • oh yeah, thanks for pointing that out! I didn't realize that I had set the low bits. I'll fix that now. – Arav K. Jul 11 '19 at 09:22
  • Also: Any chance that GCC can optimize away the `pmovsxbq` and modify the table to use expanded constants (given that it is `static const`)? Or is that too much? – Arav K. Jul 11 '19 at 09:29
  • @AravK.: RIP doesn't really count as a register. The `offset(%rip)` syntax is basically separate from `offset(base,idx,scale)` with GP registers. – Peter Cordes Jul 11 '19 at 09:32
  • @AravK.: GCC could in theory optimize your intrinsics source to a table of 4x 16-byte vectors, but in practice I don't think it looks for that transformation. If you want that, you should simply write it that way in your source with a statically-initialized array of `alignas(64) static const xmm2 masks[] = { {0,0}, {0,-1ULL}, ...};` – Peter Cordes Jul 11 '19 at 09:34
  • I was wondering about how well GCC can optimize; I would have done it myself. Note that GCC cannot optimize away the 32 -> 128 -> `pmovsx` safe form - it leaves a `vmovd` behind (in `-Ofast`). I'm sticking with the inline assembly version. – Arav K. Jul 11 '19 at 09:39
  • @AravK.: There's a balance between how much and what kind of transformations are reasonable. Like if you wrote an InsertionSort, would you want the compiler to *always* replace it with a QuickSort? Even moreso for changing the size vs. speed tradeoff from the source by using fewer or cheaper instructions but a larger table. That's very different from the missed optimization of not folding a `vmovd` load into a memory source for `vpmovsx`. – Peter Cordes Jul 11 '19 at 09:47
  • Sorry, I just wanted to let you know about the `vmovd` (and not relate it to the table optimization). I agree completely. – Arav K. Jul 11 '19 at 10:09