
I've got an AVX kernel I wrote to do complex conjugate multiplies:

__attribute__((noinline))
static __attribute__((target("avx"))) void asm_vcmulcc(
    cfloat* __restrict__ cc, const cfloat* __restrict__ aa, const cfloat* __restrict__ bb, ssize_t size) {

    ssize_t iters = size/4;
    ssize_t rem   = size-iters*4;

    __asm__(
        ".section .rodata # constant section\n\t"
        ".align 32        # 32 byte alignment\n\t"
        "LC%=:\n\t" 
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        "     .long 0x80000000\n\t"
        ""
        ".text\n\t"
        "     vmovaps   LC%=(%%rip), %%ymm4\n\t"
        "     xorl      %%eax,  %%eax\n\t"
        ""
        ".p2align 4\n\t"
        "LOOP%=:\n\t"
        "     vmovups   (%[bb],%%rax,1), %%ymm3\n\t"
        "     vmovups   (%[aa],%%rax,1), %%ymm1\n\t"
        "     vpermilps $0xa0,  %%ymm1,  %%ymm2\n\t"
        "     vpermilps $0xf5,  %%ymm1,  %%ymm0\n\t"               
        "     vmulps    %%ymm3, %%ymm2,  %%ymm2\n\t"
        "     vxorps    %%ymm4, %%ymm0,  %%ymm0\n\t"
        "     vpermilps $0xb1,  %%ymm3,  %%ymm3\n\t"
        "     vmulps    %%ymm3, %%ymm0,  %%ymm0\n\t"
        "     vaddsubps %%ymm0, %%ymm2,  %%ymm0\n\t"
        "     vmovups   %%ymm0, (%[cc],%%rax,1)\n\t"
        "     addq      $32,      %%rax\n\t"
        "     cmpq      %[bytes], %%rax\n\t"
        "     jl        LOOP%=\n\t"
        :
        : [aa] "r" (aa), [bb] "r" (bb), [cc] "r" (cc), [bytes] "r" (iters*4*sizeof(cfloat))
        : "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "rax", "memory"
    );

    if (rem > 0) {
        aa += iters*4;
        bb += iters*4;
        cc += iters*4;

        for (ssize_t ii=0; ii < rem; ii++) {
            cc[ii] = conj(aa[ii])*bb[ii];
        }
    }
}

This works great with Intel compilers and gcc >= 5, but gcc < 5 errors out (this is g++ 4.8.5):

> g++ -std=c++0x -I. -c -mavx lib.cc -O3 -o lib.o
lib.cc: In function ‘void avx_vcmulcc(prelude::{anonymous}::cfloat*, const cfloat*, const cfloat*, int)’:
lib.cc:80:6: error: unknown register name ‘ymm4’ in ‘asm’
     );
      ^
lib.cc:80:6: error: unknown register name ‘ymm3’ in ‘asm’
lib.cc:80:6: error: unknown register name ‘ymm2’ in ‘asm’
lib.cc:80:6: error: unknown register name ‘ymm1’ in ‘asm’
lib.cc:80:6: error: unknown register name ‘ymm0’ in ‘asm’

With or without the -mavx option. Apparently the compiler is allowed to emit AVX, but won't let it pass through unmolested? Is there a hidden option somewhere to suppress this?

Peter Mortensen
gct
  • It looks like anything earlier than GCC 4.9 doesn't have support for `ymm` registers in the clobber list. Seems to me that is a deficiency in the compiler – Michael Petch Jul 27 '17 at 19:01
  • @MichaelPetch I was just browsing the 4.8.5 source and discovering the same thing. Surely, specifying the equivalent XMM registers would be sufficient, no? – gct Jul 27 '17 at 19:08
  • You could use an `[bytes] "re" (...)` constraint to allow [a sign-extended 32-bit immediate](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints) or register instead of forcing gcc to put a compile-time constant in a register. You could also use an `[idx] "=r" (dummy)` constraint to let the compiler pick a scratch reg for you instead of hard-coding `rax`. – Peter Cordes Jul 28 '17 at 00:35
  • Or better, use pointer increments so you can use a non-indexed addressing mode for the store (so it stays micro-fused [instead of un-laminating on SnB/IvB](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes), and the store-address uop can run on p7 on Haswell+). You can do the loads relative to the destination: `a_offset = aa - cc;`, then load from `cc[a_offset]` and `cc[b_offset]`. An alternative is to count your index up towards zero, so you can loop on the flags set by `add $32, %[idx]` without a separate `cmp`. – Peter Cordes Jul 28 '17 at 00:38
  • @PeterCordes Thanks for the tips, I did the dummy trick (without the = sign, it didn't like that on the input list, and "dummy" is read only..). I don't think the immediate thing will work, since bytes isn't a constant. Need to do some research to understand the addressing stuff =D – gct Jul 28 '17 at 12:57
  • @SeanMcAllister: If you want a scratch reg you can modify, it has to be an output! – Peter Cordes Jul 28 '17 at 15:22
  • Using `"er"` lets gcc choose to use an immediate if, after inlining and constant-propagation, it does have the value at at compile time. Otherwise it uses a register, since `r` is also in the constraint list. Maybe that never happens in your use-case, but maybe it does, and it doesn't hurt. You generally want to give the compiler as much flexibility as possible. (You could even use `"erm"`, but don't: you do want to force the compiler to load from memory into a register outside the loop.) – Peter Cordes Jul 28 '17 at 15:25
  • Ah OK, changed that then. I misunderstood about the dummy register and thought "dummy" was a special value. I think I was just passing in a pointer to a const string though =O. If I put it on the output list with a dummy var, GCC just elides my assembly though... – gct Jul 28 '17 at 19:21

1 Answer


You need to specify the XMM registers instead: from the compiler's perspective, those are the registers being clobbered, because the compiler does not know anything about the YMM registers. You should probably add a compiler conditional and use the YMM registers on compilers which support them, because it is theoretically possible to use the XMM registers without disturbing the YMM-only upper halves (using SSE2 instructions), and a future GCC version might exploit that information (although that seems unlikely). This is alluded to in this document on transition penalties. More details are in the Intel Advanced Vector Extensions Programming Reference.

Note that you could define the `.long` array as a static array with a `used` attribute, and reference that from the inline assembly. This would avoid duplication of the constant in case the inline assembly statement is duplicated by the compiler or used multiple times. (Alternatively, you could use an `m` input operand for the array and drop the `used` attribute on the array, which also has the advantage that it will work correctly in large memory models.)

Florian Weimer
  • I think that's the conclusion I've come to as well, do you have documentation on the XMM clobber clearing the upper bits of the register? I'm not too worried about the constant, this shouldn't get inlined and it'd have to happen many many times before I was concerned about it. – gct Jul 27 '17 at 19:17
  • Looking at gcc 5.1: in /gcc/varasm.c, the decode_reg_name_and_count function looks at the ADDITIONAL_REGISTER_NAMES table where the ymm registers are defined; it simply aliases each one back to the equivalent xmm register, so I'm satisfied, I think. – gct Jul 27 '17 at 19:28
  • Right. I clarified the XMM/YMM interactions and added two links. – Florian Weimer Jul 27 '17 at 19:29
  • Instead of a `used` attribute, you could also just pass in the array as an input operand, class `m`. This should be resolved to a suitable memory operand. – fuz Jul 27 '17 at 22:33