In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?

Question

While trying to answer Embedded broadcasts with intrinsics and assembly, I was trying to do something like this:

__m512 mul_bcast(__m512 a, float b) {
    asm(
        "vbroadcastss  %k[scalar], %q[scalar]\n\t"  // want  vbcast..  %xmm0, %zmm0
        "vmulps        %q[scalar], %[vec], %[vec]\n\t"
        : [vec] "+x" (a), [scalar] "+&x" (b)
        : :
    );
    return a;
}

The GNU C x86 Operand Modifiers doc only specifies modifiers up to q (DI (DoubleInt) size, 64bits). Using q on a vector register will always bring it down to xmm (from ymm or zmm). e.g. scalar registers:

 long scratch = 0;  // not useful instructions, just syntax demo
 asm(
     "movw         symbol(%q[inttmp]), %w[inttmp]\n\t"  // movw symbol(%rax), %ax
     "movsbl        %h[inttmp], %k[inttmp]\n\t"     // movsx %ah, %eax
   :  [inttmp] "+r" (scratch)
   :: "memory"  // we read some index in symbol[]
 );

The question:

What are the modifiers to change between sizes of vector register?

Also, are there any specific-size constraints for use with input or output operands? Something other than the generic x which can end up being xmm, ymm, or zmm depending on the type of the expression you put in the parentheses.

Off-topic:
clang appears to have some Yi / Yt constraints (not modifiers), but I can't find docs on that either. clang won't even compile this, even with the vector instructions commented out, because it doesn't like +x as a constraint for an __m512 vector.

Background / motivation

I can get the result I want by passing in the scalar as an input operand, constrained to be in the same register as a wider output operand, but it's clumsier. (The biggest downside for this use-case is that AFAIK the matching constraint can only reference by operand-number, rather than the [symbolic_name], so it's susceptible to breakage when adding/removing output constraints.)

// does what I want, by using a paired output and input constraint
__m512 mul_bcast(__m512 a, float b) {
    __m512 tmpvec;
    asm(
        "vbroadcastss  %[scalar], %[tmpvec]\n\t"
        "vmulps        %[tmpvec], %[vec], %[vec]\n\t"
        : [vec] "+x" (a), [tmpvec] "=&x" (tmpvec)
        : [scalar] "1" (b)
        :
    );

  return a;
}

On the Godbolt compiler explorer

Also, I think this whole approach to the problem I was trying to solve is going to be a dead end because Multi-Alternative constraints don't let you give different asm for the different constraint patterns. I was hoping to have x and r constraints end up emitting a vbroadcastss from a register, while m constraints end up emitting vmulps (mem_src){1to16}, %zmm_src2, %zmm_dst (a folded broadcast-load). The purpose of doing this with inline asm is that gcc doesn't yet know how to fold set1() memory operands into broadcast-loads (but clang does).

Anyway, this specific question is about operand modifiers and constraints for vector registers. Please focus on that, but comments and asides in answers are welcome on the other issue. (Or better, just comment / answer on Z Boson's question about embedded broadcasts.)

Also, you don't have to use operand numbers when matching inputs to outputs: `asm("" : [me] "=a" (a) : "[me]"(7));`. — David Wohlferd, Dec 25 '15 at 06:01
@DavidWohlferd: Thanks! I'm really glad to know about the `"[me]"` syntax. That was a major objection to the matching-output-constraint method. — Peter Cordes, Dec 25 '15 at 06:16
When Anger said that the syntax for GCC inline assembly was elaborate and difficult to learn he was not kidding. I felt I more or less got NASM after a few days and could figure out anything else from the documentation but GCC inliene assembly in some cases is still confusing. I don't actually mind AT&T syntax that much but the GCC extended syntax is complicated. — Z boson, Dec 25 '15 at 19:11
@Zboson The official documentation is better than it used to be. Before half of it was hidden away in the GCC internals documentation. The tricky part is that you need to describe every effect and side-effect your asm statement has, and it can be easy to overlook something. — Ross Ridge, Dec 25 '15 at 23:04
@firo: That link is already in the question, and still doesn't document the `%g0` modifier to get the ZMM name of an input or output `"x"(__m256)` — Peter Cordes, Sep 21 '18 at 00:17

score 9 · Accepted Answer · answered Dec 25 '15 at 05:52

From the file gcc/config/i386/i386.c of the GCC sources:

       b -- print the QImode name of the register for the indicated operand.
        %b0 would print %al if operands[0] is reg 0.
       w --  likewise, print the HImode name of the register.
       k --  likewise, print the SImode name of the register.
       q --  likewise, print the DImode name of the register.
       x --  likewise, print the V4SFmode name of the register.
       t --  likewise, print the V8SFmode name of the register.
       g --  likewise, print the V16SFmode name of the register.
       h -- print the QImode name for a "high" register, either ah, bh, ch or dh.

Similarly from gcc/config/i386/contraints.md:

    ;; We use the Y prefix to denote any number of conditional register sets:
    ;;  z   First SSE register.
    ;;  i   SSE2 inter-unit moves to SSE register enabled
    ;;  j   SSE2 inter-unit moves from SSE register enabled
    ;;  m   MMX inter-unit moves to MMX register enabled
    ;;  n   MMX inter-unit moves from MMX register enabled
    ;;  a   Integer register when zero extensions with AND are disabled
    ;;  p   Integer register when TARGET_PARTIAL_REG_STALL is disabled
    ;;  f   x87 register when 80387 floating point arithmetic is enabled
    ;;  r   SSE regs not requiring REX prefix when prefixes avoidance is enabled
    ;;  and all SSE regs otherwise

This file also defines a "Yk" constraint but I don't know if how well it would work in an asm statement:

    (define_register_constraint "Yk" "TARGET_AVX512F ? MASK_EVEX_REGS : NO_REGS"
    "@internal Any mask register that can be used as predicate, i.e. k1-k7.")

Note this is all copied from the latest SVN revision. I don't know what release of GCC, if any, the particular modifiers and constraints you're interested in were added.

Works great in [gcc 5.3 on godbolt](http://goo.gl/8B7O7T). Except for spuriously generating a stack frame and a redundant push/pop of `%r10`. Looks similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69041 (which I reported yesterday), but it's affecting a 64b target not just `-m32`. — Peter Cordes, Dec 25 '15 at 06:28

Nathan Kurz · Answer 2 · 2016-01-16T17:01:23.247

3

It seems like all recent versions of GCC will accept both 'q' and 'x' as modifiers to print the XMM version of a YMM register.

Intel's icc looks to accept 'q', but not 'x' (at least through version 13.0.1).

[Edit: Well, it worked in this small example below, but in a real test case, I'm having problems with icc 14.0.3 accepting the 'q' but writing a 'ymm'.]

[Edit: Testing with more recent versions of icc, I'm finding that neither icc 15 nor icc 16 work with either 'q' or 'x'.]

But Clang 3.6 and earlier accept neither syntax. And at least on Godbolt, Clang 3.7 crashes with both!

// inline assembly modifiers to convert ymm to xmm

#include <x86intrin.h>
#include <stdint.h>

// gcc also accepts "%q1" as "%x1" 
// icc accepts "%q1" but not "%x1"
// clang-3.6 accepts neither
// clang-3.7 crashes with both!

#define ASM_MOVD(vec, reg)       \
__asm volatile("vmovd %q1, %0" : \
               "=r" (reg) :      \
               "x" (vec)         \
    );          

uint32_t movd_ymm(__m256i ymm) {
   uint32_t low;
   ASM_MOVD(ymm, low);
   return low;
}

uint32_t movd_xmm(__m128i xmm) {
   uint32_t low;
   ASM_MOVD(xmm, low);
   return low;
}

Link to test on Godbolt: http://goo.gl/bOkjNu

(Sorry that this isn't full answer to your question, but it seemed like useful information to share and was too long for a comment)

edited Jan 16 '16 at 17:01

answered Jan 12 '16 at 04:22

Nathan Kurz

1,649
1
14
28

(Got here randomly from somewhere else) This code is actually subtly wrong - gcc is basically taking what you have coming in and printing out "something": vmovd %xmm0, %eax However, you've got the output modifier on the xmm register rather than the integer register. If you swap those then you'll get the right output of "rax" on 64-bit here. You also want a "y" constraint for the ymm register case. – echristo Apr 17 '18 at 23:37
I haven't thought about this for a while, but I don't think that your correction is correct. It's not a mistake that the "q" modifier is on the XMM register: the goal is to find a syntax that will modify a passed YMM register and output assembly for the corresponding XMM on ICC, Clang, and GCC. And VMOVD requires a 32 bit integer register (as opposed to VMOVQ): https://www.felixcloutier.com/x86/MOVD:MOVQ.html. But maybe I'm not following you correctly. Could you link to a test on Godbolt showing exactly what you are suggesting? – Nathan Kurz Apr 19 '18 at 00:44
So, you're definitely right that I was mistaken, honestly in lots of ways. I should have been more careful. It looks like what you might want is the 'x' modifier: https://godbolt.org/g/mxRBVd which will treat the operand like it's a V4SF type and print out the right thing - at least in gcc. It's not currently working in clang (file a bug and I'll try to get to it) and is also not working in the most up to date that compiler explorer has :( Otherwise you might want to try the corresponding intrinsic? At any rate, sorry for the confusion and hope this helps a bit. – echristo May 21 '18 at 09:19

In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?

The question:

Background / motivation

2 Answers2

Linked