Embedded broadcasts with intrinsics and assembly

Question

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference the we learn than AVX512 (and Knights Corner) has

a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that load data from memory and perform some computational or data movement operation.

For example using Intel assembly syntax we can broadcast the scalar at the address stored in rax and then multiplying with the 16 floats in zmm2 and write the result to zmm1 like this

vmulps zmm1, zmm2, [rax] {1to16}

However, there are no intrinsics which can do this. Therefore, with intrinsics the compiler should be able to fold

__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);

to a single instruction

vmulps zmm1, zmm2, [rax] {1to16}

but I have not observed GCC doing this. I found a GCC bug report about this.

I have observed something similar with FMA with GCC. e.g. GCC 4.9 will not collapse _mm256_add_ps(_mm256_mul_ps(areg0,breg0) to a single fma instruction with -Ofast. However, GCC 5.1 does collapse it to a single fma now. At least there are intrinsics to do this with FMA e.g. _mm256_fmadd_ps. But there is no e.g. _mm512_mulbroad_ps(vector,scalar) intrinsic.

GCC may fix this at some point but until then assembly is the only solution.

So my question is how to do this with inline assembly in GCC?

I think I may have come up with the correct syntax (but I am not sure) for GCC inline assembly for the example above.

"vmulps        (%%rax)%{1to16}, %%zmm1, %%zmm2\n\t"

I am really looking for a function like this

static inline __m512 mul_broad(__m512 a, float b) {
    return a*b;
}

where if b is in memory point to in rax it produces

vmulps        (%rax){1to16}, %zmm0, %zmm0
ret

and if b is in xmm1 it produces

vbroadcastss    %xmm1, %zmm1
vmulps          %zmm1, %zmm0, %zmm0
ret

GCC will already do the vbroadcastss-from-register case with intrinsics, but if b is in memory, compiles this to a vbroadcastss from memory.

__m512 mul_broad(__m512 a, float b) {       
    __m512 bb = _mm512_set1_ps(b);
    __m512 ab = _mm512_mul_ps(a,bb);
    return ab;
}

clang will use a broadcast memory operand if b is in memory.

I put your last intrinsics function on godbolt. With `-m32` (so `b` is in memory), [clang uses a broadcast-load](http://goo.gl/gdrhXM). gcc uses `vbroadcastss`. (And appears to be broken, because it does a useless `push ecx / lea ecx, ... / pop ecx`) Maybe it's trying to align the stack temporarily? At `-O1`, gcc uses `ecx` after the `lea`. — Peter Cordes, Dec 22 '15 at 11:57
@PeterCordes, sheesh...Clang wins again! I can't believe I did not try Clang. How can I tell Clang/GCC that `b` is in memory in 64-bit mode? — Z boson, Dec 22 '15 at 12:02
Probably make a version of the function with a `float *pb` arg. — Peter Cordes, Dec 22 '15 at 12:21
@PeterCordes, yeah that works. I guess I wanted to simulate that with `static inline` but that shows what I want. — Z boson, Dec 22 '15 at 12:24
Clang does not like the assembly syntax " invalid % escape in inline assembly string" in `vmulps (%%rdi)%{1to16%}, %%zmm0, %%zmm0\n\t"`. — Z boson, Dec 22 '15 at 12:32
Collapsing an add/mul intrinsic pair to FMA would be completely wrong, so it's a good thing it doesn't do that. — harold, Dec 22 '15 at 12:56
@harold: Surprisingly, [gcc actually *does* do it, even without `-ffast-math`!](http://goo.gl/qUfqs0). gcc always tries to take advantage of any FMA hardware support you tell it about. `clang` only fuses add and mul intrinsics together with `-ffast-math`. I guess gcc doesn't worry about keeping extra precision, beyond what the C standard requires. I haven't read up on `FLT_EVAL_METHOD` or whatever it is recently. — Peter Cordes, Dec 22 '15 at 13:12
@PeterCordes, arggh...GCC 4.9 does not do it but GCC 5.1 does. Apparently it has been fixed. — Z boson, Dec 22 '15 at 13:16
@PeterCordes, I guess we just have to wait for GCC to fix the broadcast memory. AVX512 is not even out yet and GCC does not support intrinsic for KNC. Meanwhile Clang does not appear to support the {1to16} syntax with inline assembly but GCC does. — Z boson, Dec 22 '15 at 13:17
@PeterCordes, note that using FMA is not losing precision more precision than without FMA. If anything it's better since it's a single rounding mode rather than two rounding modes. I'm not sure what the rules for C are suppose to be. I think to be IEEE compliant it needs to be two rounding modes. So it's not even correct to say a looser or relaxed floating point model (e.g. `-ffast-math`) is needed. Just that a different floating point model is needed for FMA. Apparently GCC does not even require a different floating point model for FMA. — Z boson, Dec 22 '15 at 14:12
@Zboson: I meant "gcc doesn't worry about dropping the rounding step between the mul and add", but I see now how what I said was ambiguous. I thought [strict FP rules](http://en.cppreference.com/w/c/types/limits/FLT_EVAL_METHOD) required the compiler to at least be consistent, but it turns out that by default, compilers are explicitly allowed to "contract" things, effectively keeping infinite precision for temporaries. I put some code on godbolt for [looking at `FLT_EVAL_METHOD` with `-m32`, and the effect of `-mfpmath=sse`](http://goo.gl/UAy74l). (F_E_M = 2 or 0, with x87 / with SSE). — Peter Cordes, Dec 22 '15 at 22:37
I think gcc has support for multiple alternatives, so you can give it multiple patterns with different constraints, and it will pick the code for the version that has the set of constraints it can match most cheaply. I ran into trouble trying to find the syntax for passing in a variable in an xmm register, and referencing that same register with a different width. (like `%q[int_var]` to emit `%rax` instead of `%eax`). The GCC manual only documents the prefixes for integer registers. — Peter Cordes, Dec 23 '15 at 11:56
@PeterCordes, post an answer with some code please. You are much better with inline assembly (and assembly in general) than me so I will learn from what ever you post. — Z boson, Dec 23 '15 at 12:48
I was going to post one. I might still do so :P I've never used the alternatives stuff, I've just seen it in the manual. I ran into trouble with the broadcast-from-register version, trying to get `vpbroadcastss %[scalar], %%zmm_of_the_same_register` emitted. (Using `scalar` as an input/output operand. Hmm, that'll make worse code for an inlined function, unless the parameter comes by non-const reference. Oh, actually I could just lie to gcc and tell it that I don't write the input operand holding the scalar. But I'm worried about breakage when used on the low elem of a non-scalar vec. — Peter Cordes, Dec 23 '15 at 12:57
@PeterCordes, http://stackoverflow.com/questions/34436233/fused-multiply-add-and-default-rounding-modes — Z boson, Dec 23 '15 at 12:58
http://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-different-sizes-vector-registers. However, my general idea is doomed: I finally got around to checking [the docs on multi-alternative constraints](https://gcc.gnu.org/onlinedocs/gcc/Multi-Alternative.html#Multi-Alternative) It's not going to work: you don't get to specify a different template for the different constraint patterns. That's what I was thinking would work. I'd need an `if` on something like `__builtin_constant_p(scalar)`, but testing if it needs to be loaded or not. — Peter Cordes, Dec 25 '15 at 03:51
@PeterCordes maybe you could add your sources for GCC x86 inline assembly to the x86 tag? I find the documentation difficult. There seems to be several fragments scattered about that explain a few peices of the puzzle but not single document which describes gcc inline assembly well. What do you use? — Z boson, Dec 25 '15 at 19:27
@Zboson: Just the docs themselves. The key is to understand that it's designed for wrapping single instructions that the compiler can't use directly. Writing sequences or loops works, of course, but wording like the "early-clobber" description talking about "the instruction writes... before reading all its other operands" is talking about the use-case of a single instruction. The purpose of inline asm is to describe the asm to the compiler, so it can slot it into the basic block it's part of and actually optimize around it. I figured that out on my own; I didn't read it anywhere. — Peter Cordes, Dec 25 '15 at 19:35

score 5 · Accepted Answer · answered Dec 25 '15 at 22:52

As Peter Cordes notes GCC doesn't let you specify a different template for different constraint alternatives. So instead my solution has the assembler choose the correct instruction according to the operands chosen.

I don't have a version of GCC that supports the ZMM registers, so this following example uses XMM registers and a couple of nonexistent instructions to demonstrate how you can achieve what you're looking for.

typedef __attribute__((vector_size(16))) float v4sf;

v4sf
foo(v4sf a, float b) {
    v4sf ret;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%xmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 8\n\t"
        "ifxmm <%2>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "alt-1 %1, %2, %0\n\t"
        ".else\n\t"
        "alt-2 %1, %2, %0\n\t"
        ".endif\n\t"
        : "=x,x" (ret)
        : "x,x" (a), "x,m" (b));
    return ret;
}


v4sf
bar(v4sf a, v4sf b) {
    return foo(a, b[0]);
}

This example should be compiled with gcc -m32 -msse -O3 and should generate two assembler error messages similar to the following:

t103.c: Assembler messages:
t103.c:24: Error: no such instruction: `alt-2 %xmm0,4(%esp),%xmm0'
t103.c:22: Error: no such instruction: `alt-1 %xmm0,%xmm1,%xmm0'

The basic idea here is the assembler checks to see whether the second operand (%2) is an XMM register or something else, presumably a memory location. Since the GNU assembler doesn't support much in the way of operations on strings, the second operand is compared to every possible XMM register one at a time in a .rept loop. The isxmm macro is used to paste %xmm and a register number together.

For your specific problem you'd probably need to rewrite it something like this:

__m512
mul_broad(__m512 a, float b) {
    __m512 ret;
    __m512 dummy;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%zmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 32\n\t"
        "ifxmm <%[b]>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "vbroadcastss %x[b], %[b]\n\t"
        "vmulps %[a], %[b], %[ret]\n\t"
        ".else\n\t"
        "vmulps %[b] %{1to16%}, %[a], %[ret]\n\t"
        "# dummy = %[dummy]\n\t"
        ".endif\n\t"
        : [ret] "=x,x" (ret), [dummy] "=xm,x" (dummy)
        : [a] "x,xm" (a), [b] "m,[dummy]" (b));
    return ret;
}

Thanks! I will try this out in the next few days and then get back to you. — Z boson, Dec 28 '15 at 10:31

Embedded broadcasts with intrinsics and assembly

1 Answers1

Linked