It varies by assembler. Some support the {1to16} / {1to8} / {1to4} syntax used in the slides from a 2014 talk introducing AVX-512 at a GCC conference, by Kirill Yukhin of Intel. (Despite being a GCC talk, the slides use Intel syntax.) Others support that and/or something else.
MASM: vminps zmm1, zmm2, DWORD bcst [rax]
NASM: vminps zmm1, zmm2, [rax] {1to16} (an optional dword or qword size specifier goes in the usual place, like dword [rax]{1to16}). NASM does not support the bcst keyword.
€ASM aka Euro Assembler: vminps ymm1,ymm2,[rax],Bcst=on
GAS/clang .intel_syntax: MASM-like in general, and supports dword bcst [rax], but also [rax]{1to16}. (objdump -drwC -Mintel uses dword bcst [rax].)
AT&T syntax: vminps (%rax){1to16},%zmm2,%zmm1
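To make the differences concrete, here is a small Python sketch that spells the same 32- or 64-bit broadcast memory operand in each syntax listed above. The function name bcst_operand and the syntax labels ("masm", "nasm", "att", "euroasm") are hypothetical, not part of any assembler; the output strings simply mirror the examples in this answer, and the {1toN} count is the vector width divided by the element size.

```python
# Hypothetical helper: not an assembler API, just a table of the spellings above.
SIZE_KEYWORD = {32: "dword", 64: "qword"}

def bcst_operand(syntax: str, base: str, elem_bits: int, vec_bits: int) -> str:
    """Spell a broadcast memory operand like the examples in this answer."""
    if elem_bits not in SIZE_KEYWORD:
        raise ValueError("only 32- and 64-bit broadcast elements are modelled")
    if vec_bits not in (128, 256, 512):
        raise ValueError("vector width must be 128, 256 or 512 bits")
    count = vec_bits // elem_bits         # dword: zmm -> 16, ymm -> 8, xmm -> 4
    decorator = "{1to%d}" % count
    size = SIZE_KEYWORD[elem_bits]
    if syntax == "masm":                  # GAS/clang .intel_syntax also accepts this
        return f"{size} bcst [{base}]"
    if syntax == "nasm":                  # the size keyword is optional in NASM
        return f"{size} [{base}]{decorator}"
    if syntax == "att":
        return f"(%{base}){decorator}"
    if syntax == "euroasm":               # broadcast is a separate Bcst=on operand
        return f"[{base}],Bcst=on"
    raise ValueError(f"unknown syntax: {syntax!r}")

for s in ("masm", "nasm", "att"):
    print(s, bcst_operand(s, "rax", 32, 512))
# masm dword bcst [rax]
# nasm dword [rax]{1to16}
# att (%rax){1to16}
```

Only the NASM and AT&T spellings carry the element count; the MASM form carries the element size instead, and €ASM carries neither.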
The machine code has only 1 bit (EVEX.b) to encode broadcast vs. regular, so there's no way to broadcast 64-bit pairs of floats for vminps; the broadcast element size has to match the SIMD element size. So €ASM's minimal syntax is sufficient; the others merely give the assembler a way to catch a mismatch between what the human thinks the instruction will do and what it actually does.
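A rough Python model of what the broadcast form of vminps computes may help. Only one 32-bit float is read from memory and replicated into every lane, so the lane count follows from the vector width and there is nothing left to encode except "broadcast or not". The names load_f32 and vminps_bcst are hypothetical; the comparison follows MINPS's rule of returning the first source only if it is strictly less than the second, so NaN inputs and ±0 ties yield the second operand.

```python
import struct
from typing import List

def load_f32(mem: bytes) -> float:
    """Read the single little-endian float32 that a {1toN} operand fetches."""
    return struct.unpack_from("<f", mem, 0)[0]

def vminps_bcst(src1: List[float], mem: bytes) -> List[float]:
    """Model `vminps dst, src1, dword bcst [mem]` for any vector width.

    len(src1) is the lane count: 16 for zmm, 8 for ymm, 4 for xmm.
    Python floats are doubles; src1 values are assumed to be exactly
    representable as float32.
    """
    scalar = load_f32(mem)                 # only 4 bytes are read
    src2 = [scalar] * len(src1)            # broadcast to every lane
    # MINPS rule: src1 if src1 < src2, else src2 (NaN or +-0 ties give src2)
    return [a if a < b else b for a, b in zip(src1, src2)]

# 16 lanes (zmm): each lane becomes min(lane, 2.5)
print(vminps_bcst([float(i) for i in range(16)], struct.pack("<f", 2.5)))

# Storing a pair of floats does not broadcast the pair: the second float is ignored.
print(vminps_bcst([0.0] * 4, struct.pack("<2f", 1.0, -1.0)))  # [0.0, 0.0, 0.0, 0.0]
```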
Unlike embedded rounding + suppress-all-exceptions, which only work with scalar (like vmulss) or 512-bit instructions (footnote 1), broadcast memory operands also work with 256- and 128-bit vectors (AVX512VL).
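The vector-length rules in the previous paragraph can be summarized as a small checker. This is a hypothetical sketch (not an assembler or CPUID API) for AVX512F instructions such as vminps and vmulps, as AVX-512 was originally defined: broadcast is accepted at every width, with the 128- and 256-bit forms needing AVX512VL, while embedded rounding/SAE is rejected unless the instruction is scalar or 512-bit. It also rejects combining the two, because the single EVEX.b bit means broadcast when the source is memory and rounding/SAE when all operands are registers, and it rejects broadcast on scalar instructions, whose memory operand is a plain m32/m64.

```python
from typing import Tuple

def required_features(vec_bits: int, *, broadcast: bool = False,
                      rounding: bool = False, scalar: bool = False) -> Tuple[str, ...]:
    """Hypothetical check for an AVX512F instruction such as vminps or vmulps.

    Returns the CPU feature names needed, or raises ValueError for an
    operand combination that AVX-512 (as originally defined) cannot encode.
    """
    if vec_bits not in (128, 256, 512):
        raise ValueError("EVEX vectors are 128, 256 or 512 bits")
    if broadcast and rounding:
        # one EVEX.b bit: broadcast with a memory source, rounding/SAE with registers
        raise ValueError("cannot combine broadcast and embedded rounding")
    if broadcast and scalar:
        raise ValueError("scalar instructions take a plain m32/m64 operand")
    if rounding and not (scalar or vec_bits == 512):
        raise ValueError("embedded rounding/SAE is scalar or 512-bit only")
    if scalar or vec_bits == 512:
        return ("AVX512F",)
    return ("AVX512F", "AVX512VL")        # 128/256-bit EVEX forms, broadcast or not

print(required_features(256, broadcast=True))  # ('AVX512F', 'AVX512VL')
```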
Broadcast element sizes of 32 and 64 bits are supported; not coincidentally, those are the element sizes that the load ports on Intel CPUs can broadcast for free as part of a load uop. (Note that vpbroadcastb/w vec, [mem] need an ALU uop, while vpbroadcastd/q need only the load uop.)
Footnote 1: e.g. vmulps zmm0,zmm1,zmm2{rz-sae} (GAS .intel_syntax / MASM) or vmulps zmm0, zmm1, zmm2, {rz-sae} (NASM, with an extra comma before the {}).
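The only difference between the two rounding spellings in the footnote is where the decorator goes, which a hypothetical formatter (not a real assembler API) can show: attached directly to the last register for GAS .intel_syntax/MASM, or as its own comma-separated operand for NASM. The mode names rn/rd/ru/rz are the four AVX-512 rounding modes (to nearest, down, up, toward zero).

```python
from typing import List

ROUNDING_MODES = ("rn", "rd", "ru", "rz")   # nearest, down, up, toward zero

def rounding_operands(syntax: str, regs: List[str], mode: str) -> str:
    """Format register operands plus an embedded-rounding decorator."""
    if mode not in ROUNDING_MODES:
        raise ValueError(f"unknown rounding mode: {mode!r}")
    decorator = "{%s-sae}" % mode
    if syntax in ("gas-intel", "masm"):
        return ",".join(regs) + decorator        # zmm0,zmm1,zmm2{rz-sae}
    if syntax == "nasm":
        return ", ".join(regs + [decorator])     # zmm0, zmm1, zmm2, {rz-sae}
    raise ValueError(f"unknown syntax: {syntax!r}")

print("vmulps", rounding_operands("masm", ["zmm0", "zmm1", "zmm2"], "rz"))
print("vmulps", rounding_operands("nasm", ["zmm0", "zmm1", "zmm2"], "rz"))
```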