TL:DR: AVX512F
related: AVX512DQ _mm512_reduce_pd
or _ps
will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd
have the most detail.
The EVEX prefix allows overriding the default rounding direction {er}
and setting suppress-all-exceptions {sae}
, for FP instructions. (This is what the ..._round_ps()
versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8
didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8
is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5
would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps
. As usual, the PD
version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps
: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd
on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
__m512 _mm512_roundscale_ps( __m512 a, int imm);
is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);
. There's nothing you can do with the sae
operand that roundscale
can't already do with its imm8
, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC
and so on constants documented for _mm_round_pd
/ _mm256_round_pd
, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION
to use whatever the current mode is. _MM_FROUND_NO_EXC
suppresses setting the inexact exception bit in the MXCSR.
You might be wondering why vrndscaleps
needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae}
(Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8
to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1
. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps
. (http://agner.org/optimize/).