2

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?

What would be a good workaround?

  • apply _mm256_round_ps to upper and lower half and fuse the results?

  • use _mm512_add_round_ps with one argument being zero?

Thanks!

Ralf
  • 1,203
  • 1
  • 11
  • 20
  • Probably `_mm512_cvtps_epi32` is what you need. The value is rounded according the the current rounding mode. The output is a packed integer. You can use `_mm512_cvtepi32_ps` to convert it back to a packed float. – wim Jun 14 '18 at 13:52
  • @wim Thanks, that would work for floats which don't exceed the range of 32-bit ints, but for larger exponents I can't squeeze the rounded float into a 32-bit int. – Ralf Jun 14 '18 at 13:59
  • I see. I have another idea, but I have to do some testing to see if it works. – wim Jun 14 '18 at 14:01
  • @wim: I'm writing up an answer explaining `_mm512_roundscale_ps`, the AVX512 replacement for `roundps`. Got side-tracked on it, though, will finish soon. – Peter Cordes Jun 14 '18 at 14:21
  • @PeterCordes Great! I didn't know `_mm512_roundscale_ps` until now. – wim Jun 14 '18 at 14:36
  • I was thinking of `magicfloat = _mm512_set1_ps(8388608.0f); x = _mm512_add_ps(x,magicfloat); x = _mm512_sub_ps(x,magicfloat);`, which should work well for positive floats, but needs some extra bit logic to make it work for both positive and negative floats. Note that `vrndscaleps` is two μops, which is the same as `vaddps` and `vsubps` together. – wim Jun 14 '18 at 14:42

1 Answers1

4

TL:DR: AVX512F

__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);

related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.


The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.


vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.

vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).

I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.

The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).

With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.

(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)


__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder

The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.

You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.


You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)

The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)

So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thanks a lot, that solves my problem. I'm a bit confused by the Intel Intrinsics Guide: There is `_mm512_roundscale_ps()` and `_mm512_roundscale_round_ps()`. In the latter the rounding mode can be specified in an additional parameter `rounding`, but the same information is provided by `imm[0:1]`? – Ralf Jun 15 '18 at 10:55
  • @Ralf: yeah, I mentioned this in my answer: "*There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.*" It appears to just be there for uniformity. – Peter Cordes Jun 15 '18 at 10:58
  • What I meant was: does `rounding` overwrite `imm[0:1]` in `_mm512_roundscale_round_ps()`? – Ralf Jun 15 '18 at 11:01
  • @Ralf: If `imm[2]` is unset, then `imm[1:0]` takes precedence over anything else. But if it's set (`#define _MM_FROUND_CUR_DIRECTION 0x4`), the docs claim it uses `MXCSR.RC`, and don't mention the ER-SAE rounding mode override. I wouldn't be surprised if the EVEX prefix provided the effective value of `MXCSR.RC` if you use the override though, despite what the docs say. Good question. – Peter Cordes Jun 15 '18 at 11:18