
What happens when you call _mm512_i32scatter_ps and the indices repeat? Does it store the sum? Does it just store one? Is it UB? I can't seem to find any documentation on this edge case and I don't want to rely on it if it is UB.

I tried searching the Intel Intrinsics Guide, but that got me nowhere.

Peter Cordes

1 Answer


Check the asm manual for more detailed docs any time the intrinsics guide seems lacking or wrong: https://www.felixcloutier.com/x86/vscatterdps:vscatterdpd:vscatterqps:vscatterqpd

"Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers) ... Note that this does not account for non-overlapping indices that map into the same physical address locations."

So when indices repeat, the final value comes from the highest-numbered vector element that uses that index. It's not a sum, and it's not UB.

(Unless you have the same page mapped to multiple addresses and you scatter to the same element via different virtual addresses.)
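To make that ordering rule concrete, here is a minimal scalar sketch of what the scatter ends up storing when indices repeat. It assumes a scale of 4 bytes (plain `float` indexing) and no write mask. The names `scatter_ps_model` and `scatter_ps_model_demo` are made up for illustration. It models only the final memory contents, not how the hardware actually performs the stores.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the result of _mm512_i32scatter_ps(base, vindex, a, 4).
   Hypothetical helper, not Intel's implementation. Lanes are stored from
   element 0 up to element 15, so when two lanes share an index the
   higher-numbered lane overwrites the lower one. */
static void scatter_ps_model(float *base, const int32_t idx[16],
                             const float src[16])
{
    for (int lane = 0; lane < 16; ++lane)
        base[idx[lane]] = src[lane];  /* scale 4 == sizeof(float) */
}

/* Self-check: lanes k and k+8 both target element k. */
static void scatter_ps_model_demo(void)
{
    float buf[8] = {0};
    int32_t idx[16];
    float src[16];

    for (int lane = 0; lane < 16; ++lane) {
        idx[lane] = lane % 8;           /* every index appears twice */
        src[lane] = (float)(lane + 1);  /* lane k holds the value k+1 */
    }
    scatter_ps_model(buf, idx, src);

    /* Lane k+8 (value k+9) wins over lane k (value k+1); nothing is summed. */
    for (int k = 0; k < 8; ++k)
        assert(buf[k] == (float)(k + 9));
}
```

If you need sum semantics instead (e.g. a histogram), you'd have to resolve the duplicates yourself before scattering.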

The rest of the details the asm manual documents are about encoding, and rules for ordering in case of faults such as #PF page faults (all lower elements will be complete before the faulting element; upper elements might or might not be completed.)


Note that vpconflictd exists for detecting same-index vector elements, but it's slow on Intel: about 37 uops and 19 to 20 cycle throughput for the ZMM version on SKX / Ice Lake / Sapphire Rapids. (It's fast on AMD Zen 4: https://uops.info/) So if you can avoid needing to check for conflicts, that's good. (Scatters themselves, like the vscatterdps zmm you're asking about, are 11 cycle throughput on Ice Lake / Alder Lake, but slower on Zen 4, about 22 cycle throughput.)
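For reference, this is a scalar sketch of what vpconflictd (`_mm512_conflict_epi32`) reports, as I understand the documentation: each result element is a bitmask of the lower-numbered elements that hold the same value. The helper names `conflict_epi32_model`, `has_duplicate_index` and `conflict_model_demo` are hypothetical. The code only illustrates the semantics and says nothing about performance.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpconflictd on 16 x 32-bit elements (hypothetical helper).
   out[i] has bit j set when j < i and idx[j] == idx[i]. */
static void conflict_epi32_model(const int32_t idx[16], uint32_t out[16])
{
    for (int i = 0; i < 16; ++i) {
        uint32_t bits = 0;
        for (int j = 0; j < i; ++j)
            if (idx[j] == idx[i])
                bits |= (uint32_t)1 << j;
        out[i] = bits;
    }
}

/* Returns 1 if any index value occurs more than once, else 0. */
static int has_duplicate_index(const int32_t idx[16])
{
    uint32_t conflicts[16];
    conflict_epi32_model(idx, conflicts);
    for (int i = 0; i < 16; ++i)
        if (conflicts[i] != 0)
            return 1;
    return 0;
}

/* Self-check: element 9 repeats the index held by element 3. */
static void conflict_model_demo(void)
{
    int32_t idx[16];
    uint32_t conflicts[16];

    for (int i = 0; i < 16; ++i)
        idx[i] = i;                      /* all distinct so far */
    assert(!has_duplicate_index(idx));

    idx[9] = idx[3];
    conflict_epi32_model(idx, conflicts);
    assert(conflicts[9] == ((uint32_t)1 << 3));  /* points back at lane 3 */
    assert(conflicts[3] == 0);                   /* first occurrence is clean */
    assert(has_duplicate_index(idx));
}
```

A zero result for every element means all indices are distinct, so the scatter order doesn't matter for that vector.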

Peter Cordes
  • Maybe I am just being stupid but I am not sure what they are saying when it comes to overlapping indices. Is it that they will rewrite the same value to that index in the order of how they appear in the SIMD register? – Grogfrognumber47 Aug 02 '23 at 04:23
  • @Grogfrognumber47: Yes, that's my reading of what they're saying. So the final value comes from the highest-index vector element. Should be easy to test if you want to double-check, given that we know there is an ordering rule so whatever you find will be guaranteed on other CPUs. – Peter Cordes Aug 02 '23 at 04:28
  • Thank you for the information. This was very helpful. Where do you find the throughput for the instructions for Zen? I am developing with Zen 4 in mind since the target machine is Zen 4. – Grogfrognumber47 Aug 02 '23 at 04:33
  • @Grogfrognumber47: https://uops.info/ – Peter Cordes Aug 02 '23 at 05:29
  • @Grogfrognumber47: BTW, see https://agner.org/optimize/ and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for details on what to do with the latency and ports per-instruction microbenchmark results from https://uops.info/. – Peter Cordes Aug 03 '23 at 02:15