Check the asm manual for more detailed docs any time the intrinsics guide seems lacking or wrong: https://www.felixcloutier.com/x86/vscatterdps:vscatterdpd:vscatterqps:vscatterqpd
The manual says: "Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers) ... Note that this does not account for non-overlapping indices that map into the same physical address locations."
So the final value will come from the highest-index vector element.
(Unless you have the same page mapped to multiple addresses and you scatter to the same element via different virtual addresses.)
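To make the ordering rule concrete, here is a minimal scalar sketch of what a masked dword-index scatter of floats does to memory. It is only a model for reasoning about duplicate indices, not the instruction or an intrinsic: the name scatter_model is made up for illustration, and it assumes indices are in units of floats (a scale of 4 bytes) and at most 16 lanes, as for a ZMM of floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a masked scatter (not a real intrinsic).
 * Lanes are stored in order from lane 0 (LSB) up to lane n-1 (MSB),
 * so when several active lanes share an index, the highest lane wins. */
void scatter_model(float *base, const int32_t *idx, const float *src,
                   uint16_t mask, size_t n)
{
    assert(n <= 16);  /* a ZMM register holds at most 16 dword lanes */
    for (size_t lane = 0; lane < n; lane++) {
        if (mask & (1u << lane)) {
            /* Assumption: idx[] counts floats, i.e. scale = 4 bytes. */
            base[idx[lane]] = src[lane];
        }
    }
}
```

With indices {1, 3, 1, 1} and all four lanes active, slot 1 ends up holding the value from lane 3, the highest lane that targets it. Masking off lane 3 makes lane 2 the winner instead.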
The rest of the details the asm manual documents are about encoding, and rules for ordering in case of faults such as #PF page faults (all lower elements will be complete before the faulting element; upper elements might or might not be completed.)
Note that vpconflictd exists for detecting same-index vector elements, but it's slow on Intel, like 37 uops / 20 to 19 cycle throughput on SKX / Ice Lake / Sapphire Rapids for the ZMM version. (Fast on AMD Zen 4: https://uops.info/) So if you can avoid needing to check for conflicts, that's good. (Scatters themselves, like the vscatterdps zmm you're asking about, are 11 cycle throughput on Ice Lake / Alder Lake, but slower on Zen 4, like 22 cycle throughput.)
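For reference, this is roughly what vpconflictd computes, written as a scalar sketch: each output element is a bitmask of the earlier (lower-numbered) lanes that hold the same value. The function names conflict_model and has_conflicts are hypothetical helpers for illustration, not intrinsics, and the loop makes no attempt to be fast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpconflictd (not a real intrinsic).
 * out[i] gets bit j set for every lane j < i with idx[j] == idx[i]. */
void conflict_model(const int32_t *idx, uint32_t *out, size_t n)
{
    assert(n <= 32);  /* one result bit per earlier lane */
    for (size_t i = 0; i < n; i++) {
        uint32_t bits = 0;
        for (size_t j = 0; j < i; j++) {
            if (idx[j] == idx[i])
                bits |= (uint32_t)1 << j;
        }
        out[i] = bits;
    }
}

/* Returns 1 if any lane repeats the index of an earlier lane, else 0. */
int has_conflicts(const int32_t *idx, size_t n)
{
    uint32_t out[32];
    conflict_model(idx, out, n);
    for (size_t i = 0; i < n; i++) {
        if (out[i] != 0)
            return 1;
    }
    return 0;
}
```

If has_conflicts returns 0 for an index vector, no two lanes share an index, so the scatter ordering question does not arise for those indices (setting aside the aliased-mapping case mentioned earlier).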