
Is there an atomic CAS instruction or equivalent in the AVX512 set?

I can't immediately find one but don't have the best google fu.

Paul R
Alex R
  • I don't think so - what are you actually trying to achieve? I can't think of a use case where you'd want an atomic SIMD compare-and-swap (presumably element-wise?). – Paul R Jan 04 '18 at 11:01
  • @PaulR I was thinking about lockless data structures, e.g. CASing multiple packed 64bit ints simultaneously. The specific idea I had (if an atomic 512bit CAS was possible) was a trie with 256bits of mask for existence and 256bits for 4 pointers in order of existence in the mask. – Alex R Jan 04 '18 at 11:39
  • If you use 32-bit pointers (or 32-bit array offsets relative to a 64-bit base pointer), you can fit twice as many elements in the same vector. But atomic 64-byte loads aren't possible either (again without a transaction), so even readers of this data structure would need expensive operations. Unless you only need it to work on a specific Skylake-AVX512 machine, where aligned 64-byte loads/stores may in fact be atomic even though x86 on paper does *not* guarantee this. (And some future AMD CPU will probably run 512-bit ops as multiple smaller loads/stores.) – Peter Cordes Jan 04 '18 at 12:11

1 Answer


Other than `lock cmpxchg16b` (16 bytes), x86 doesn't have any guaranteed-atomic operations wider than 8 bytes. Aligned vector loads/stores are element-wise atomic on current CPUs (i.e. no tearing within an 8-byte element), although it's not clear whether the documentation guarantees that.

Were you hoping for a 64-byte whole-cache-line CAS? There's no single instruction for that.

AVX512 alone doesn't provide that, but with TSX (transactional memory) you can roll your own. Put a load + compare + store inside a transaction. IDK how expensive xbegin / xend is compared to lock cmpxchg.

You don't need AVX512 for it either; the whole transaction commits atomically or not at all, so you could use a pair of AVX2 load / compare instructions to implement a 64-byte CAS.

Peter Cordes
  • Thanks for the information :) Shame that it's not possible without TSX (I've experimented with TSX before and found it to have far too high an overhead to be viable currently). – Alex R Jan 04 '18 at 13:40
  • @AlexR: Indeed, it would be nice if there was some way to query what width of operation is atomic on the current CPU, so you could take advantage of atomic 64-byte loads on CPUs where they actually *are* atomic. (That wouldn't give you atomic CAS, but it would make the read path cheap.) It would be interesting if there were SIMD atomics, but there are no SIMD RMW instructions, so the load/store execution units don't currently need any support for read-modify-write on the FP side, only integer. – Peter Cordes Jan 04 '18 at 13:45
  • Also, that might constrain chipsets if cache-line transfers between cores had to be atomic. They are on normal Intel machines, but maybe exotic machines with more than 8 sockets might use custom glue logic... AMD CPUs in practice can have tearing on cache line transfers even when within a single socket a vector store is atomic: https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825#7647825 – Peter Cordes Jan 04 '18 at 13:48
  • I'm curious to know why, if `lock` prefix locks the whole cache line, can't we lock the cache line, execute arbitrary instructions, then unlock it when we're done? – Nick Strupat Jun 15 '21 at 03:48
  • @NickStrupat: Because that could deadlock the hardware if software used it wrong. Automatically locking only for a single instruction doesn't have that problem. However, TSX (transactional memory) *does* provide what you're asking for, across potentially multiple lines. (With a transactional abort on conflict, not deadlock, but that means software has to be aware of the need to maybe retry, and provide a branch address.) Similarly, on other architectures, LL/SC does let you do arbitrary things to one memory location as an atomic RMW, but not a whole line. – Peter Cordes Jun 15 '21 at 06:30
  • @NickStrupat: Also, hanging on to exclusive ownership of a cache line as part of the `lock` prefix makes this part of the operation atomic; getting this cache line to other cores as a whole 64-byte atomic transfer is another matter. (Of course, it appears Intel CPUs do that, too, but there are counterexamples of CPUs that [introduce tearing in the transfer between cores on different sockets, but are fine within one socket, e.g. AMD K10](https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825#7647825), which used HyperTransport for that.) – Peter Cordes Jun 15 '21 at 06:33
  • @PeterCordes, Interesting... that makes perfect sense. Thank you for the concise explanation! – Nick Strupat Jun 15 '21 at 18:47
  • @NickStrupat if you only need interrupt atomicity you could try out [rseq](https://github.com/compudj/librseq). – Noah Jun 16 '21 at 03:47