Across modern AMD/Intel CPUs, is contention for atomics (inc/dec/swap, etc.) between two SMT/HT threads on the same physical core known to perform significantly better, in general, than between threads pinned to different cores?

- Yes, atomic RMWs are much cheaper when the cache line is already hot, with no RFO needed to pull it in from another physical core. [What are the CPU-internal characteristics of CAS collision?](https://stackoverflow.com/q/5720007) seems to be about that (different vs. same physical core). Related in general, without perf numbers: [What will be used for data exchange between threads are executing on one Core with HT?](https://stackoverflow.com/q/32979067). – Peter Cordes Jun 13 '21 at 15:44
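To see the effect directly, here is a rough Linux-only micro-benchmark sketch (not from the linked questions) that times contended `fetch_add` between two pinned threads. The CPU numbers `0`/`1` vs. `0`/`4` are placeholders for an SMT-sibling pair vs. two separate physical cores; check `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list` for the real pairing on your machine:

```cpp
// Compile: g++ -O2 -std=c++17 -pthread contention.cpp
// Linux-only sketch; the CPU numbers below are placeholders, not universal.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <utility>

static std::atomic<long> counter{0};

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void hammer(int cpu, long iters) {
    pin_to_cpu(cpu);
    for (long i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // x86: lock xadd / lock add
}

int main() {
    const long iters = 100'000'000;
    // {0,1}: assumed SMT siblings; {0,4}: assumed separate physical cores.
    for (auto [a, b] : { std::pair{0, 1}, std::pair{0, 4} }) {
        counter = 0;
        auto t0 = std::chrono::steady_clock::now();
        std::thread ta(hammer, a, iters), tb(hammer, b, iters);
        ta.join();
        tb.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("cpus %d+%d: %.2fs for %ld fetch_adds\n",
                    a, b, dt.count(), counter.load());
    }
}
```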
- The situation is different with pure-load / pure-store atomic operations: it seems sharing a physical core gives *more* opportunity for memory-order mis-speculation machine clears: [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/a/45610386). That only tested loads and stores, not RMWs like `.fetch_add` (x86 `lock add` or `lock xadd`). – Peter Cordes Jun 13 '21 at 15:45
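For reference, the pure-load / pure-store pattern that answer measures looks roughly like the sketch below (not the answer's actual test code). Pin the two threads to siblings vs. separate cores as in the sketch above, and count mis-speculation events with `perf stat -e machine_clears.memory_ordering`, a perf event available on recent Intel CPUs:

```cpp
// Sketch: one thread only stores, the other only loads, no RMWs involved.
// Run under: perf stat -e machine_clears.memory_ordering ./a.out
#include <atomic>
#include <thread>

static std::atomic<int> shared_val{0};
static std::atomic<bool> stop{false};

int main() {
    std::thread writer([] {                       // pure stores
        for (int i = 0; !stop.load(std::memory_order_relaxed); ++i)
            shared_val.store(i, std::memory_order_release);
    });
    std::thread reader([] {                       // pure loads
        long sink = 0;
        for (long n = 0; n < 100'000'000; ++n)
            sink += shared_val.load(std::memory_order_acquire);
        stop.store(true, std::memory_order_relaxed);
        (void)sink;                               // keep the sum "used"
    });
    reader.join();
    writer.join();
}
```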
- That's a lot of good info to digest. It makes me wonder whether amortising atomic ops between HT pairs into a core-local variable first, before resolving the result into a global cross-core atomic value, is ever worth experimenting with (as a first or only level of hierarchical atomic composition to spread the contention load). – iam Jun 13 '21 at 16:19
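A minimal sketch of that two-level idea, under some loudly-labeled assumptions: hyper-siblings share a per-core atomic (whose cache line stays hot within the physical core), and each per-core subtotal is folded into the global atomic once at the end. The thread-to-core mapping is illustrative and would need real pinning (as in the affinity helper above); all names are made up:

```cpp
// Sketch of hierarchical atomic accumulation: cheap same-core RMWs first,
// then one cross-core RMW per physical core. Mapping/names are assumptions.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kPhysCores = 4;                  // assumption: 4 cores x 2 SMT

struct alignas(64) CoreCounter {               // one cache line per phys core
    std::atomic<long> n{0};
};
static CoreCounter per_core[kPhysCores];
static std::atomic<long> global_total{0};

static void worker(int phys_core, long iters) {
    // Contended only with our hyper-sibling, so the line stays core-hot.
    for (long i = 0; i < iters; ++i)
        per_core[phys_core].n.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 2 * kPhysCores; ++t)   // two siblings per core
        pool.emplace_back(worker, t % kPhysCores, 1'000'000L);
    for (auto& th : pool) th.join();
    for (auto& c : per_core)                   // one cross-core RMW per core
        global_total.fetch_add(c.n.load(), std::memory_order_relaxed);
    std::printf("total = %ld\n", global_total.load());
}
```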
- For something like collecting and reducing the results of parallel chunks of an operation like an array sum? Yeah, if you're already pinning threads to cores and writing HT-aware code, you might have pairs of threads that share a physical core write adjacent elements of a results array (with 16- or 32-byte elements aligned by 64, so each physical core shares its output line with at most one other physical core), and maybe even have a shared still-active counter so the last one of the pair can add its result to the other's? – Peter Cordes Jun 13 '21 at 16:40
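That layout might be sketched as below; the field names and array size are made up for illustration, and it assumes pinning puts both siblings of physical core `c` on `results[c]`:

```cpp
// Sketch of the HT-aware results layout: one 64-byte-aligned slot pair per
// physical core, so hyper-siblings share a (hot) output line with each other
// but never with another physical core. Names and sizes are illustrative.
#include <cstdint>

struct alignas(64) CorePairResult {
    double   sum[2];    // one 8-byte partial result per SMT sibling
    uint64_t done[2];   // optional per-sibling completion flags
};
static_assert(sizeof(CorePairResult) == 64, "exactly one cache line");

// One entry per physical core; siblings of core c write sum[0] and sum[1].
CorePairResult results[32];
```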
- Hmm, that last bit is probably silly unless there's significant work to do (more than just one scalar add of a sum); if that's all it is, just let the one thread collecting the results grab both values from the same cache line and add them. The number of cache lines touched is the more significant factor. – Peter Cordes Jun 13 '21 at 16:42
- Yeah, I was thinking of very GPU-style data-parallel operations like reductions, etc. And also from the perspective of a generalised task system (where tasks might not even always be so well optimised), since the atomic ops or sync primitives used for accounting could take advantage of that somehow. I haven't seen anyone talk about a task system optimised specifically around HT before. – iam Jun 13 '21 at 16:46