Across modern AMD/Intel CPUs, is contention for atomics (inc/dec/swap, etc.) between two SMT/HT threads on the same physical core known to perform significantly better, in general, than between threads pinned to different cores?

- Yes, atomic RMWs are much cheaper when the cache line is already hot, with no RFO needed to pull it in from another physical core. [What are the CPU-internal characteristics of CAS collision?](https://stackoverflow.com/q/5720007) seems to be about that (different vs. same physical core). Related in general, without perf numbers: [What will be used for data exchange between threads are executing on one Core with HT?](https://stackoverflow.com/q/32979067). – Peter Cordes Jun 13 '21 at 15:44
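To see the effect directly, here is a rough Linux-only micro-benchmark sketch (not from the linked questions) that times contended `fetch_add` between two pinned threads. The CPU numbers `0`/`1` vs. `0`/`4` are placeholders for an SMT-sibling pair vs. two separate physical cores; check `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list` for the real pairing on your machine:

```cpp
// Compile: g++ -O2 -std=c++17 -pthread contention.cpp
// Linux-only sketch; the CPU numbers below are placeholders, not universal.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <utility>

static std::atomic<long> counter{0};

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void hammer(int cpu, long iters) {
    pin_to_cpu(cpu);
    for (long i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // x86: lock xadd / lock add
}

int main() {
    const long iters = 100'000'000;
    // {0,1}: assumed SMT siblings; {0,4}: assumed separate physical cores.
    for (auto [a, b] : { std::pair{0, 1}, std::pair{0, 4} }) {
        counter = 0;
        auto t0 = std::chrono::steady_clock::now();
        std::thread ta(hammer, a, iters), tb(hammer, b, iters);
        ta.join();
        tb.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("cpus %d+%d: %.2fs for %ld fetch_adds\n",
                    a, b, dt.count(), counter.load());
    }
}
```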
- The situation is different with pure-load / pure-store atomic operations: it seems sharing a physical core gives *more* opportunity for memory-order mis-speculation machine clears: [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/a/45610386). That only tested loads and stores, not RMWs like `.fetch_add` (x86 `lock add` or `lock xadd`). – Peter Cordes Jun 13 '21 at 15:45
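For reference, the pure-load / pure-store pattern that answer measures looks roughly like the sketch below (not the answer's actual test code). Pin the two threads to siblings vs. separate cores as in the sketch above, and count mis-speculation events with `perf stat -e machine_clears.memory_ordering`, a perf event available on recent Intel CPUs:

```cpp
// Sketch: one thread only stores, the other only loads, no RMWs involved.
// Run under: perf stat -e machine_clears.memory_ordering ./a.out
#include <atomic>
#include <thread>

static std::atomic<int> shared_val{0};
static std::atomic<bool> stop{false};

int main() {
    std::thread writer([] {                       // pure stores
        for (int i = 0; !stop.load(std::memory_order_relaxed); ++i)
            shared_val.store(i, std::memory_order_release);
    });
    std::thread reader([] {                       // pure loads
        long sink = 0;
        for (long n = 0; n < 100'000'000; ++n)
            sink += shared_val.load(std::memory_order_acquire);
        stop.store(true, std::memory_order_relaxed);
        (void)sink;                               // keep the sum "used"
    });
    reader.join();
    writer.join();
}
```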
- That's a lot of good info to digest. It makes me wonder whether amortising atomic ops between HT pairs into a core-local variable first, before resolving the result into a global cross-core atomic value, is ever worth experimenting with (as a first or only level of hierarchical atomic composition to spread the contention load). – iam Jun 13 '21 at 16:19
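A minimal sketch of that two-level idea, under some loudly-labeled assumptions: hyper-siblings share a per-core atomic (whose cache line stays hot within the physical core), and each per-core subtotal is folded into the global atomic once at the end. The thread-to-core mapping is illustrative and would need real pinning (as in the affinity helper above); all names are made up:

```cpp
// Sketch of hierarchical atomic accumulation: cheap same-core RMWs first,
// then one cross-core RMW per physical core. Mapping/names are assumptions.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kPhysCores = 4;                  // assumption: 4 cores x 2 SMT

struct alignas(64) CoreCounter {               // one cache line per phys core
    std::atomic<long> n{0};
};
static CoreCounter per_core[kPhysCores];
static std::atomic<long> global_total{0};

static void worker(int phys_core, long iters) {
    // Contended only with our hyper-sibling, so the line stays core-hot.
    for (long i = 0; i < iters; ++i)
        per_core[phys_core].n.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 2 * kPhysCores; ++t)   // two siblings per core
        pool.emplace_back(worker, t % kPhysCores, 1'000'000L);
    for (auto& th : pool) th.join();
    for (auto& c : per_core)                   // one cross-core RMW per core
        global_total.fetch_add(c.n.load(), std::memory_order_relaxed);
    std::printf("total = %ld\n", global_total.load());
}
```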
- For something like collecting and reducing the results of parallel chunks of an operation like an array sum? Yeah, if you're already pinning threads to cores and writing HT-aware code, you might have pairs of threads that share a physical core write adjacent elements of a results array (with 16- or 32-byte elements aligned by 64, so each physical core shares its output line with at most one other physical core), and maybe even have a shared still-active counter so the last one of the pair can add its result to the other's? – Peter Cordes Jun 13 '21 at 16:40
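That layout might be sketched as below; the field names and array size are made up for illustration, and it assumes pinning puts both siblings of physical core `c` on `results[c]`:

```cpp
// Sketch of the HT-aware results layout: one 64-byte-aligned slot pair per
// physical core, so hyper-siblings share a (hot) output line with each other
// but never with another physical core. Names and sizes are illustrative.
#include <cstdint>

struct alignas(64) CorePairResult {
    double   sum[2];    // one 8-byte partial result per SMT sibling
    uint64_t done[2];   // optional per-sibling completion flags
};
static_assert(sizeof(CorePairResult) == 64, "exactly one cache line");

// One entry per physical core; siblings of core c write sum[0] and sum[1].
CorePairResult results[32];
```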
- Hmm, that last bit is probably silly unless there's significant work to do (more than just one scalar add of a sum); if that's all it is, just let the one thread collecting the results grab both values from the same cache line and add them. The number of cache lines touched is the more significant factor. – Peter Cordes Jun 13 '21 at 16:42
- Yeah, I was thinking of very GPU-style data-parallel operations like reductions, etc. And also from the perspective of a generalised task system (where tasks might not even always be so well optimised), since the atomic ops or sync primitives used for accounting could take advantage of that somehow. I haven't seen anyone talk about a task system optimised specifically around HT before. – iam Jun 13 '21 at 16:46