How "lock add" is implemented on x86 processors

Question

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32 core Skylake Intel processor. Unsurprisingly (from the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program std::atomic::fetch_add is implemented with a lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).

What makes lock add so much faster on an intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cacheline, and to execute both instructions with sequential consistency, the executing CPU has to pull the line into it's own core in exclusive or modified mode (from MESI). How then does the processor optimize fetch_add internally?

This is a simplified version of the benchmarking code. There was no load+CAS loop for the compare_exchange_strong benchmark, just a compare_exchange_strong on the atomic with an input variable that kept getting varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.

You should show your benchmarking code as it's often where the problem lies. Note that you'd have the same cache contention problems even if the LOCK prefix wasn't being used, as the CPU has to pull the cache line into memory for any write that's less than a full cache line width. — Ross Ridge, Dec 30 '19 at 20:17
Can you post the actual loops disassembly with actual numbers (iter/s)? — rustyx, Dec 30 '19 at 22:45
You didn't really benchmark `fetch_add()`. Because you didn't use the value that was fetched the compiler optimized it into an atomic "don't fetch, just add" and it's this "don't fetch, just add" that you benchmarked. — Brendan, Dec 31 '19 at 11:03
You can view locked instructions as taking a memory mutex: a mutex on the smallest portion of the memory system that can be locked by the memory access system. The way it works guarantees no deadlock can happen and the lock doesn't have to be optimistic (it never has to be cancelled). — curiousguy, Jan 01 '20 at 00:06
@Brendan: `lock xadd` is presumably only a couple extra uops and not fundamentally different. But fair point, that would be another thing the OP could test. — Peter Cordes, Jan 01 '20 at 04:22

Peter Cordes · Answer 1 · 2019-12-31T06:23:15.600

lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?). According to Agner Fog's instruction tables, lock cmpxchg and lock add are very similar numbers of uops from microcode. (Although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the var stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.

You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).

Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.

You don't have complete code in the question so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record number of locked instructions executed.

With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.

It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.

Is that what you mean?

Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?

score 2 · Answer 2 · answered Jan 01 '20 at 00:34

to execute both instructions with sequential consistency, the executing CPU has to pull the line into it's own core in exclusive or modified mode (from MESI).

No, to execute either instruction with any consistent, defined semantic that guarantees that concurrent executions on multiple CPU do not lose increments, you would need that. Even if you were willing to drop "sequential consistency" (on these instructions) or even drop the usual acquire and release guarantees of reads and writes.

Any locked instruction effectively enforces mutual exclusion on the part of memory sufficient to guarantee atomicity. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.

What makes lock add so much faster on an intel multi-core processor?

I would expect any tiny difference of timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much like too code using mutexes can have widely different efficiency when there is heavy contention and a small change in access pattern changes the way the mutex is attributed.

How "lock add" is implemented on x86 processors

2 Answers2