
Norvig claims that a mutex lock or unlock operation takes only a quarter of the time needed for a fetch from main memory.

This answer explains that a mutex is essentially a flag plus a wait queue, and that flipping the flag on an uncontended mutex takes only a few instructions.
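
For intuition, here is a minimal C++ sketch of that fast path (a toy with made-up names such as `toy_mutex`; the wait queue and the blocking slow path of a real mutex are omitted): the uncontended lock is a single atomic read-modify-write on the flag.

```cpp
#include <atomic>

// Toy sketch of the fast path (not a real mutex: the wait queue and the
// blocking slow path are omitted, and it spins instead of sleeping).
class toy_mutex {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Uncontended case: a single atomic compare-and-swap flips the flag.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false;  // a real mutex would park the thread here
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```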

  1. I assume that if a different CPU or core tries to lock that mutex, it needs to wait for the cache line to be written back to memory (if that hasn't already happened) and for its own memory read to get the state of the flag. Is that correct? And is there a difference between a different core and a different CPU?

  2. So the numbers Norvig states are only valid for an uncontended mutex, where the CPU or core attempting the operation already has that flag in its cache and the cache line isn't dirty?

typetetris
  • Do you really want to know about different CPUs? That's not the most common setup (it exists, of course); usually there's just one CPU with multiple cores. – harold Jun 21 '17 at 11:46
  • Both situations are interesting to me. – typetetris Jun 21 '17 at 11:48
  • I hope it is not getting too broad now. – typetetris Jun 21 '17 at 11:50
  • The cache line does not **have** to be written back to memory. Even in implementations that write back a modified line, the write can be buffered and the data can be transferred (along with ownership) to the new write requester before the data has been transferred to memory. Cache-to-cache transfers are typically faster than memory accesses (in some large NUMA systems, local memory might be faster than very remote cache). –  Jun 21 '17 at 16:20

1 Answer


A typical PC runs an x86 CPU, and Intel's CPUs can perform the locking entirely within the caches:

> if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus.
> Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically.
> This operation is called “cache locking.”
> The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

From the Intel Software Developer's Manual, Volume 3, Section 8.1.4
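
As a concrete illustration (assuming a GCC or Clang build on x86-64; `counter` and `increment` are made-up names), an atomic read-modify-write in C++ usually compiles to a single LOCK-prefixed instruction, which is exactly the case the quoted paragraph covers:

```cpp
#include <atomic>

std::atomic<long> counter{0};

// Mainstream x86-64 compilers emit a single LOCK-prefixed instruction
// here (e.g. `lock xadd`; check with `g++ -S` or a disassembler). When
// the line is already cached as write-back memory, the CPU uses cache
// locking instead of asserting the LOCK# bus signal.
long increment() {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```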

The cache coherence mechanism is a variation of the MESI protocol.
In such a protocol, before a CPU can write to a cached location, it must have the corresponding line in the Exclusive (E) state.
This means that only one CPU at a time has a given memory location in a dirty state.
When other CPUs want to read the same location, the owner CPU will delay such reads until the atomic operation is finished.
It then follows the coherence protocol to either forward, invalidate, or write back the line.

In the above scenario, a lock can be performed faster than an uncached load.
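
To get a feel for what those line-ownership transfers cost under contention, here is a hedged micro-benchmark sketch (C++14 or later, compile with `-pthread`; `padded` and `run` are made-up names, and absolute timings vary wildly across machines): two threads increment either the same atomic, forcing the line to bounce between cores, or two atomics on separate cache lines.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Each counter sits on its own 64-byte cache line.
struct alignas(64) padded { std::atomic<long> v{0}; };

// Time two threads doing 10M atomic increments each on counters a and b
// (which may alias the same object, for the contended case).
static long long run(padded& a, padded& b) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) a.v.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) b.v.fetch_add(1); });
    t1.join();
    t2.join();
    auto t1e = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1e - t0).count();
}

int main() {
    padded x, y;
    // Same line: every fetch_add must reclaim the line in Exclusive/Modified
    // state from the other core (MESI ownership ping-pong).
    std::printf("same line:      %lld ms\n", run(x, x));
    // Separate lines: each core keeps its line and runs at full speed.
    std::printf("separate lines: %lld ms\n", run(x, y));
}
```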

Those times, however, are a bit off and surely outdated.
They are intended to give an ordering, along with an order of magnitude, among the typical operations.
The timing for an L1 hit is a bit odd: it isn't faster than typical instruction execution (which by itself cannot be described with a single number).
The Intel optimization manual reports, for an old CPU like Sandy Bridge, an L1 access time of 4 cycles, while there are a lot of instructions with a latency of 4 cycles or less.

I would take those numbers with a grain of salt and avoid reasoning too much from them.
The lesson Norvig tried to teach us is: hardware is layered; the closer (from a topological point of view¹) to the CPU, the faster.
So when parsing a file, a programmer should avoid moving data back and forth to the file; instead, they should minimize the IO pressure.
The same applies when processing an array: locality will improve performance.
Note, however, that these are technically micro-optimisations, and the topic is not as simple as it appears.
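
As a small illustration of the locality point (a sketch, assuming a row-major N×N matrix; the function names are made up): both loops compute the same sum, but the first walks consecutive addresses while the second strides a whole row per access and defeats the cache.

```cpp
#include <cstddef>
#include <vector>

// Both functions sum the same N x N row-major matrix.
long sum_rows(const std::vector<long>& m, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];  // consecutive addresses: cache-friendly
    return s;
}

long sum_cols(const std::vector<long>& m, std::size_t n) {
    long s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];  // strides n elements per step: cache-hostile
    return s;
}
```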


¹ In general, divide the hardware by where it sits: inside the core (registers), inside the CPU (caches, possibly excluding the LLC), inside the socket (GPU, LLC), behind dedicated buses (memory, other CPUs), behind one generic bus (PCIe: internal devices like network cards), behind two or more buses (USB devices, disks), and in another computer entirely (servers).

Margaret Bloom