
Modern multicore CPUs synchronize cache between cores by snooping: each core broadcasts its memory accesses and watches the broadcasts generated by other cores, cooperating to make sure writes from core A are seen by core B.

This is good in that if you have data that really does need to be shared between threads, it minimizes the amount of code you have to write to make sure it does get shared.

It's bad in that if you have data that should be local to just one thread, the snooping still happens, constantly dissipating energy to no purpose.

Does the snooping still happen if you declare the relevant variables thread_local? Unfortunately, the answer is yes, according to the accepted answer to Can other threads modify thread-local memory?

Does any currently extant platform (combination of CPU and operating system) provide any way to turn off snooping for thread-local data? Doesn't have to be a portable way; if it requires issuing OS-specific API calls, or even dropping into assembly, I'm still interested.
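
For concreteness, here is a minimal C++ sketch of why the linked answer says yes (the names are illustrative): thread_local only gives each thread its own instance of a variable; the storage itself is ordinary memory, and once its address escapes, any thread can write to it.

```cpp
// Minimal sketch: thread_local storage is ordinary coherent memory.
#include <atomic>
#include <iostream>
#include <thread>

thread_local std::atomic<int> counter{0};        // one instance per thread
std::atomic<std::atomic<int>*> leaked{nullptr};  // lets the address escape

int main() {
    std::thread t([] {
        leaked.store(&counter);            // publish this thread's instance
        while (counter.load() == 0) {}     // spin until another thread writes it
        std::cout << "observed " << counter.load() << '\n';
    });
    while (leaked.load() == nullptr) {}    // wait for the address
    leaked.load()->store(42);              // main thread modifies t's thread_local
    t.join();
}
```

Since the hardware can't prove that an address will never escape like this, it has to keep thread-local storage coherent like any other memory.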

– rwallace
  • *Modern multicore CPUs synchronize cache between cores by snooping* - not really, that doesn't scale for power and aggregate L3 cache bandwidth. In practice, modern CPUs use directory-based coherence, e.g. the tags in Intel's inclusive L3 cache are augmented with bits to say which core might have a modified copy of a line. – Peter Cordes Mar 26 '21 at 14:13
  • Unfortunately this is one of the many ways that one has to pay for "transparent" shared memory support. Completely eliminating snoop traffic (or the corresponding directory lookups) for thread-private memory may require that the OS "pin" the thread to a specific cache context, and that the OS be architected so that it will either never receive pointers to thread-private memory, or that any system call that uses such a pointer must run on the same core that made the call. Hierarchical (page-based) directories could reduce much of the snoop traffic, perhaps in conjunction with extended TLB support. – John D McCalpin Mar 29 '21 at 15:52

2 Answers


Most modern processors use a directory coherence protocol to maintain coherence between all the cores in the same NUMA node, and another directory coherence protocol to maintain coherence between all the NUMA nodes and IO hubs in the same coherence domain, where each NUMA node could be an active socket, part of an active socket, or a node controller. A brief introduction to coherence in real processors can be found at: Cache coherency(MESI protocol) between different levels of cache namely L1, L2 and L3.

Directory coherence protocols significantly reduce the need for broadcasting snoops because they provide additional coherence state per cache line to track which cores or nodes may have a copy of the line (see the sketch after this list). Unnecessary snoops can still occur in the following cases:

  • A line gets silently evicted from a core or NUMA node without notifying the directory controller.
  • The directory state may be protected with an error detection code. If the state is deemed corrupted, a broadcast is required.
  • Depending on the microarchitecture, the in-memory directory may not be able to track cache lines per NUMA node, but only at the coarser granularity of "any other NUMA node."
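
As a rough illustration of the tracking idea, here is a hypothetical sketch (not any vendor's actual directory format; `kNodes` and the field layout are made up) of a directory entry: per-line state plus a presence mask, so the home agent snoops only the agents that may hold a copy instead of broadcasting.

```cpp
#include <bitset>
#include <cstdint>

constexpr int kNodes = 8;  // made-up tracking granularity

struct DirectoryEntry {
    enum class State : std::uint8_t { Invalid, Shared, Exclusive };
    State state{State::Invalid};
    std::bitset<kNodes> sharers;  // which nodes may hold a copy of the line

    // On a write request, only the recorded sharers need invalidations;
    // an empty mask means no snoops are sent at all.
    std::bitset<kNodes> snoopTargets(int requester) const {
        std::bitset<kNodes> targets = sharers;
        targets.reset(requester);  // no need to snoop the requester itself
        return targets;
    }
};
```

When tracking is coarse, as in the last case above, the mask degenerates toward "everyone else", which is exactly where the unnecessary snoops come from.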

The cost of unnecessary snooping is not just extra energy consumption but also latency, because a request cannot be considered to have completed non-speculatively until all of its coherence transactions have completed. This can significantly increase the time to complete a request, which in turn limits bandwidth, because each outstanding request consumes certain hardware resources.

You don't have to worry about unnecessary snoops to cache lines storing thread-local variables as long as they are truly being used as thread-local and the thread that owns them rarely migrates between physical cores.
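
The "rarely migrates" condition can be enforced from user space by pinning the thread. A Linux-specific sketch (it assumes glibc's pthread_setaffinity_np extension; the helper name is mine):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for pthread_setaffinity_np / CPU_SET
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core so its thread-local cache lines
// stay resident in that core's private caches. Hypothetical helper.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

This doesn't turn any coherence machinery off; it just keeps the owning thread's lines resident in one core's private caches so that the directory rarely has a reason to snoop anyone for them.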

– Hadi Brais
  • Snoop filters can also reduce coherence traffic. The latency cost of snooping can also be overlapped with the latency of reading memory. The bandwidth cost is also a factor. A complete directory can be implemented by replicating tags, which is expensive, or storing directory data for each memory chunk (which requires storage proportional to memory capacity — one technique used for the Alpha 21364 was using per cache block ECC and using the extra bits to hold limited directory information). Avoiding unnecessary coherence activity is still a desirable feature. –  Mar 26 '21 at 15:10

There is a basic invalidation-based protocol, MESI, which is somewhat foundational; there are extensions of it, but it serves to minimize the number of bus transactions on a read or write. MESI encodes the states a cache line can be in: Modified, Exclusive, Shared, Invalid. A basic schematic of MESI involves two views. A dash (-) means possibly an internal state change, but no external operation is required. From the CPU to its cache:

           M   E   S   I
Read       -   -   -   2
Write      -   -   1   3

where:

  1. Issue a bus invalidate, change state to M.
  2. Issue a bus read, change state to S.
  3. Issue a bus read + bus invalidate, change state to M.

Also, the cache "listens" to the exterior bus, so from the bus to the cache:

           M   E   S   I
Read       4   -   -   -
Write      5   -   -   -

where:

  4. Flush from cache, change to S.
  5. Flush from cache, change to I.

So the bus-agents co-operate to only generate the minimum necessary transactions.
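
As a sketch, those two tables can be transcribed directly into code (illustrative only; it keeps the tables' simplification of always entering S on a read miss, whereas a fuller MESI enters E when no other cache holds the line):

```cpp
// The two tables above as code. Dash cells return BusOp::None but may
// still change state internally (e.g. E -> M on a local write).
enum class State { M, E, S, I };
enum class BusOp { None, Read, Invalidate, ReadInvalidate };

// CPU -> cache: what must go out on the bus for a local access.
BusOp cpuRead(State& s) {
    if (s == State::I) { s = State::S; return BusOp::Read; }            // (2)
    return BusOp::None;                                                 // dash
}
BusOp cpuWrite(State& s) {
    if (s == State::S) { s = State::M; return BusOp::Invalidate; }      // (1)
    if (s == State::I) { s = State::M; return BusOp::ReadInvalidate; }  // (3)
    s = State::M;                                              // dash: E -> M
    return BusOp::None;
}

// Bus -> cache: reacting to traffic snooped from other agents.
void busRead(State& s) {
    if (s == State::M) s = State::S;       // (4): flush dirty data, then share
    else if (s == State::E) s = State::S;  // dash: internal downgrade only
}
void busInvalidate(State& s) {
    if (s == State::M) { /* flush dirty data */ }  // (5)
    s = State::I;                          // dash for E/S: drop silently
}
```

The dash cells are where the protocol saves work: they either do nothing or change state purely locally, with no bus transaction.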

Many CPUs, particularly embedded controllers, have CPU-private memory, which could be a great candidate for thread-local storage; however, migrating a thread from one core to another would require chasing down all of its thread-local storage variables and copying them (somehow) to the new core's private memory.

Depending upon the workload, this may be viable, but for the general workload, minimizing the bus traffic and loosening the affinity is a win.

– mevets
  • This is the shared-bus model that MESI is usually described in terms of, but modern real CPUs don't work this way. The interconnect isn't a shared bus that all cores compete for access to. e.g. Intel Sandybridge-family use a ring bus (https://www.realworldtech.com/sandy-bridge/8/), or in the server versions from Skylake-X onward, a mesh. So all cores can have messages in flight to slices of L3 cache at once. They use directory-based coherence. There *is* some extra coherency overhead on an L3 miss, especially on a multi-socket system to make sure the other socket doesn't have it Modified. – Peter Cordes Mar 26 '21 at 15:34