
Most of the examples online explain coherence with a configuration similar to this: each core is connected to a single private cache (no multiple cache levels per core), and all the cores then connect to a common shared memory. How does coherence work if each core instead has this configuration: private L1 --> private L2 --> shared memory? Now, if I have to invalidate a block in proc2's L1 and get the modified copy from proc1's L2, how does that work?

If anybody can point me to good resources, that would be great.

  • I'm betting the answer will be related to https://stackoverflow.com/questions/62114759/what-cache-coherence-solution-do-modern-x86-cpus-use – Martheen Dec 30 '20 at 05:27
  • Thanks for the link @Martheen. I found it helpful. – BM- Dec 30 '20 at 05:54

1 Answer


This is implementation specific.

From the SoC / uncore perspective, the core behaves as a single entity: it can be snooped and will return a single response (and an updated copy of the line, if it needs to provide one).

Within the core, each cache can behave according to its design. A cache that is inclusive towards its upper levels (e.g. an inclusive L2 that holds all the data present in the L1) can serve as a snoop filter, since it knows whether a further snoop into the L1 is needed. If the line is in the upper levels, or if the L2 is not guaranteed to be inclusive, the core will have to issue another snoop to those levels. Alternatively, you can start out by snooping all levels in parallel, which may be faster in some cases but is more wasteful (in terms of power and cache access slots). Finally, for some upper-level caches it may be enough to send a snoop without waiting for a response (instruction caches, for example, usually can't have modified lines to write back), while others require a full response protocol.
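As a rough illustration of the snoop-filter idea (a minimal Python sketch; the class and field names are made up, not any real design), an L2 that is inclusive of the L1 can answer some external snoops without probing the L1 at all: an L2 tag miss means the line cannot be in the L1 either.

```python
class InclusiveL2:
    """Hypothetical inclusive L2: every line cached in the L1 is also tracked here."""

    def __init__(self):
        self.tags = {}        # line address -> MESI state held in the L2
        self.in_l1 = set()    # lines the L2 believes are also present in the L1

    def needs_inner_snoop(self, addr):
        """Decide whether an external snoop for `addr` must be forwarded to the L1."""
        if addr not in self.tags:
            # Inclusion guarantees the L1 cannot hold the line either,
            # so the snoop can be answered as a miss without probing the L1.
            return False
        # Probe the L1 only if it may hold a (possibly newer) copy of the line.
        return addr in self.in_l1
```

A non-inclusive L2 cannot take that first shortcut: it has to either keep a separate record of what the L1 holds or forward every external snoop inward.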

Keep in mind that most out-of-order designs need to snoop not only the caches but also certain buffers that hold in-flight operations, in case those need to be flushed. For example, x86 with its TSO policy will usually also need to snoop the load buffer to flush in-flight loads (since at their commit point the data they read is no longer guaranteed to be correct). Some store-buffer designs may also need to be snooped if they hold data that should be observable.
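To illustrate the load-buffer point (again a hypothetical sketch, not how any particular x86 core is built): when an invalidating snoop hits the address of a load that has already executed but not yet retired, the core can no longer guarantee that the value it consumed is still ordered correctly under TSO, so it marks the load for a flush and replay.

```python
class LoadBufferEntry:
    """Hypothetical in-flight load tracked by the load buffer."""

    def __init__(self, addr, executed):
        self.addr = addr          # cache-line address the load read from
        self.executed = executed  # data already obtained, but the load has not retired
        self.must_replay = False  # set when the consumed value may violate ordering

def snoop_load_buffer(load_buffer, snooped_addr):
    """Mark in-flight loads whose data may now be stale under TSO."""
    for entry in load_buffer:
        if entry.executed and entry.addr == snooped_addr:
            # Another core is taking ownership of this line; the value this load
            # already consumed may look "too old" at retirement, so the core will
            # flush the pipeline from this load onward and re-execute it.
            entry.must_replay = True
```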

If the L2 (or the lowest-level private cache in the core) is managing this internal snooping, it eventually needs to collect all the responses and decide on the overall response to send to the shared cache outside the core.
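One way to picture that "collect and combine" step (purely illustrative; the response encoding below is made up) is as a reduction over the per-structure answers: any dirty hit wins and supplies the data, otherwise any clean hit is reported, otherwise the core answers with a miss.

```python
MISS, HIT_CLEAN, HIT_DIRTY = 0, 1, 2   # made-up response encoding

def combine_snoop_responses(responses):
    """responses: list of (kind, data_or_None) pairs from the L1, L2, buffers, ..."""
    overall, data = MISS, None
    for kind, payload in responses:
        if kind == HIT_DIRTY:
            # A modified copy must supply the up-to-date data to the requester.
            overall, data = HIT_DIRTY, payload
        elif kind == HIT_CLEAN and overall == MISS:
            overall = HIT_CLEAN
    return overall, data
```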

To make matters worse, when multiple cache levels each track MESI states, those states don't have to agree. For example, you can have a modified line in the L1 while an older copy of it still resides in the L2. Snoops themselves may also come in different types, for example snoops for sharing data vs. snoops that invalidate shared copies when one core acquires ownership. Different snoop types may have different effects on different MESI states at different cache levels. Of course, this cannot break the basic guarantees of these snoops (an invalidating snoop will still guarantee that all copies of the line are invalidated), but some snoops can be more aggressive than strictly required.
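For a concrete feel of how the levels can disagree, here is a tiny hypothetical walk-through (the state tags and structure are illustrative only): the L1 holds the line Modified while the L2 still has an older copy, and an invalidating snoop must end with both copies gone, the dirty data written back from the L1, and the stale L2 copy simply dropped.

```python
def invalidating_snoop(line_state, l1_data):
    """Handle a snoop that must leave no valid copy of the line in this core.

    line_state: dict like {"L1": "M", "L2": "S"} -- the exact tag the L2 uses
    for its stale copy is design-dependent; only the L1 has the newest bytes.
    """
    writeback = None
    if line_state["L1"] == "M":
        # Only the L1 holds the up-to-date data, so it must accompany the response.
        writeback = l1_data
    line_state["L1"] = "I"
    line_state["L2"] = "I"   # the older L2 copy is invalidated, not written back
    return writeback
```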

Leeor