
Current setup:
Most recent Intel architectures today have a non-inclusive L3 cache where each slice (+CHA) includes a "snoop filter" that contains the location information an L3 directory would have provided if it were inclusive (this design choice is likely meant to keep coherence messages from taking over mesh bandwidth). Most also enable "memory directories" by default, which can be used to filter remote snoops or otherwise change the timing properties of the local and remote portions of a coherence transaction.

When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring and not L3+CHA. Cores copy the Source Address Decoder (SAD) registers that the L3 maintains; these registers determine which NUMA node is responsible for a physical address. Once the RFO reaches the home agent responsible, it decides if snoops must be sent to other sockets/cores and responds back to the caller (can do this in parallel). There is also OSB, which lets the L3 do speculative snooping if bandwidth is available.
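
To make sure I picture the SAD's role correctly, here is a toy model (purely illustrative; the real SAD rules are programmed by firmware and cover DRAM rules, interleave lists, MMIO ranges, etc., and the node count and interleave granularity below are just assumptions) of a physical address deterministically mapping to one home NUMA node:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of what the SAD registers express: each physical address maps
 * deterministically to one home NUMA node. The node count and interleave
 * granularity are arbitrary values chosen for illustration. */
#define NUM_NODES        2
#define INTERLEAVE_SHIFT 6   /* pretend interleaving happens at 64 B granularity */

static int sad_home_node(uint64_t paddr)
{
    return (int)((paddr >> INTERLEAVE_SHIFT) % NUM_NODES);
}

int main(void)
{
    uint64_t a = 0x12340000ULL, b = 0x12340040ULL;
    printf("%#llx -> node %d\n", (unsigned long long)a, sad_home_node(a));
    printf("%#llx -> node %d\n", (unsigned long long)b, sad_home_node(b));
    return 0;
}
```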

The "memory directory" is one or more bits located with the cache line data in DRAM that indicate whether another coherence domain might have a modified copy of the cache line.
These bits aren't updated for loads from local cores/caches because the L3/CHA will track that. After a write-back invalidation of an M-state cache line, the memory directory bit is cleared, since only one L3/CHA can have the cache line in the M state.
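
My mental model of those bits looks like the sketch below (the actual encoding and placement are not documented; all names here are invented):

```c
#include <stdbool.h>
#include <stdint.h>

/* Conceptual model only: the "memory directory" is metadata stored in DRAM
 * next to the line's data, consulted and updated by the home agent. */
struct mem_line {
    uint8_t data[64];
    bool    remote_may_have_copy;   /* the memory directory bit */
};

/* A remote socket gains ownership (e.g. via an RFO): set the bit so later
 * accesses know a cross-socket snoop may be required. Note that this update
 * is itself a write to memory. */
static void on_remote_ownership(struct mem_line *l) { l->remote_may_have_copy = true; }

/* Loads from local cores are tracked by the local L3/CHA snoop filter,
 * so the in-memory bit is left alone. */
static void on_local_load(struct mem_line *l) { (void)l; }

/* Write-back invalidation of an M-state line: only one L3/CHA could have
 * held it in M, so the bit can be cleared. */
static void on_writeback_invalidate(struct mem_line *l) { l->remote_may_have_copy = false; }
```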

Intel DC PMEM: from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 2.1.31 (I suppose this is about Memory Mode, although they don't specify it in the section):

On systems with multiple processors, a directory is used for cache coherence. This directory is implemented as a distributed in-memory directory, with the coherence state of each cache line stored in metadata within the line itself in memory.
In cases where there are cores in different processors repeatedly reading the same set of lines in the Intel Optane DC Persistent Memory Module, there will be several writes to the Intel Optane DC Persistent Memory Module recording the change in the coherence state each time.

This indicates PMM uses memory directories.

These writes are called “directory writes” and tend to be random in nature. As a result, several of these writes can lower the effective Intel Optane DC Persistent Memory Module bandwidth that is available to the application.
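
I imagine the pattern the manual describes looks roughly like the sketch below: threads pinned to cores on different sockets repeatedly read the same lines, so the home agent keeps rewriting the per-line directory state in (PMM-backed) memory. This assumes a two-socket system with libnuma (link with -lnuma -lpthread); the buffer size and iteration counts are arbitrary illustration values.

```c
#include <numa.h>
#include <pthread.h>
#include <stdint.h>

#define LINES 4096
static volatile uint8_t *buf;                 /* shared, read-only after allocation */

static void *reader(void *arg)
{
    int node = (int)(intptr_t)arg;
    numa_run_on_node(node);                   /* pin this thread to one socket */
    uint64_t sum = 0;
    for (int iter = 0; iter < 1000000; iter++)
        for (int i = 0; i < LINES; i++)
            sum += buf[i * 64];               /* touch one byte per cache line */
    return (void *)(uintptr_t)sum;
}

int main(void)
{
    if (numa_available() < 0)
        return 1;
    buf = numa_alloc_onnode(LINES * 64, 0);   /* home all lines on node 0 */
    pthread_t t0, t1;
    pthread_create(&t0, NULL, reader, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, reader, (void *)(intptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    numa_free((void *)buf, LINES * 64);
    return 0;
}
```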

Would normal DRAM also suffer from random directory writes in a similar setup?
Or does it not matter for DRAM, which has a write bandwidth of 48 GB/s, while PMM has only ~2.3 GB/s (1)?

Why does PMM need to use directory coherence protocol when the DRAM 'memory directory' exists?

Optane DC Persistent Memory Module may be accessed by different threads, and if these kind of patterns are observed, one option to consider is to change the coherence protocol for Intel Optane DC Persistent Memory Module regions from directory-based to snoop-based by disabling the directory system-wide.

Would RDMA requests to remote PMM need to go through remote DRAM as well?

  • As I see it, in PMM the DC memory is not a cache, the DRAM is. So PMM cannot be non-inclusive (there is only one PMM system, so it contains all the lines). Other DMA actors go through the system agent as usual and then to the DRAM and PMM as usual. I think the directory cache present in the PMM is needed because the PMM is bigger than the (directly mapped) DRAM cache, so a DRAM access doesn't tell which line in the PMM is really accessed (and the usual cache mechanism is thus insufficient). But without the PMM, the cache dir in the L3s suffices (the DRAM has no directory writes in itself). – Margaret Bloom Dec 16 '20 at 14:47
  • Yes, I'm pretty sure this only applies when using PMM in "memory mode", where *This is transparent to the operating system and applications* - it just looks like there's a huge amount of physical RAM. The manual says *The DRAM memory present in the system is being used as a memory-side cache* (12.1.1 Memory Mode). I don't know why they'd keep coherence state in the PMMs instead of in DRAM; it's not intended to survive a reboot. (The same paragraph explains that it encrypts data to/from PMM with a key that's discarded on reboot.) I didn't know Optane DC PM had a mode like that at all, neat. – Peter Cordes Dec 17 '20 at 15:56

1 Answer


Most recent Intel architectures today have a non-inclusive L3 cache where each slice (+CHA)

Processors with the server uncore design have had a non-inclusive L3 on a mesh interconnect since Skylake. Tiger Lake (TGL) is the first homogeneous (big cores only) microarchitecture with a client uncore design that includes a non-inclusive L3. See: Where data goes after Eviction from cache set in case of Intel Core i3/i7. But the CHA design isn't used in TGL.

includes a "snoop filter" that contains the location information an L3 directory would have provided if it were inclusive

A snoop filter is a directory. Both terms refer to the same hardware structure used to hold coherence information.

When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring

The on-chip ring interconnect doesn't adhere to the QPI or UPI specifications. These interconnects are actually significantly different from each other. There are dedicated interfacing units between the on-chip interconnect and external interconnects that convert between the message formats. Intel uses QPI/UPI for links between chips.

When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring and not L3+CHA.

You mean accessed from a core? All types of requests from a core to any address go through a caching agent, which could be the one collocated with that core or another CA in the same NUMA domain. When a CA receives a request, it sends it to the SAD (which is inside the CA) to determine which unit should service the request. At the same time, depending on the type of the request, it's also sent to the associated L3 slice (if present and enabled) for lookup. For example, if the request is to read a data cache line in the E/F/S state (RdData), then an L3 lookup operation is performed in parallel. If it was a read from the legacy I/O space, then no lookup is performed. If a lookup is performed and the result of the lookup is a miss, the output from the SAD is used to determine where to send the request.
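
Here is a rough sketch of that dispatch decision (all names and the stubbed lookups are invented for illustration; the real pipelines, SAD rule matching, and request tracking are far more involved):

```c
#include <stdbool.h>
#include <stdint.h>

enum req_type    { RD_DATA, RFO, LEGACY_IO_RD };
enum destination { SERVICED_BY_L3, LOCAL_HOME_AGENT, REMOTE_NODE_VIA_UPI };

/* Stubs standing in for the real hardware structures. */
static enum destination sad_decode(uint64_t paddr)
{
    /* pretend odd/even lines are homed on different sockets */
    return ((paddr >> 6) & 1) ? REMOTE_NODE_VIA_UPI : LOCAL_HOME_AGENT;
}
static bool l3_slice_hit(uint64_t paddr) { (void)paddr; return false; }

/* The caching agent consults the SAD for every request; for cacheable data
 * requests, the L3 slice lookup runs in parallel, and only on a miss does
 * the SAD result decide where the request is forwarded. */
static enum destination ca_dispatch(enum req_type type, uint64_t paddr)
{
    enum destination dest = sad_decode(paddr);     /* always consulted */

    if (type != LEGACY_IO_RD && l3_slice_hit(paddr))
        return SERVICED_BY_L3;                     /* parallel lookup hit */

    return dest;                                   /* miss (or no lookup): forward per SAD */
}
```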

Once the RFO reaches the home agent responsible, it decides if snoops must be sent to other sockets/cores and responds back to the caller (can do this in parallel).

A home agent (or the home agent functionality of a CHA) doesn't send snoops locally. After a miss in the L3, assuming the home snooping mode, the following happens:

  • The request is sent to the home agent that owns the line, which will ultimately service the request.
  • A snoop request is sent to the CA that owns the line if the line is homed in a NUMA domain that is different from the one in which the requestor exists.
  • A snoop request is sent to each IIO unit in the same NUMA domain as the requestor (because there is a cache in each IIO unit).
  • A snoop request is sent to each IIO unit in the home NUMA domain.

The HA then checks the directory cache (if supported and enabled); if that misses, it checks the directory in memory (if supported and enabled), and based on the result, it sends snoops to other NUMA domains.

All responses are collected by the HA, which then eventually sends back the requested line and updates the directory.

I have no idea what you mean by "can do this in parallel."
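
Putting the steps above together, here is a very rough sketch of what the home agent does once the request arrives (the directory cache, in-memory directory, and snoop fan-out are modeled as invented stubs; OSB, conflict handling, and response combining are all omitted):

```c
#include <stdbool.h>
#include <stdint.h>

enum dir_state { DIR_CLEAN, DIR_REMOTE_MAY_HAVE_COPY };   /* invented encoding */

/* Stubs for the structures the home agent consults or drives. */
static bool dir_cache_lookup(uint64_t paddr, enum dir_state *st) { (void)paddr; (void)st; return false; }
static enum dir_state dir_memory_read(uint64_t paddr)            { (void)paddr; return DIR_CLEAN; }
static void dir_memory_update(uint64_t paddr, enum dir_state st) { (void)paddr; (void)st; }
static void snoop_other_numa_nodes(uint64_t paddr)               { (void)paddr; }
static void send_line_to_requestor(uint64_t paddr)               { (void)paddr; }

/* Home agent handling an RFO that missed in the requestor's L3. */
static void ha_handle_rfo(uint64_t paddr)
{
    enum dir_state st;
    if (!dir_cache_lookup(paddr, &st))     /* miss in the directory cache... */
        st = dir_memory_read(paddr);       /* ...fall back to the in-memory directory */

    if (st == DIR_REMOTE_MAY_HAVE_COPY)
        snoop_other_numa_nodes(paddr);     /* cross-socket snoops only when needed */

    send_line_to_requestor(paddr);

    /* Record that another node may now hold the line. When the line is homed
     * in PMM, this update is one of the "directory writes" the manual warns about. */
    dir_memory_update(paddr, DIR_REMOTE_MAY_HAVE_COPY);
}
```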

The "memory directory" is one or more bits located with the cache line data in DRAM that indicate whether another coherence domain might have a modified copy of the cache line.

It's not just about tracking modified copies, but rather the presence of lines in any state.

Note that all of the caching agents we're talking about here are in the same coherence domain. It's just one coherence domain. I think you meant another NUMA node.

Would normal DRAM also suffer from random directory writes in a similar setup?

Yes. The impact can be significant even for DRAM if there happen to be too many accesses to the directory and the directory cache is not supported or is disabled. But the impact is substantially larger on 3D XPoint because writes have much lower row buffer locality (even in general, not just directory writes) and the precharge time of 3D XPoint is much higher than that of DRAM.

Why does PMM need to use directory coherence protocol when the DRAM 'memory directory' exists?

The coherence state is stored with each line whether it's in DRAM or 3D XPoint. It takes only one transaction to read both the state and the line, instead of potentially two transactions had all of the directory been stored in DRAM. I'm not sure which design is better performance-wise and by how much, but storing the state with each line is certainly simpler.
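
To illustrate the trade-off, here are the two layouts side by side (purely illustrative; the real metadata encoding on the media is not public):

```c
#include <stdint.h>

/* (a) State stored with each line: one media transaction returns (or updates)
 * both the data and its directory metadata. */
struct line_with_metadata {
    uint8_t data[64];
    uint8_t dir_state;     /* directory/coherence metadata travels with the line */
};

/* (b) Hypothetical alternative with a separate directory kept in DRAM:
 * servicing a request may cost two transactions, one against the DRAM
 * directory array and one against the 3D XPoint line itself. */
struct split_layout {
    uint8_t *pmm_lines;    /* line data in 3D XPoint */
    uint8_t *dram_dir;     /* one metadata entry per line, stored in DRAM */
};
```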

Would RDMA requests to remote PMM need to go through remote DRAM as well?

I don't understand the question. Why do you think it has to go through DRAM if the address of the request is mapped to a PMM?

Hadi Brais
  • I'm having trouble understanding why the PMM needs a cache directory in the first place. It needs to keep track of the lines in the DRAM cache but why keep track of which core accessed them and their state? Couldn't the usual cache subsystem do that and the PMM just appear as a very big set of DIMMs? – Margaret Bloom Dec 19 '20 at 11:13
  • @MargaretBloom The memory-level directory is used to support the "home snooping with directory" coherence mode in systems with multiple NUMA nodes. The in-memory directory tracks whether a line may exist in other NUMA nodes, and in the more recent microarchitectures, it can track the coherence state the line may be in. Every 64-byte line in the entire physical main memory has to be tracked. This is in contrast to the L3-level directory, which tracks which core in the same NUMA domain may have a copy of the line. – Hadi Brais Dec 19 '20 at 13:00
  • That makes perfect sense. Thank you Hadi! – Margaret Bloom Dec 19 '20 at 16:41