Current setup:
Most recent Intel server architectures have a non-inclusive L3 cache where each slice (+CHA) includes a "snoop filter" that holds the location information an inclusive L3 directory would have provided (a design choice likely made to keep coherence messages from taking over mesh bandwidth). Most also enable "memory directories" by default, which can be used to filter remote snoops or otherwise change the timing properties of the local and remote portions of a coherence transaction.
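The snoop filter's job can be sketched in code. This is a hedged toy model, not Intel's implementation; the class, method names, and dict-of-sets layout are all invented for illustration:

```python
# Toy model of a snoop filter in a non-inclusive L3: a tag-only record of
# which cores *may* hold each line, consulted so the CHA snoops only those
# cores instead of broadcasting to everyone. Names are illustrative.

class SnoopFilter:
    def __init__(self):
        self.presence = {}  # line address -> set of core ids that may hold it

    def on_fill(self, addr, core):
        # A core brought the line into its private cache; record it.
        self.presence.setdefault(addr, set()).add(core)

    def on_eviction(self, addr, core):
        # The core no longer holds the line; drop it from the filter.
        self.presence.get(addr, set()).discard(core)

    def cores_to_snoop(self, addr, requester):
        # Only cores the filter names need a snoop message; all others are
        # provably clean, which is what saves mesh bandwidth.
        return self.presence.get(addr, set()) - {requester}
```

The point of the structure is the last method: a miss in the filter proves no private cache holds the line, so no snoop traffic is generated at all.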
When a memory location belonging to a different socket is accessed, the RFO is sent directly over QPI/UPI rather than to the local L3+CHA. Cores keep copies of the Source Address Decoder (SAD) registers that the L3/CHA maintains; these registers determine which NUMA node is the home for a given physical address. Once the RFO reaches the responsible home agent, it decides whether snoops must be sent to other sockets/cores and responds to the caller (it can do both in parallel). There is also Opportunistic Snoop Broadcast (OSB), which lets the L3/CHA snoop speculatively when bandwidth is available.
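A minimal sketch of the lookup a SAD performs, assuming a simple round-robin interleave across nodes; the real register layout and interleave granularity are different and the constants here are invented:

```python
# Hypothetical sketch of a Source Address Decoder (SAD) lookup: map a
# physical address to the NUMA node (home agent) that owns it. The
# page-granular round-robin interleave is an assumption for illustration.

SAD_INTERLEAVE_GRANULARITY = 4096  # assumed interleave unit, in bytes
NUM_NODES = 2                      # assumed two-socket system

def home_node(phys_addr: int) -> int:
    """Return the NUMA node responsible for phys_addr under a simple
    round-robin interleave across nodes."""
    return (phys_addr // SAD_INTERLEAVE_GRANULARITY) % NUM_NODES
```

Because every core holds its own copy of these decode rules, it can route an RFO to the correct home agent without first consulting the L3.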
The "memory directory" is one or more bits located with the cache line data in DRAM that indicate whether another coherence domain might have a modified copy of the cache line.
These bits aren't updated for loads from local cores/caches because the L3/CHA already tracks those.
After a write-back invalidation of an M-state cache line, the memory directory bit is cleared, since only one L3/CHA can hold the line in M state.
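The three rules above (set on remote access, skip local loads, clear on M-state write-back) can be captured in a toy model. The one-bit encoding and all names are illustrative assumptions, not Intel's actual in-DRAM format:

```python
# Toy model of the memory-directory bit lifecycle: a bit stored alongside
# the line in DRAM, tracking whether another coherence domain might hold
# a copy. Counts DRAM writes caused purely by directory-bit updates.

class DirLine:
    def __init__(self):
        self.remote_bit = False  # "another socket might have this line"
        self.dram_writes = 0     # DRAM writes caused only by bit updates

    def local_load(self):
        pass  # tracked by the local L3/CHA; no directory update needed

    def remote_fetch(self):
        if not self.remote_bit:
            self.remote_bit = True
            self.dram_writes += 1  # the flipped bit lives in DRAM

    def writeback_invalidate_m(self):
        # Only one L3/CHA can hold the line in M state, so after the
        # write-back no remote copy can exist and the bit is cleared.
        if self.remote_bit:
            self.remote_bit = False
            self.dram_writes += 1
```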
Intel DC PMEM:
From the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 2.1.31:
(I suppose this applies to Memory Mode, although the section doesn't specify it.)
On systems with multiple processors, a directory is used for cache coherence. This directory is implemented as a distributed in-memory directory, with the coherence state of each cache line stored in metadata within the line itself in memory.
In cases where there are cores in different processors repeatedly reading the same set of lines in the Intel Optane DC Persistent Memory Module, there will be several writes to the Intel Optane DC Persistent Memory Module recording the change in the coherence state each time.
These writes are called “directory writes” and tend to be random in nature. As a result, several of these writes can lower the effective Intel Optane DC Persistent Memory Module bandwidth that is available to the application.
This indicates PMM uses memory directories.
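A hedged toy model of that amplification: treat "which remote socket last read the line" as the in-memory coherence state, and charge one metadata write to the module for every state change. Real Intel directory states are richer than this; the function and its single-owner encoding are purely illustrative:

```python
# Toy model of "directory write" amplification on PMM: cores on different
# sockets take turns reading one PMM-resident line, and each change of the
# line's in-memory coherence state costs a metadata write to the module.

def directory_writes(reader_sequence, home_socket=0):
    """Count metadata writes for a sequence of socket ids reading one line."""
    state = None       # which remote socket the directory last recorded
    writes = 0
    for socket in reader_sequence:
        if socket == home_socket:
            continue   # local reads are tracked by the L3/CHA instead
        if state != socket:
            state = socket   # coherence state changed...
            writes += 1      # ...and the change is written back to PMM
    return writes
```

With two remote sockets alternating reads of one line, `directory_writes([1, 2] * 8)` returns 16: a pure read workload that generates 16 writes to a medium whose write bandwidth is a small fraction of DRAM's.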
Would normal DRAM also suffer from random directory writes in a similar setup?
Or does it not matter for DRAM, which has a write bandwidth of ~48 GB/s while PMM has only ~2.3 GB/s (1)?
Why does PMM need to use a directory coherence protocol when the DRAM 'memory directory' already exists?
From the same manual section:
Optane DC Persistent Memory Module may be accessed by different threads, and if these kind of patterns are observed, one option to consider is to change the coherence protocol for Intel Optane DC Persistent Memory Module regions from directory-based to snoop-based by disabling the directory system-wide.
Would RDMA requests to remote PMM need to go through remote DRAM as well?