Is anyone aware of any shortcomings in trying to use Intel Optane DC Persistent Memory (DCPMM) in App Direct Mode (that is, as non-volatile memory) and reading/writing it using the Write Through (WT) or Uncacheable (UC) memory types? The idea is to use regular memory as non-volatile storage (data is not lost in case of failure), so having dirty cache lines is not ideal since the cache is volatile. There are multiple links that show examples using Write Back (WB) or Write Combining (WC) with non-temporal-access (NTA) instructions, or WB with CLFLUSHOPT or CLWB after the stores. Are there any important drawbacks of WT/UC compared to WB/WC, other than the bandwidth lost by not writing an entire cache line to memory at a time?
1 Answer
(This is mostly speculation, I haven't done any performance testing with Optane DC PM, and only read about UC or WT for DRAM occasionally. But I think enough is known about how they work in general to say it's probably a bad idea for many workloads.)
Further reading about Optane DC PM DIMMs: https://thememoryguy.com/whats-inside-an-optane-dimm/ - they include a wear-leveling remapping layer like an SSD.
Also related: "When I test AEP memory, I found that flushing a cacheline repeatedly has a higher latency than flushing different cachelines. I want to know what caused this phenomenon. Is it wear leveling mechanism?" on the Intel forums. That would indicate that repeated writes to the same cache line might be even worse than you might expect.
UC also implies strong ordering, which would hurt out-of-order exec, I think, and I think it also stops you from using NT stores for full-line writes. It would also totally destroy read performance, so I don't think it's worth considering.
WT is maybe worth considering as an alternative to `clwb` (assuming it actually works with NV memory), but you'd still have to be careful about compile-time reordering of stores. `_mm_clwb` is presumably a compiler memory barrier that would prevent such problems.
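For concreteness, a minimal sketch of the WB + `clwb` pattern that WT is being compared against (illustrative only; assumes a pointer into an App Direct mapping that is cached write-back, and a build with CLWB support, e.g. `-mclwb`):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch only: store a value to persistent memory mapped write-back,
   then explicitly write the dirty line back and order it.  With WT the
   store itself would go straight to memory and the clwb would not be
   needed, at the cost of every store becoming a memory write. */
static inline void persist_u64(uint64_t *pmem_dst, uint64_t value)
{
    *pmem_dst = value;       /* normal cached store                   */
    _mm_clwb(pmem_dst);      /* write the line back, keep it cached   */
    _mm_sfence();            /* order it before later stores          */
}
```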
In a store-heavy workload, you'd expect serious slowdowns in writes, though. Per-core memory bandwidth is very much limited by the number of outstanding requests. Making each request smaller (only 8 bytes or so instead of a whole line) doesn't make it appreciably faster. The vast majority of the time is spent getting the request through the memory hierarchy and waiting for the address lines to select the right place, not in the actual burst transfer over the memory bus. (This is pipelined, so with multiple full-line requests to the same DRAM page a memory controller can spend most of its time transferring data, not waiting, I think. Optane / 3DXPoint isn't as fast as DRAM so there may be more waiting.)
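As a rough back-of-the-envelope illustration of that limit (the latency and buffer count below are placeholder assumptions, not measured Optane DC PM numbers): per-core bandwidth is roughly `outstanding_requests * request_size / latency`, so shrinking the request size with everything else fixed shrinks bandwidth proportionally.

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder assumptions for illustration only, not measured numbers. */
    double outstanding = 10.0;   /* assumed outstanding requests per core (fill buffers) */
    double latency_ns  = 300.0;  /* assumed per-request latency in nanoseconds           */

    /* bytes per nanosecond == GB/s */
    double bw_line = outstanding * 64.0 / latency_ns;  /* full 64-byte requests */
    double bw_word = outstanding *  8.0 / latency_ns;  /* 8-byte requests       */

    printf("64 B requests: ~%.2f GB/s, 8 B requests: ~%.2f GB/s\n", bw_line, bw_word);
    return 0;
}
```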
So for example, storing contiguous `int64_t` or `double` would take 8 separate stores per 64-byte cache line, unless you (or the compiler) vectorizes. With WT instead of WB + `clwb`, I'd guess that would be about 8x slower. This is not based on any real performance details about Optane DC PM; I haven't seen memory latency / bandwidth numbers, and I haven't looked at WT performance. I have seen occasional papers that compare synthetic workloads with WT vs. WB caching on real Intel hardware with regular DDR DRAM, though. I think WT is usable if multiple writes to the same cache line aren't typical for your code. (But normally that's something you want to do and optimize for, because WB caching makes it very cheap.)
If you have AVX512, that lets you do full-line 64-byte stores, if you make sure they're aligned. (Which you generally want for performance with 512-bit vectors anyway).
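A minimal sketch of that idea (illustrative only; `dst64` must be 64-byte aligned and point into the persistent-memory mapping):

```c
#include <immintrin.h>

/* Sketch only: write a whole 64-byte line with one aligned AVX-512 store
   instead of eight scalar 8-byte stores.  Under WT each such store is a
   single full-line memory write; under WB you would still pair it with
   _mm_clwb, or use an NT store instead. */
static inline void store_full_line(void *dst64, const void *src)
{
    __m512i v = _mm512_loadu_si512(src);  /* load 64 bytes (unaligned OK) */
    _mm512_store_si512(dst64, v);         /* one aligned full-line store  */
}
```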

- I seem to recall that the Optane thinks MCE is a good way to provide status information to the client. It is, if you happen to be writing machine code for a single-purpose embedded system, but beyond that, imprecise exceptions are a bit difficult to deal with. In kernels and hypervisors that support/emulate this, they set a known good state (figuratively setjmp), perform a bunch of transactions, do something to ensure all transactions are committed; otherwise longjmp on MCE. – mevets Jan 03 '20 at 07:05
- Thank you @PeterCordes, AVX512 instructions sound like a good way to test performance; you don't have to waste cycles flushing cache lines to memory as with WB. Could be a good case for measuring bandwidth and latency (when not using the Intel PMU counters). – AdvSphere Jan 03 '20 at 14:41
- @AdvSphere: Last time I was looking at `libpmem` code out of curiosity, I seem to remember it using NT stores in one loop, *instead* of CLWB. So NT stores on WB memory (including AVX512 64-byte NT stores) are another way to avoid CLFLUSHOPT overhead for large stores, if you're going to read again soon. (See the NT-store sketch after these comments.) – Peter Cordes Jan 03 '20 at 14:49
- @PeterCordes, yes, that case is definitely a good alternative to flushing. It'll be interesting to see how using NTA stores (through the WC buffer) on WB memory performs against WT with AVX512 instructions. – AdvSphere Jan 03 '20 at 14:53
- One should also consider that Optane uses wear leveling through renaming (I do not remember the block size, but I *think* it is larger than 64B). While the memory controller could buffer writes (similar to write-coalescing buffers) to reduce repeated writes to the same chunk, there would still be "excessive" rewriting. At least some targeted workloads have a more transactional nature, so explicit commit of writes to persistent memory is reasonable. – Paul A. Clayton Jan 05 '20 at 15:58
- @PaulA.Clayton: It does? I had been assuming that 3D-XPoint DIMMs would be lower overhead. Do we know for sure that there's a controller doing remapping / wear leveling between the memory bus and the 3DXPoint storage in Optane DC *PM* devices? The Optane name includes plain SSDs that use 3DXPoint as well (Optane DC without the PM), so it can be easy to mix up what's being discussed. (Thanks, Intel.) – Peter Cordes Jan 05 '20 at 16:21
- @PeterCordes Yep. "All DCPMM wear leveling is internal to DCPMM and under control of the DCPMM controller. There is nothing wear level related visible from the host." (from [Intel](https://forums.intel.com/s/question/0D50P00004PboHGSAZ/when-i-test-aep-memory-i-found-that-flushing-a-cacheline-repeatedly-has-a-higher-latency-than-flushing-different-cachelines-i-want-to-know-what-caused-this-phenomenon-is-it-wear-leveling-mechanism-?language=en_US)). See also: https://thememoryguy.com/whats-inside-an-optane-dimm/ – Paul A. Clayton Jan 06 '20 at 18:37
- @PeterCordes The DIMMs are (from what I gather) substantially lower overhead for reads (writes are more throughput-oriented). The DIMM interface also tends to provide higher bandwidth (I suspect) and lower latency. – Paul A. Clayton Jan 06 '20 at 18:42
- @PaulA.Clayton: thanks for the links, included in this answer. Especially the fact that repeated writes to the same line are worse than scattered writes makes WT likely to be a bad choice for some workloads. – Peter Cordes Jan 06 '20 at 18:46