
Why does the running time of the code below decrease when I increase kNumCacheLines?

In every iteration, the code modifies one of kNumCacheLines cache lines, writes the line back to the DIMM with the clwb instruction, and blocks until the store reaches the memory controller with sfence. This example requires an Intel Skylake-server or newer Xeon, or an Ice Lake client processor.

#include <stdlib.h>
#include <stdint.h>

// clwb is encoded as 66 0F AE /6, i.e. a 66 prefix on xsaveopt (0F AE /6),
// so this macro emits clwb even with assemblers that don't know the mnemonic.
#define clwb(addr) \
  asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)));

static constexpr size_t kNumCacheLines = 1;

int main() {
  uint8_t *buf = new uint8_t[kNumCacheLines * 64];
  size_t data = 0;
  for (size_t i = 0; i < 10000000; i++) {
    size_t buf_offset = (i % kNumCacheLines) * 64;
    buf[buf_offset] = data++;               // dirty one cache line
    clwb(&buf[buf_offset]);                 // write it back toward the DIMM
    asm volatile("sfence" ::: "memory");    // order the flush before later stores
  }

  delete [] buf;
}

(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).
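
For reference, a minimal sketch of the same loop using those intrinsics (assuming GCC or Clang, compiling with -mclwb so that <immintrin.h> exposes _mm_clwb):

#include <stdlib.h>
#include <stdint.h>
#include <immintrin.h>  // _mm_clwb, _mm_sfence (needs -mclwb with GCC/Clang)

static constexpr size_t kNumCacheLines = 1;

int main() {
  uint8_t *buf = new uint8_t[kNumCacheLines * 64];
  size_t data = 0;
  for (size_t i = 0; i < 10000000; i++) {
    size_t buf_offset = (i % kNumCacheLines) * 64;
    buf[buf_offset] = data++;
    _mm_clwb(&buf[buf_offset]);  // write the dirty line back
    _mm_sfence();                // order the flush before later stores
  }

  delete [] buf;
}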

Here are some performance numbers on my Skylake Xeon machine, reported by running `time ./bench` with different values of kNumCacheLines:

kNumCacheLines  Time (seconds)
1               2.00
2               2.14
3               1.74
4               1.82
5               1.00
6               1.17
7               1.04
8               1.06

Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But it is one of the slowest.

As an explanation for the unintuitive slowdown: it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of the higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes (i.e. four cache lines) from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].

Note that because buf is smaller than 4 KB, all accesses use the same DIMM (assuming the buffer is aligned so that it doesn't cross a page boundary).
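
A minimal sketch of one way to guarantee that alignment (assuming 64-byte cache lines), instead of relying on what new[] happens to return:

#include <stddef.h>  // size_t
#include <stdint.h>  // uint8_t

static constexpr size_t kNumCacheLines = 1;

// alignas(64) puts the buffer on a cache-line boundary, so no 64-byte line
// inside it can straddle a page boundary.
alignas(64) static uint8_t buf[kNumCacheLines * 64];

(std::aligned_alloc(64, kNumCacheLines * 64) from <cstdlib> in C++17 would work too, if you prefer heap allocation; pair it with std::free rather than delete[].)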

asked by Anuj Kalia, edited by Peter Cordes
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/200768/discussion-on-question-by-anuj-kalia-low-performance-with-clwb-cacheline-write). – Samuel Liew Oct 13 '19 at 06:54
  • I don't think "Note that because buf is smaller than 4 KB, all accesses use the same DIMM." is correct. How are you drawing that conclusion? Mapping from physical addresses to DRAM addresses isn't linear: on lots of hardware alternating cache lines go to alternating DIMMs, on others the granularity is 256 bytes, etc. Determining the mapping on your system is an experiment in itself. – BeeOnRope Feb 21 '20 at 03:13
  • You can be sure everything is going to one DIMM by pulling out the other DIMMs :). – BeeOnRope Feb 21 '20 at 03:14
  • Try to disable all hardware prefetchers and see if the results are any different. – Hadi Brais Feb 21 '20 at 03:20

1 Answer


This is probably fully explained by Intel's CLWB instruction invalidating the cache line: it turns out SKX runs clwb the same as clflushopt, i.e. as a stub implementation for forward compatibility, so persistent-memory software can start using it without checking CPU feature levels.
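
For reference, a minimal sketch of that feature check, assuming GCC or Clang's <cpuid.h>; CPUID leaf 7 reports CLFLUSHOPT in EBX bit 23 and CLWB in bit 24:

#include <cpuid.h>  // __get_cpuid_count (GCC/Clang)
#include <stdio.h>

int main() {
  unsigned eax, ebx, ecx, edx;
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
    // CPUID.(EAX=7, ECX=0):EBX bit 23 = CLFLUSHOPT, bit 24 = CLWB
    printf("clflushopt: %u\n", (ebx >> 23) & 1);
    printf("clwb:       %u\n", (ebx >> 24) & 1);
  }
}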

More cache lines means more memory-level parallelism for reloading the invalidated lines before the next store, or it means the flush has finished by the time we try to reload. One or the other; there are a lot of details I don't have a specific explanation for.

In each iteration, you store a counter value into a cache line and clwb it (and sfence). The previous activity on that cache line was kNumCacheLines iterations ago.

We were expecting that these stores could just commit into lines that were already in Exclusive state, but in fact they're going to be Invalid with eviction probably still in flight down the cache hierarchy, depending on exactly when sfence stalls, and for how long.

So each store needs to wait for an RFO (Read For Ownership) to get the line back into cache in Exclusive state before it can commit from the store buffer to L1d.

It seems that you're only getting a factor of 2 speedup from using more cache lines, even though Skylake(-X) has 12 LFBs (i.e. can track 12 in-flight cache lines incoming or outgoing). Perhaps sfence has something to do with that.


The big jump from 4 to 5 lines is surprising: basically two levels of performance, not a continuous transition. That lends some weight to the hypothesis that it's something to do with the store having made it all the way to DRAM before we try to reload, rather than having multiple RFOs in flight. Or at least it casts doubt on the idea that it's just MLP for RFOs. CLWB forcing eviction is pretty clearly key, but the specific details of exactly what happens and why there's any speedup are pure guesswork on my part.

A more detailed analysis might tell us something about microarchitectural details if anyone wants to do one. This hopefully isn't a very normal access pattern so probably we can just avoid doing stuff like this most of the time!

(Possibly related: apparently repeated writes to the same line of Optane DC PM memory are slower than sequential writes, so you don't want write-through caching or an access pattern like this on that kind of non-volatile memory either.)

answered by Peter Cordes
  • @HadiBrais: this is in a 10000000 iteration repeat loop over an 8-line buffer. Having all 8 lines hot in cache or not is an inconsequential startup detail compared to re-touching the same line that clwb evicted last iteration (or 8 iterations ago). – Peter Cordes Feb 21 '20 at 02:29
  • It doesn't strike me as a usual MLP-type effect. Why would 2 lines not be nearly 2x as fast? The fastest time is 100 ns per iteration (1.00 s / 10,000,000 iterations), a full DRAM miss, not really consistent with any MLP. Essentially all the speedup occurs from 4 to 5, with zero (negative actually) after that, and very little before. – BeeOnRope Feb 21 '20 at 02:51
  • The OP's claim is that `sfence` waits for the `clwb` to "complete" in some sense, which may be true: Intel's documentation seems to imply it at least in the case of persistent memory. That would rule out most types of MLP. I suppose it would be implemented by holding the LFB until the write completes and that status is sent back to the core, freeing the LFB (this is a theory for why server NT stores are so slow). – BeeOnRope Feb 21 '20 at 02:57
  • Oh right. I changed my mind, I think your answer is correct. There is one missing piece of the puzzle though, how is the RFO parallelism happening? Do we know for sure that RFOs can occur in parallel across a store fence? The other possible explanation is that the L2 prefetcher is prefetching the other lines in the E state. Even if there is only one demand RFO at a time, they may hit in the L2 cache in this way. I think we can check which explanation is correct by disabling the hardware prefetchers. – Hadi Brais Feb 21 '20 at 03:01
  • @HadiBrais: I don't know how / why we're getting a speedup. Maybe it's not parallelism per-se, but just that by the time we get back to a line it's finished flushing. – Peter Cordes Feb 21 '20 at 03:03