Why does the running time of the code below decrease when I increase kNumCacheLines?
In every iteration, the code modifies one of kNumCacheLines cache lines, writes the line back to the DIMM with the clwb instruction, and uses sfence to block until the store reaches the memory controller. This example requires an Intel Skylake-server or newer Xeon, or an Ice Lake client processor.
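#include <stdlib.h>
#include <stdint.h>

// CLWB shares its opcode with XSAVEOPT plus a 0x66 prefix, so this
// assembles even when the assembler does not know the clwb mnemonic.
#define clwb(addr) \
    asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)))

static constexpr size_t kNumCacheLines = 1;

int main() {
    uint8_t *buf = new uint8_t[kNumCacheLines * 64];
    size_t data = 0;
    for (size_t i = 0; i < 10000000; i++) {
        size_t buf_offset = (i % kNumCacheLines) * 64;  // cycle through the lines
        buf[buf_offset] = data++;                       // modify the line
        clwb(&buf[buf_offset]);                         // write the line back
        asm volatile("sfence" ::: "memory");            // fence after the write-back
    }
    delete [] buf;
}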
(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).
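For reference, the intrinsics mentioned in the note could be used roughly as follows. This is only a sketch, assuming GCC or Clang on x86-64; cpu_has_clwb and store_and_write_back are hypothetical helper names, not part of the original benchmark. The target attribute lets the helper compile without passing -mclwb, and the CPUID check (leaf 7, subleaf 0, EBX bit 24) avoids executing clwb on a CPU that lacks it.

```cpp
#include <cassert>
#include <cpuid.h>      // __get_cpuid_count (GCC/Clang, x86 only)
#include <immintrin.h>  // _mm_clwb, _mm_sfence
#include <stdint.h>

// Hypothetical helper: true if CPUID.(EAX=7,ECX=0):EBX bit 24 (CLWB) is set.
inline bool cpu_has_clwb() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;  // CPUID leaf 7 is not available
    }
    return ((ebx >> 24) & 1u) != 0;
}

// Hypothetical helper: same store / clwb / sfence sequence as the loop body,
// written with intrinsics instead of inline asm.
// Only call this when cpu_has_clwb() returns true.
__attribute__((target("clwb")))
inline void store_and_write_back(uint8_t *line, uint8_t value) {
    assert(line != nullptr);
    *line = value;    // modify the cache line
    _mm_clwb(line);   // write the line back without necessarily evicting it
    _mm_sfence();     // fence after the write-back
}
```

In the benchmark loop, the three statements after computing buf_offset would become a single call such as store_and_write_back(&buf[buf_offset], data++), guarded once at startup by cpu_has_clwb().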
Here are some performance numbers on my Skylake Xeon machine, reported by running time ./bench with different values of kNumCacheLines:
kNumCacheLines    Time (seconds)
1                 2.00
2                 2.14
3                 1.74
4                 1.82
5                 1.00
6                 1.17
7                 1.04
8                 1.06
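To make the numbers easier to compare, they can be converted to an average cost per iteration. The sketch below assumes the whole running time is spent in the 10,000,000-iteration loop (time ./bench also includes process startup, so the result is only approximate); ns_per_iteration is a hypothetical helper name.

```cpp
#include <cstddef>

// Loop count taken from the benchmark above.
constexpr std::size_t kIterations = 10000000;

// Hypothetical helper: average time per store + clwb + sfence, in nanoseconds.
constexpr double ns_per_iteration(double total_seconds) {
    return total_seconds * 1e9 / static_cast<double>(kIterations);
}
```

With the measurements above, 2.00 seconds corresponds to about 200 ns per iteration for kNumCacheLines = 1, and 1.00 seconds to about 100 ns per iteration for kNumCacheLines = 5.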
Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But it is one of the slowest.
As an explanation for the unintuitive slowdown, it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes (four 64-byte cache lines) from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].
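The arithmetic behind that correlation can be written down as a small sketch. The 256-byte figure is the one cited above; the constant names and working_set_fits are hypothetical, and the code only restates the hypothesis rather than confirming it.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;      // cache line size used by the benchmark
constexpr std::size_t kWpqBytesPerThread = 256;  // figure cited from arXiv:1908.03583, Section 5.3

// Number of whole cache lines that fit in the per-thread queue space: 256 / 64.
constexpr std::size_t kLinesInWpq = kWpqBytesPerThread / kCacheLineBytes;
static_assert(kLinesInWpq == 4, "256 bytes hold four 64-byte lines");

// Hypothetical helper: does a working set of num_lines fit in that space?
constexpr bool working_set_fits(std::size_t num_lines) {
    return num_lines <= kLinesInWpq;
}
```

Under this reading, kNumCacheLines values of 1 through 4 fit within the cited 256 bytes and 5 is the first value that does not, which is where the running time drops.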
Note that because buf is smaller than 4 KB, all accesses use the same DIMM (assuming it's aligned so that it doesn't cross a page boundary).
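The alignment assumption is not guaranteed by new uint8_t[...], which typically returns only 16-byte-aligned memory on x86-64. A minimal sketch of one way to make it hold, assuming a POSIX system and 4 KB pages; alloc_page_aligned and within_one_page are hypothetical helper names.

```cpp
#include <cassert>
#include <stdint.h>
#include <stdlib.h>  // posix_memalign, free (POSIX)

constexpr size_t kPageBytes = 4096;  // assumed page size

// Hypothetical helper: returns a buffer that starts on a page boundary,
// or nullptr on failure. Release it with free(), not delete[].
inline uint8_t *alloc_page_aligned(size_t bytes) {
    void *p = nullptr;
    if (posix_memalign(&p, kPageBytes, bytes) != 0) {
        return nullptr;
    }
    return static_cast<uint8_t *>(p);
}

// Hypothetical helper: true if [p, p + bytes) lies inside a single page.
inline bool within_one_page(const uint8_t *p, size_t bytes) {
    assert(bytes > 0);
    uintptr_t start = reinterpret_cast<uintptr_t>(p);
    return start / kPageBytes == (start + bytes - 1) / kPageBytes;
}
```

In the benchmark, buf could then be obtained with alloc_page_aligned(kNumCacheLines * 64) and released with free(buf), which also makes every 64-byte slot start on a cache line boundary.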