
I am trying to understand the read/write performance of the clwb instruction, and in particular how its cost differs between writing to a cache line and only reading it. I expect the time taken in the write case to be higher than in the read case. To test this, here is a small code snippet that I am running on an Intel Xeon CPU (Skylake), using non-volatile memory (NVM) as the backing store for the reads and writes.

/* nvm_alloc allocates memory on NVM */
uint64_t *array = (uint64_t *) nvm_alloc(pool, 512);
uint64_t *p = &array[0];
/* separated p & q by the size of write unit in Optane (256B) */
uint64_t *q = &array[32];

uint64_t time_1 = 0;
uint64_t time_2 = 0;
uint64_t start;

volatile uint64_t x;
for(int i = 0; i < 1000000; i++)
{
        /* issues an mfence instruction */
        mfence();
        /* this is for the read case, bring p into cache */
        /* commented read case */
        //x = *p;
        /* this is for the write case, update cacheline containing p */
        *p = *p + 1;
        *q = *q + 1;
        /* rdtscp here to flush instruction pipeline */
        start = rdtscp();
        /* issue clwb on cacheline containing p */
        clwb(p);
        time_1 += rdtsc() - start;

        start = rdtsc();
        clwb(q);
        time_2 += rdtsc() - start;
}

As clwb doesn't explicitly evict the cache line, the reads in subsequent iterations can presumably be served from the cache itself. In the write case, the cache line is modified in each iteration and clwb is then issued to write it back. However, the time taken for writes is almost equal to the read case, which I am unable to understand. Should the time for the write not include the time to write back the dirty cache line to memory (or to the memory controller)?

skm
  • Unfortunately, CLWB isn't guaranteed to leave the cache line present; on SKX it runs the same as CLFLUSHOPT. Implementing it that way is arguably better than making future users of the instruction check for CPU support to avoid SIGILL, though. – Peter Cordes Mar 10 '20 at 07:52
  • @PeterCordes plus the timing code is wrong because `rdtsc` is not ordered with `clwb` at both compile-time and run-time. – Hadi Brais Mar 10 '20 at 16:00
  • @HadiBrais, could you explain what you mean? I have updated the code to reflect the actual measurement. I am writing to two different cache lines (spaced by the write unit size in Optane) and using rdtsc for all but the first measurement, as I want to interleave the two clwb operations. – skm Mar 10 '20 at 18:02
  • If you let `rdtsc` read the TSC before `clwb` finishes, that's kind of pointless and you aren't even timing `clwb` retirement, let alone completion all the way to DRAM or wherever it goes. All you're timing is out-of-order exec of `rdtsc` after issuing `clwb`. And not even that, because `rdtscp` only makes sure previous instructions are complete; it doesn't also stop later instructions from executing before it finishes reading the time. And the bit you added is even worse: both start and stop are unordered `rdtsc` instructions. Also, why use two back-to-back `rdtsc`? Just read the time once. – Peter Cordes Mar 10 '20 at 18:06
  • See [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) for some links, especially https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf and [clflush to invalidate cache line via C function](https://stackoverflow.com/a/51830976) – Peter Cordes Mar 10 '20 at 18:07
  • @PeterCordes reading back-to-back rdtsc is kind of unnecessary, that I get. What I wanted to do was interleave the two clwb instructions, since they are not ordered with respect to each other, and measure their individual times. Using rdtscp won't help, as it will wait for the previous clwb to finish, which won't allow interleaving. Any clues as to how that can be done? – skm Mar 10 '20 at 18:22
  • What do you mean by "interleave the two clwb operations"? Also, Skylake processors don't support Optane persistent memory sticks. Perhaps you're using Cascade Lake (2nd gen Xeon SP)? – Hadi Brais Mar 10 '20 at 18:41
  • @HadiBrais As clwb instructions to different cache lines are not ordered with respect to each other, I don't want the first clwb to finish before issuing the second one. This is what I am calling interleaving. – skm Mar 10 '20 at 18:45
  • I see what you're *trying* to do (hoping that rdtsc is ordered wrt. CLWB but not wrt. anything else) but unfortunately for you that's not how anything works. If you want to let execution overlap, you can't time separately, only for the whole overlapping black-box of whatever the CPU did during that overlap. (There are perf counters like `frontend_retired.latency_ge_64` and `mem_trans_retired.load_latency_gt_64` which can tell you something on a per-instruction basis even during OoO exec, but that's very limited.) – Peter Cordes Mar 10 '20 at 18:47
  • Related: [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) – Peter Cordes Mar 10 '20 at 18:47
  • @PeterCordes thanks for the useful links. – skm Mar 11 '20 at 17:50
  • Hi, as @HadiBrais already mentioned, Skylake processors don't support Optane persistent memory. On the other hand, on the Cascade Lake architecture, NVM DIMMs offer two configuration modes: Memory mode and App Direct mode. What I wanted to add is that in App Direct mode, two `clwb` instructions are most likely ordered with respect to each other. I have observed that `clwb` is ordered with respect to a following write operation to the same cache line, which gives behavior similar to clflush; that's why it might not be entirely right to assume that you can interleave two clwbs. – Ana Khorguani Mar 23 '20 at 21:09

0 Answers