I am trying to understand the read/write performance of the clwb instruction, and how it differs between writing to a cache line and only reading it. I expect the write case to take longer than the read case. To test this, here is a small code snippet that I am running on an Intel Xeon (Skylake) CPU, using non-volatile memory (NVM) as the backing store for the reads and writes:
/* nvm_alloc allocates memory on NVM */
uint64_t *array = (uint64_t *) nvm_alloc(pool, 512);
uint64_t *p = &array[0];
/* separated p & q by the size of write unit in Optane (256B) */
uint64_t *q = &array[32];
uint64_t time_1 = 0;
uint64_t time_2 = 0;
uint64_t start;
volatile uint64_t x;
for (int i = 0; i < 1000000; i++)
{
    /* serialize with an mfence before each iteration */
    mfence();

    /* read case (commented out): bring the cache line containing p into the cache */
    //x = *p;

    /* write case: dirty the cache lines containing p and q */
    *p = *p + 1;
    *q = *q + 1;

    /* rdtscp waits for all earlier instructions to complete before reading the TSC */
    start = rdtscp();
    /* issue clwb on the cache line containing p */
    clwb(p);
    time_1 += rdtsc() - start;

    start = rdtsc();
    /* issue clwb on the cache line containing q */
    clwb(q);
    time_2 += rdtsc() - start;
}
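For completeness, mfence(), clwb(), rdtsc() and rdtscp() are thin wrappers around the compiler intrinsics. A minimal sketch of how I have them defined, assuming GCC/Clang with the _mm_mfence, _mm_clwb, __rdtsc and __rdtscp intrinsics and compilation with -mclwb:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <immintrin.h>   /* _mm_mfence, _mm_clwb */

static inline void mfence(void)      { _mm_mfence(); }
static inline void clwb(void *addr)  { _mm_clwb(addr); }
static inline uint64_t rdtsc(void)   { return __rdtsc(); }

static inline uint64_t rdtscp(void)
{
    unsigned int aux;
    /* waits for all earlier instructions to execute before reading the TSC */
    return __rdtscp(&aux);
}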
Since clwb does not explicitly evict the cache line, the reads in subsequent iterations can presumably be served from the cache itself. In the write case, however, the cache line is modified in every iteration and clwb is then issued to write it back. Yet the time measured for the write case is almost equal to that of the read case, which I am unable to explain. Should the time for the write not include the time to write back the dirty cache line to memory (or to the memory controller)?
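One variant I have been considering, to try to make the measurement cover the writeback itself, is pairing each clwb with an sfence before reading the end timestamp, on the assumption that the fence completes only after the flushed line has been accepted by the memory subsystem. The measure_clwb_fenced helper below is hypothetical (not part of the code above), and whether the fence really exposes the writeback latency is part of what I am asking:

static inline uint64_t measure_clwb_fenced(void *line)
{
    unsigned int aux;
    uint64_t start, end;

    start = __rdtscp(&aux);   /* wait for earlier instructions before starting the timer */
    _mm_clwb(line);           /* request writeback of the (possibly dirty) line */
    _mm_sfence();             /* assumed to complete only once the writeback is accepted */
    end = __rdtscp(&aux);     /* wait for clwb + sfence before reading the TSC again */
    return end - start;
}

In the loop this would replace the clwb/rdtsc pairs, e.g. time_1 += measure_clwb_fenced(p); and time_2 += measure_clwb_fenced(q);.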