I am wondering is there any single event that can capture the L1D cache misses. I tried to capture L1d cache miss by measuring latency to access specific memory with rdtsc at the beginning. On my setting, if the L1d cache miss happens, it should hit L2 cache. Therefore I measure latency of accessing memory with RDTSC and compare it with L1 cache latency and L2 cache latency. However, because of the noise, I cannot discern whether it hits L1 or L2. So I decided to use RDPMC.
I found that several APIs provide some functions to monitor perf events easily, but I would like to use RDPMC instruction directly on my test program. I found that MEM_INST_RETIRED.ALL_LOADS-MEM_LOAD_RETIRED.L1_HIT can be used to count the number of retired load instructions that miss in the L1D.(counting L1 cache misses with PAPI_read_counters gives unexpected results). However, it seems that this posting talks about the papi Api.
How can I find what values should be assigned for ecx register before executing rdpmc instruction to capture specific events?? Also, I am wondering is there any single event that can tell me L1 miss happens for one memory load instruction in between two rdpmc instructions back to back like below.
c = XXX; //I don't know what value should be assigned for what perf counter..
asm volatile(
"lfence"
"rdpmc"
"lfence"
"mov (0xdeadbeef), %%r10"//read memory
"mov %%eax, %%r10 //read lower 32 bits of counter
"lfence"
"rdpmc" //another rdpmc to capture difference
"sub %%r10, %%eax //sub two counter to get difference
:"=a"(a)
:"c"(c)
:"r10", "edx");
I am currently using 9900k coffee lake machine, so I searched perf counter number for coffee lake machine in the intel manual. It seems that just capturing two MEM_LOAD_RETIRED.L1_HIT before and after the load instruction is enough to capture the event, but I am not sure whether it is okay to do so.. Also I don't know well how to encode that perf event as ecx register.
Lastly, I am wondering does the rdpmc instruction back-to-back require any serialization instructions. In my case, because I only put the load instruction and measure the L1d cache miss happens or not, I enclose the first rdpmc instruction with lfence instruction and put one more lfence instruction before last rdpmc to make sure the load instruction finish before second rdpmc.
Added code
asm volatile (
"lfence\n\t"
"rdpmc\n\t"
"lfence\n\t"
"mov %%eax, %%esi\n\t"
//measure
"mov (%4), %%r10\n\t"
"lfence\n\t"
"rdpmc\n\t"
"lfence\n\t"
"sub %%esi, %%eax\n\t"
"mov %%eax, (%0)\n\t"
:
:"r"(&perf[1]), "r"(&perf[2]), "r"(&perf[3]),
"r"(myAddr), "c"(0x0)
:"eax","edx","esi","r10", "memory");
Also I pinned my core number 3 with isolcpu and disable hyperthreading for testing. MSR register has been figured with below command
sudo wrmsr -p 3 0x186 0x4108D1 #L1 MISS