I am currently trying to write a program with an L1 miss rate that is as high as possible.
To measure the L1 miss rate I am using the MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT performance counter events on an Intel Core i7 processor (I am not interested in fill buffer hits). I modified the Linux kernel to give me exact measurements at each context switch so that I can determine precisely how many hits and misses each program gets.
The hardware prefetcher is disabled.
This is the code that I currently have:
#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE CACHE_SIZE * 64
void main(int argc, char* argv[])
{
volatile register char* addr asm ("r12") = mmap(0, MEM_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
volatile register unsigned long idx asm ("r13") = 0;
volatile register unsigned long store_val asm ("r14") = 0;
volatile register unsigned long x64 asm ("r15") = 88172645463325252ull;
while(1)
{
x64 ^= x64 << 13;
x64 ^= x64 >> 7;
x64 ^= x64 << 17;
store_val = addr[x64 % MEM_SIZE];
}
}
This code produces exactly one memory access per loop iteration, so my question is: Why is the miss rate that I am getting here close to 0%? Even without the xorshift and just linearly accessing the array (edit: increasing the index by 64 on each access) I should be getting a close to 100% miss rate, right? What am I missing here?
Thanks in advance! :)
Update: With callgrind I am getting the expected 99.9% miss rate when doing linear accesses. I don't understand.
Using the perf-tool with:
perf stat -r 10 -B -e mem_load_retired.l1_miss,mem_load_retired.l1_hit ./thrasher
gives me similar results to the one I'm getting using my modified kernel's output.