
I am currently trying to write a program with an L1 miss rate that is as high as possible.

To measure the L1 miss rate I am using the MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT performance counter events on an Intel Core i7 processor (I am not interested in fill buffer hits). I modified the Linux kernel to give me exact measurements at each context switch so that I can determine precisely how many hits and misses each program gets.
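(To be precise about what I mean by "miss rate", I simply divide the miss count by the total number of retired loads seen by those two events, along these lines:)

/* miss rate from the two counter values above (fill-buffer hits ignored, as mentioned) */
static double l1_miss_rate(unsigned long long misses, unsigned long long hits)
{
    return (double)misses / (double)(misses + hits);
}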

The hardware prefetcher is disabled.

This is the code that I currently have:

#include <sys/mman.h>

#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE CACHE_SIZE * 64


void main(int argc, char* argv[])
{

    volatile register char* addr asm ("r12") = mmap(0, MEM_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;

    volatile register unsigned long x64 asm ("r15") = 88172645463325252ull;

    while(1)
    {
        x64 ^= x64 << 13;
        x64 ^= x64 >> 7;
        x64 ^= x64 << 17;

        store_val = addr[x64 % MEM_SIZE];
    }
}

This code produces exactly one memory access per loop iteration, so my question is: why is the miss rate I am getting here close to 0%? Even without the xorshift, just accessing the array linearly (edit: increasing the index by 64 on each access), I should be getting a miss rate close to 100%, right? What am I missing here?
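(For reference, the linear-access variant I mean looks roughly like this — same declarations as above, only the loop changes:)

    // one new cache line per load, wrapping around at the end of the buffer
    while (1)
    {
        store_val = addr[idx];
        idx += LINE_SIZE;
        if (idx >= MEM_SIZE)
            idx = 0;
    }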

Thanks in advance! :)

Update: With callgrind I am getting the expected 99.9% miss rate when doing linear accesses. I don't understand.

Using the perf tool with:

perf stat -r 10 -B -e mem_load_retired.l1_miss,mem_load_retired.l1_hit ./thrasher

gives me results similar to the ones I'm getting from my modified kernel's output.

arn
  • does it produce the expected assembly code? – user253751 Aug 08 '22 at 16:30
  • @user253751 yes it does. Exactly one memory access, the rest is done with registers. – arn Aug 08 '22 at 16:32
  • Linearly accessing it you definitely would not get a high miss rate. Every time there is a miss, the memory controller will transfer an entire cache line. – Ben Voigt Aug 08 '22 at 16:34
  • @BenVoigt with linearly accessing I meant increasing the index by 64 on each access, so each access would need a new cache line, sorry for the imprecise wording. – arn Aug 08 '22 at 16:36
  • BTW the cache miss rate should be around 56/64, assuming you chose 8 because your cache is 8-way associative, and assuming an ideal PRNG. Each of those 8 ways contains a random one out of the 64 different cache lines that map to that set of 8 ways, so there's an 8/64 chance the cache line being accessed is in there. – user253751 Aug 08 '22 at 16:52
  • @user253751 when I repeatedly access memory that would be mapped to the same cache set (increasing the index by 4096 each time and wrapping around), shouldn't the miss-rate then be close to 100%? – arn Aug 08 '22 at 16:57
  • @arn yes, but this program is randomly accessing – user253751 Aug 08 '22 at 17:27
  • @user253751 yes, but that doesn't work either unfortunately – arn Aug 08 '22 at 17:47
  • Hardware prefetching would avoid some cache misses in a linear access pattern, especially if L1d prefetch managed to hit in L2. A stride of 4096 bytes could still defeat that; like I forget if L1d prefetch will lock on to such a large stride. And it should defeat the L2 streamer on Intel CPUs, which works only within 4k regions last I read. But yeah for random access, I'd expect conflict misses to be more of a big deal; L1d aliases on 4k intervals in Intel CPUs at least, so Eric's answer pointing out you have 1024-byte strides means only 4 x 8-way = 32 lines of your working set can fit. – Peter Cordes Aug 09 '22 at 00:52
  • @PeterCordes I disabled the hardware prefetchers, so that should not be it (also doesn't explain the high hit rate for random accesses). I ran a linear 4KB stride through callgrind and that gave me the expected miss rate. I've also tried counting hardware prefetches that hit in L2, but the number of hits were negligible. I've been trying to find an answer to this for hours, without any success :/ Really baffling. – arn Aug 09 '22 at 02:41

2 Answers


x64 % MEM_SIZE is x64 % CACHE_SIZE * 64 which is x64 % 4096 * 16 * 64 which is ((x64 % 4096) * 16) * 64 which touches at most 4096 different locations.

Put your macro replacement text in parentheses.

(I am not positive this is the whole explanation, since 4096 locations at 1024-byte intervals ought to map to the same cache sets repeatedly, but maybe this combined with the shifting and XORing produces a pattern that hits multiple times before moving on. A quick test shows the first 4096 iterations touch only 2558 locations.)
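For example, fully parenthesized versions of the defines (sizes as in the question's listing) would be:

#define LINE_SIZE  64
#define CACHE_SIZE (4096 * 8)
#define MEM_SIZE   (CACHE_SIZE * 64)  /* x64 % MEM_SIZE now expands to x64 % (CACHE_SIZE * 64) */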

Eric Postpischil
  • Thank you for the hint! Would've definitely not spotted that. Unfortunately that didn't solve the problem :/ – arn Aug 08 '22 at 16:51

UPDATE:

I finally found the issue! Really a rather stupid mistake: I assume Linux has some sort of zero-page deduplication going on, so because I did not initialize the array, all the accesses were presumably mapped to the same physical page, obviously leading to a high hit rate.

I updated my code like so:

#define PAGE 4096
#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE (CACHE_SIZE * 512)

char array[MEM_SIZE] = {[0 ... MEM_SIZE-1] = 5};


void main(int argc, char* argv[])
{
    volatile register char* addr asm ("r12") = array;

    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;

    volatile register unsigned long cnt asm ("r15") = 0;

    while(cnt++ < 100000000)
    {
        idx += PAGE;
        idx %= MEM_SIZE;

        store_val += addr[idx];
    }
}

and am now getting my desired ~100% miss rate.
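For completeness: an alternative that should also work while keeping the original mmap allocation is to write to every page once right after mapping, so that each page is backed by its own physical frame instead of the shared zero page. A minimal, untested sketch of that variant (with <sys/mman.h> and <string.h> included):

    char* addr = mmap(0, MEM_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    memset(addr, 5, MEM_SIZE);   /* one write per page forces real physical allocation */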

Thanks everyone for weighing in and trying to help, much appreciated!

arn
  • Not full de-duplication, just that anonymous pages (including the BSS) get copy-on-write mapped to the same physical zero page, if the first access is a read. Putting the array in `.data` so it's a file-backed private mapping defeats that; a non-zero initializer for even the first element would do it. (Because toolchains aren't smart enough to have an array that spans the .data/.bss boundary.) – Peter Cordes Aug 09 '22 at 06:19
  • See [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/a/57130924) for details. Added a mention of that to my answer on [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Aug 09 '22 at 06:35
  • @PeterCordes yes! Exactly what I meant, we called it zero page deduplication at my Uni :D Thanks for those links, very interesting! – arn Aug 09 '22 at 13:13
  • Also, great hint about only needing to initialize the first element. I didn't even try that ^^ – arn Aug 09 '22 at 13:18
    Normally "deduplication" implies hashing / scanning to look for separate pages that happen to be holding the same data. (Or that are all zero). Avoiding making them separate in the first place could be called an anti-duplicate mechanism to avoid duplication in the first place, or just sharing. You *could* call it deduplication, but I'd personally avoid that because that term that already has a more specific meaning / usual implementation in a storage / memory context (https://en.wikipedia.org/wiki/Data_deduplication and [Linux KSM](https://en.wikipedia.org/wiki/Kernel_same-page_merging)). – Peter Cordes Aug 09 '22 at 13:43
  • Making all the elements non-zero isn't worse; it only makes the source slightly more complicated thanks to C having good initializer syntax. The executable will be exactly the same size. And if you want to write to it, there's actually a [*hardware* optimization in Skylake that is specific to storing zeros to already-zeroed memory](https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html), probably because so much software uses inefficient stuff like C++ std::vector that can't ask for already-zeroed memory so always has to write zeros itself if resizing. Not relevant to your RO case – Peter Cordes Aug 09 '22 at 13:49