
I'm trying to write code that intentionally has a very low L1 d-cache hit rate, and here it is:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define S 16*1024*1024
int largedata[S];

int main()
{
    struct timeval tv1;
    struct timeval tv2;

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {
                    largedata[i] = largedata[rand()%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

    return 0;
}

What it does: it has a 64 MB buffer (16M ints) and accesses it at random offsets. The L1 d-cache of my machine is 32 KB, so I expected a very low cache-hit ratio, but what I actually see is a 2.57% miss rate, as follows:

# gcc test.c && perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores ./a.out
Round 0 elapsed 399706 us --> 26671607
Round 1 elapsed 344664 us --> 57210118
Round 2 elapsed 342444 us --> 79296375
Round 3 elapsed 344605 us --> 92293029
Round 4 elapsed 342173 us --> 93904234
Round 5 elapsed 346295 us --> 76478386
Round 6 elapsed 343390 us --> 98878844
Round 7 elapsed 347893 us --> 107286968
Round 8 elapsed 355442 us --> 87289283
Round 9 elapsed 362253 us --> 101320374

 Performance counter stats for './a.out':

     104128257      L1-dcache-load-misses     #    2.57% of all L1-dcache accesses
       2630683      L1-icache-load-misses
    4047961891      L1-dcache-loads
    2192632892      L1-dcache-stores

   3.539076619 seconds time elapsed

   3.479520000 seconds user
   0.032746000 seconds sys

I'd expect it to be much higher (like 20~30%). Can anyone explain how this x86 (Xeon) does this magic?

BTW, my CPU: Intel(R) Xeon(R) Gold 6230

UPDATE

That xorshift and compiling with optimization enabled worked like a charm!
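The xorshift128 state is just four uint32_t variables declared and seeded before the loop; they aren't shown here, and any nonzero seed values work, for example:

#include <stdint.h>   /* for uint32_t */

/* xorshift128 state -- example seeds; any nonzero values work */
uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123;

With that in place, the loop becomes: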

    for (int j = 0; j < 10; j++)
    {
            gettimeofday(&tv1, NULL);
            int total = 0;
            for (int i = 1; i < S; i++) {

                    uint32_t t = x;
                    t ^= t << 11U;
                    t ^= t >> 8U;
                    x = y; y = z; z = w;
                    w ^= w >> 19U;
                    w ^= t;

                    largedata[i] = largedata[w%S] + 1;
                    total += largedata[i];
            }
            gettimeofday(&tv2, NULL);
            int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
            printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
    }

This results in a 47% cache-miss rate. I guess calling rand() accounts for roughly 30 memory accesses per iteration, as Brendan mentioned.

 Performance counter stats for './a.out':

      87715381      L1-dcache-load-misses     #   47.35% of all L1-dcache accesses
       1064131      L1-icache-load-misses
     185235862      L1-dcache-loads
     177250983      L1-dcache-stores

   0.715463900 seconds time elapsed

   0.664886000 seconds user
   0.028505000 seconds sys
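Doing rough arithmetic on the counters (10 rounds of ~16.8M iterations, i.e. ~1.7e8 iterations in total): the original run did about 24 loads and 13 stores per iteration with ~0.62 L1d load misses per iteration, while this run does about 1.1 loads and 1.1 stores per iteration with ~0.52 misses. The misses per iteration barely changed; what disappeared are the ~23 extra cache-friendly loads per iteration from rand() and the un-optimized code, which were diluting the ratio.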
  • Working backwards we can say 1 out of 40 accesses are L1D misses; so for every `largedata[rand()%S]` there are 39 L1D hits, so there's possibly 30+ L1D hits when calling `rand()`. – Brendan Jun 16 '22 at 21:02
  • Looks like you compiled without optimization, so a lot of useless extra memory accesses were added to the program that are polluting the statistic – harold Jun 16 '22 at 22:53
  • Look at the asm for your loop, and see how many L1d accesses are happening per every "interesting" access. As harold says, enable optimization. And inline your own simple PRNG so there's no `call` (push a return address) / `ret` (pop a return address) or load/store of the PRNG seed; it can stay in a register. Like a simple `xorshift*`. – Peter Cordes Jun 17 '22 at 00:59
  • Also, you could do all your accesses to the same set of L1d, accessing memory multiples of 4k apart. But then you'd have 4k aliasing slowdowns from false dependency between stores and loads that are actually to different addresses, at the same offset within different pages. (Although different dwords within the same 64-byte cache line wouldn't have this problem.) – Peter Cordes Jun 17 '22 at 01:01
  • Just `rand()` alone wasn't ~30 accesses. With optimization disabled, GCC would [store/reload everything between every C statement](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point), so your original loop has multiple loads of `i`, and of `total`. But yes, significant extra loads while calling a library function even though it was compiled with optimization enabled. – Peter Cordes Jun 18 '22 at 03:51
  • Your last edit should have been posted as an answer instead, along with a summary of comments. It seems like the question now answers itself, so isn't a question anymore. – Peter Cordes Jun 18 '22 at 08:22
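For reference, here is a minimal sketch of the "same set of L1d" idea from Peter Cordes's comment above, written as a drop-in replacement for the timed inner loop (total is still printed by the existing printf, which keeps the loads from being optimized away; the rep count is arbitrary). It assumes this CPU's L1d is 32 KiB, 8-way, with 64-byte lines, i.e. 64 sets, so addresses 4 KiB apart map to the same set:

/* Stride of 1024 ints = 4 KiB, so every load maps to the same L1d set.
   With thousands of such lines touched per pass and only 8 ways per set,
   nearly every load should be an L1d conflict miss. */
for (int rep = 0; rep < 1000; rep++)
        for (int i = 0; i < S; i += 1024)      /* 1024 * sizeof(int) = 4 KiB */
                total += largedata[i];

As the comments also note, this should be compiled with optimization enabled (e.g. gcc -O2 test.c) so the loop itself doesn't add extra L1d accesses, and adding stores at the same 4 KiB offsets would pile 4k-aliasing stalls on top of the misses.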
