I'm trying to write an intentional code that has very low L1 d-cache hit rate and here it goes:
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#define S 16*1024*1024
int largedata[S];
int main()
{
struct timeval tv1;
struct timeval tv2;
for (int j = 0; j < 10; j++)
{
gettimeofday(&tv1, NULL);
int total = 0;
for (int i = 1; i < S; i++) {
largedata[i] = largedata[rand()%S] + 1;
total += largedata[i];
}
gettimeofday(&tv2, NULL);
int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
}
return 0;
}
What it does is, it has 64MB of buffer and randomly accesses. The L1 D-cache size of my machine is 32KB so I expect very low cache-hit ratio but what I see is 2.57% miss-rate as following:
# gcc test.c && perf stat -e L1-dcache-load-misses -e L1-icache-load-misses -e L1-dcache-loads -e L1-dcache-stores ./a.out
Round 0 elapsed 399706 us --> 26671607
Round 1 elapsed 344664 us --> 57210118
Round 2 elapsed 342444 us --> 79296375
Round 3 elapsed 344605 us --> 92293029
Round 4 elapsed 342173 us --> 93904234
Round 5 elapsed 346295 us --> 76478386
Round 6 elapsed 343390 us --> 98878844
Round 7 elapsed 347893 us --> 107286968
Round 8 elapsed 355442 us --> 87289283
Round 9 elapsed 362253 us --> 101320374
Performance counter stats for './a.out':
104128257 L1-dcache-load-misses # 2.57% of all L1-dcache accesses
2630683 L1-icache-load-misses
4047961891 L1-dcache-loads
2192632892 L1-dcache-stores
3.539076619 seconds time elapsed
3.479520000 seconds user
0.032746000 seconds sys
I'd expect way more higher (like 20~30%) and can anyone explain how this X86 (xeon) do this magic?
BTW, my CPU: Intel(R) Xeon(R) Gold 6230
UPDATE
That xorshift and the optimization worked like a charm!
for (int j = 0; j < 10; j++)
{
gettimeofday(&tv1, NULL);
int total = 0;
for (int i = 1; i < S; i++) {
uint32_t t = x;
t ^= t << 11U;
t ^= t >> 8U;
x = y; y = z; z = w;
w ^= w >> 19U;
w ^= t;
largedata[i] = largedata[w%S] + 1;
total += largedata[i];
}
gettimeofday(&tv2, NULL);
int elapsed = 1000000 * (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec);
printf("Round %d elapsed %d us --> %d\n", j, elapsed, total);
}
Which results in 47% of cache miss. I guess calling rand takes about ~30 memory accesses like Brendan mentioned.
Performance counter stats for './a.out':
87715381 L1-dcache-load-misses # 47.35% of all L1-dcache accesses
1064131 L1-icache-load-misses
185235862 L1-dcache-loads
177250983 L1-dcache-stores
0.715463900 seconds time elapsed
0.664886000 seconds user
0.028505000 seconds sys