I am trying to force a user-space application to flush, from all levels of cache, every cache line that holds an array it created itself.
After reading this post (Cflush to invalidate cache line via C function) and with great guidance from @PeterCordes, I tried to come up with a C function that would let me accomplish this.
#include <x86intrin.h>
#include <stdint.h>
#include <stdlib.h>

inline void flush_cache_range(uint64_t *ptr, size_t len){
    size_t i;
    // prevent any load or store from being scheduled across
    // this point due to CPU out-of-order execution.
    _mm_mfence();
    for(i=0; i<len; i++)
        // flush the cache line that contains ptr+i from
        // all cache levels
        _mm_clflushopt(ptr+i);
    _mm_mfence();
}

int main(){
    size_t plen = 131072; // or much bigger
    uint64_t *p = calloc(plen, sizeof(uint64_t));
    for(size_t i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range(p, plen);
    // at this point, accessing any element of p should
    // cause a cache miss. As I access them, adjacent
    // elements and perhaps pre-fetched ones will come
    // along.
    (...)
    return 0;
}
I am compiling with gcc -O3 -march=native source.c -o exec.bin on an AMD Zen 2 processor running kernel 5.11.14 (Fedora 33).
I do not completely understand the difference between mfence, sfence, and lfence, or when one or the other is enough, so I just used mfence because I believe it imposes the strongest ordering restriction (am I right?).
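For reference, this is my current (possibly wrong) mental model of the three fences, written out with the intrinsics; the fence_notes wrapper is purely illustrative, and the comments are my assumptions rather than verified facts:
#include <x86intrin.h>

// my working assumptions about the three x86 fences:
void fence_notes(void){
    _mm_sfence(); // SFENCE: earlier stores become globally visible before later stores
    _mm_lfence(); // LFENCE: orders loads; I think it also serializes instruction dispatch
    _mm_mfence(); // MFENCE: orders both loads and stores, the strongest of the three
}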
My question is: am I missing something with this implementation? Will it do what I imagine it will do? (What I imagine is described in the comment after the call to flush_cache_range in main.)
Thanks.
Edit 1: flushing once per line, and removing fences.
After @PeterCordes's answer, I am making some adjustments:
- First, the function now receives a pointer to char and the length in chars; since a char is 1 byte, this gives me control over how far to jump from one flush to the next.
- Then I need to confirm the size of my cache lines. I can get that info with the cpuid program (a runtime alternative is sketched right after this list):
cpuid -1 | grep -A12 -e "--- cache [0-9] ---"
For L1i, L1d, L2, and L3 I get line size in bytes = 0x40 (64), so 64 is the number of bytes I have to skip after each flush.
- Then I determine the pointer to the last char as ptr + len - 1.
- Finally, I loop over the addresses, one per cache line, up to and including the line that holds the last char (ptr_end).
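As a sanity check for the hard-coded 64, the line size can also be queried at runtime; this is just a sketch using a glibc-specific sysconf name that I assume is available on my Fedora system:
#include <stdio.h>
#include <unistd.h>

int main(void){
    // glibc extension: L1 data cache line size in bytes (0 if unknown)
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d line size: %ld bytes\n", line);
    return 0;
}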
This is the updated version of the code:
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
#include <stdint.h>

inline void flush_cache_range(char *ptr, size_t len);

void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    char *ptr_end = ptr + len - 1;
    while(ptr <= ptr_end){
        // one flush per 64-byte line, from all cache levels
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
}

int main(){
    size_t i, sum=0, plen = 131072000; // or much bigger
    uint64_t *p = calloc(plen, sizeof(uint64_t));
    for(i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range((char*)p, sizeof(p[0])*plen);
    // there should be many cache misses now
    for(i=0; i<plen; i++){
        sum += p[i];
    }
    printf("sum is:%lu\n", sum);
    return 0;
}
Now when I compile and run it under perf:
gcc -O3 -march=native src/source.c -o exec.bin && perf stat -e cache-misses,cache-references ./exec.bin
I get:
sum is:8589934526464000
Performance counter stats for './exec.bin':
1,202,714 cache-misses:u # 1.570 % of all cache refs
76,612,476 cache-references:u
0.377100534 seconds time elapsed
0.170473000 seconds user
0.205574000 seconds sys
If I comment out the call to flush_cache_range, I get pretty much the same:
sum is:8589934526464000
Performance counter stats for './exec.bin':
1,211,462 cache-misses:u # 1.590 % of all cache refs
76,202,685 cache-references:u
0.356544645 seconds time elapsed
0.160227000 seconds user
0.195305000 seconds sys
What am I missing?
Edit 2: adding sfence and fixing the loop limit
- I added the sfence as suggested by @prl
- Changed ptr_end to point to the last byte of its cache line.
void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    // round up to the last byte of the cache line that holds the last char
    char *ptr_end = (char*)(((size_t)ptr + len - 1) | (cacheline - 1));
    while(ptr <= ptr_end){
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
    // make the flushes globally visible before returning
    _mm_sfence();
}
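To double-check the new ptr_end computation, here is a tiny standalone sketch; 0x1000 and len = 100 are made-up illustrative values, not real buffer addresses:
#include <stdio.h>
#include <stdint.h>

int main(void){
    // hypothetical buffer at 0x1000 with len = 100 bytes:
    // last byte is 0x1063; OR-ing with 63 gives 0x107f, the last byte
    // of the 64-byte line containing it, so the flush loop still
    // visits that final line.
    uintptr_t ptr = 0x1000, len = 100;
    uintptr_t last = ptr + len - 1;
    uintptr_t ptr_end = last | 63;
    printf("last byte: %#lx, ptr_end: %#lx\n",
           (unsigned long)last, (unsigned long)ptr_end);
    return 0;
}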
I still get the same unexpected results in perf.