
I am trying to force a user-space application to flush, from all cache levels, all the cache lines that hold an array it created itself.

After reading this post (Cflush to invalidate cache line via C function) and having great guidance from @PeterCordes, I tried to come up with a function in C that would allow me to accomplish this.

#include <x86intrin.h>
#include <stdint.h>
#include <stdlib.h>   // for calloc

inline void flush_cache_range(uint64_t *ptr, size_t len){
    size_t i;
    // prevent any load or store from being reordered across
    // this point by the CPU's out-of-order execution
    _mm_mfence();
    for(i=0; i<len; i++)
        // flush the cache line that contains ptr+i from 
        // all cache levels
        _mm_clflushopt(ptr+i); 
    _mm_mfence();
}

int main(){
    size_t plen = 131072; // or much bigger
    uint64_t *p = calloc(plen,sizeof(uint64_t));
    for(size_t i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range(p,plen);
    // at this point, accessing any element of p should
    // cause a cache miss. As I access them, adjacent
    // elements and perhaps pre-fetched ones will come
    // along.
    (...)
    return 0;
}

I am compiling with gcc -O3 -march=native source.c -o exec.bin on an AMD Zen 2 processor running kernel 5.11.14 (Fedora 33).

I do not completely understand the difference between mfence/sfence/lfence, or when one or the other is enough, so I just used mfence because I believe it imposes the strongest restriction (Am I right?).

My question is: Am I missing something with this implementation? Will it do what I imagine it will do? (my imagination is in a comment after calling the flush_cache_range function)

Thanks.


Edit 1: flushing once per line, and removing fences.

After @PeterCordes's answer, I am making some adjustments:

  • First, the function now receives a pointer to char and the length in chars; since a char is 1 byte, I have precise control over how far to jump from one flush to the next.

  • Then, I need to confirm the size of my cache lines. I can get that info with the cpuid program (a run-time alternative is sketched just after this list):
    cpuid -1 | grep -A12 -e "--- cache [0-9] ---"
    For L1i, L1d, L2, and L3 I get line size in bytes = 0x40 (64), so that is how many bytes I have to skip after each flush.

  • Then I determine the pointer to the last char by adding ptr + len - 1.

  • And loop across all the addresses, one in each cache line, including the last one (ptr_end).
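As an aside, the line size can also be queried at run time instead of hard-coding 64. A minimal sketch using glibc's sysconf (just an illustration of the idea, not the code I actually benchmark below):

#include <stdio.h>
#include <unistd.h>

int main(void){
    // glibc extension; may return 0 or -1 if the value isn't known
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64; // fall back to the value cpuid reported
    printf("L1d line size: %ld bytes\n", line);
    return 0;
}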

This is the updated version of the code:

#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdlib.h>   // for calloc

inline void flush_cache_range(char *ptr, size_t len);

void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    char *ptr_end = ptr + len - 1;
    while(ptr <= ptr_end){
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
}

int main(){
    size_t i, sum=0, plen = 131072000; // or much bigger
    uint64_t *p = calloc(plen,sizeof(uint64_t));
    for(i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range((char*)p, sizeof(p[0])*plen);
    // there should be many cache misses now
    for(i=0; i<plen; i++){
        sum += p[i];
    }
    printf("sum is:%lu\n", sum);
    return 0;
}

Now when I compile and run perf: gcc -O3 -march=native src/source.c -o exec.bin && perf stat -e cache-misses,cache-references ./exec.bin

I get:

sum is:8589934526464000
 Performance counter stats for './exec.bin':

         1,202,714      cache-misses:u # 1.570 % of all cache refs    
        76,612,476      cache-references:u                                          
       0.377100534 seconds time elapsed
       0.170473000 seconds user
       0.205574000 seconds sys


If I comment out the line calling flush_cache_range, I get pretty much the same:

sum is:8589934526464000

 Performance counter stats for './exec.bin':
         1,211,462      cache-misses:u # 1.590 % of all cache refs    
        76,202,685      cache-references:u                                          
       0.356544645 seconds time elapsed
       0.160227000 seconds user
       0.195305000 seconds sys

What am I missing?


Edit 2: adding sfence, and fixing loop limit

  • I added the sfence as suggested by @prl
  • Changed ptr_end to point to the last byte of its cache line.
void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    // round up so ptr_end is the last byte of the cache line holding the array's last byte
    char *ptr_end = (char*)(((size_t)ptr + len - 1) | (cacheline - 1));

    while(ptr <= ptr_end){
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
    _mm_sfence();
}

I still get the same unexpected results in perf.

    Related post: https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence – kiner_shah Jun 26 '21 at 06:37
  • Your edits look like they should be answers, not part of the question. – Peter Cordes Jun 26 '21 at 16:18
  • Hi @PeterCordes. The edits do not solve the problem; they instead provide more context and the reason I think there is a problem. That's why I didn't consider them answers. – onlycparra Jun 26 '21 at 17:50
  • 1
    Is `-o3` in your compilation command a typo? It should be `-O3` with a capital O. – Nate Eldredge Jun 26 '21 at 18:18
  • yes it is a typo, thanks. However, correcting it does not resolve the main issue. – onlycparra Jun 26 '21 at 23:16
  • 1
    `// there should be many cache misses now` - you're not doing anything to defeat hardware prefetch inside the loop, or for all 4 vector loads from the same line to all go together. IDK if Zen2 has enough bandwidth for HW prefetch to keep up with 16 bytes per clock cycle (or per a bit more than 1 if GCC's loop isn't 5 uops or less), but if so that could explain it. Using `-march=native` may help (for AVX2), and `-funroll-all-loops`. Or use clang `-O3 -march=native`; it might use multiple vector accumulators to hide 1/clock `vpaddq` latency. – Peter Cordes Jun 27 '21 at 21:08
  • Also, this part about observing cache misses in a micro-benchmark is basically a separate question from how to flush a range of cache lines. – Peter Cordes Jun 27 '21 at 21:10

1 Answer


Yes that looks correct but quite inefficient.

Your expectation of cache misses afterward (mitigated by HW prefetch) is justified. You can use perf stat to check, if you write some actual test code that uses the array later.


You run clflushopt on each separate uint64_t, but x86 cache lines are 64 bytes on every current CPU that supports clflushopt. So you're doing 8x as many flushes, and repeated flushes to the same line can be quite slow on some CPUs. (Worse than flushing more lines that were hot in cache.)

See my answer on The right way to use function _mm_clflush to flush a large struct for the general case where the array starts at unknown alignment relative to a cache line, and the array size isn't a multiple of the line size. Running clflush or clflushopt once on each cache line that contains any of an array / struct.

Flushing is idempotent except for performance, so you can just loop with 64-byte increments and also flush the last byte of the array, but in that linked answer I came up with a cheap way to implement the loop logic to touch each line exactly once. For an array pointer + length, obviously use sizeof(ptr[0]) * len instead of sizeof(struct) like that linked answer used.


Code review: API

Flushes work on whole lines. Either take a char*, or a void* which you then cast to char* to increment it by the line size. Because logically, you give the asm instruction a pointer and it flushes just the one line containing that byte.
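
Putting those two points together, here's a minimal sketch of what I mean (my illustration, not the exact code from the linked answer), assuming 64-byte lines:

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

// Flush every 64-byte line that overlaps [ptr, ptr+len), touching each line exactly once.
static void flush_range(const void *ptr, size_t len){
    if (len == 0) return;
    uintptr_t p   = (uintptr_t)ptr & ~(uintptr_t)63;  // round down to the start of the first line
    uintptr_t end = (uintptr_t)ptr + len;             // one past the last byte
    for (; p < end; p += 64)
        _mm_clflushopt((void *)p);
}

For the array in the question, the call would be flush_range(p, sizeof(p[0]) * plen).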


Memory Barrier Not Needed Before

It's pointless to mfence before flushing; clflushopt is ordered wrt. stores to the same cache line, so a store then clflushopt to the same line (in that order in asm) will happen in that order, flushing the newly stored data. The manual documents this (https://www.felixcloutier.com/x86/clflushopt is from Intel's manual; I assume AMD's manual documents the same semantics for it on their CPUs).

I think/hope that C compilers treat _mm_clflushopt(p) like at least a volatile access to the whole line containing p, and thus won't do compile-time reordering of stores to any C objects that *p could alias. (And probably not loads either.) If not, you'd want at most asm("":::"memory"), a compile-time-only barrier. (Like atomic_signal_fence(memory_order_seq_cst), not atomic_thread_fence).
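
To be concrete, such a barrier costs zero instructions. A sketch of where it would go if you wanted to be extra careful (again just an illustration, not something the intrinsic should normally need):

#include <stdint.h>
#include <x86intrin.h>

static void store_then_flush(uint64_t *p, uint64_t v){
    *p = v;
    asm("" ::: "memory");   // compile-time barrier only: emits no instruction,
                            // just stops the compiler reordering memory ops across it
    _mm_clflushopt(p);      // flushes the line holding the newly stored value
}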


I think it's also somewhat unnecessary to fence afterward, if your loop is non-tiny and all you care about is this thread getting cache misses. It's certainly useless to use sfence, which doesn't order loads at all, unlike mfence or lfence.

The normal reason to use sfence after clflushopt is to guarantee that earlier stores have made it to persistent storage, before any later stores, e.g. to make it possible to recover consistency after a crash. (In a system with Optane DC PM or other kind of truly memory-mapped non-volatile RAM). See for example this Q&A about clflushopt ordering and why sfence is sometimes needed.
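
A sketch of that persistence pattern, just to illustrate the ordering (hypothetical persist_u64 helper; assumes nv_field points into memory-mapped non-volatile RAM):

#include <stdint.h>
#include <x86intrin.h>

static void persist_u64(uint64_t *nv_field, uint64_t value){
    *nv_field = value;          // store
    _mm_clflushopt(nv_field);   // start write-back of the dirty line
    _mm_sfence();               // later stores can't become persistent before this one
}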

That doesn't force later loads to miss; they're not ordered wrt. sfence and thus can execute early, before the sfence and before the clflushopt. mfence would prevent that.
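
A sketch of that difference (my illustration, not code from the question):

#include <stdint.h>
#include <x86intrin.h>

static uint64_t flush_then_reload(uint64_t *p){
    _mm_clflushopt(p);
    _mm_mfence();   // loads can't cross mfence, so the reload below can't start
                    // until the flush completes; with only sfence it could run early
    return *p;      // expected to be a demand miss
}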

lfence (or about a ROB's worth of uops, e.g. 224 on Skylake) might also keep later loads from starting that early, but waiting for clflushopt to just retire from the out-of-order back-end doesn't mean it's done evicting the line. It might work more like a store, and have to go through the store buffer.

I tested this on my CPU, an i7-6700K (Intel Skylake):

default rel
%ifdef __YASM_VER__
    CPU Conroe AMD
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:

%if 1
    lea        rdi, [buf]
    lea        rsi, [bufsrc]
%endif

    mov     ebp, 10000000

  mov [rdi], eax
  mov [rdi+4096], edx   ; dirty a couple BSS pages
align 64
.loop:
    mov  eax, [rdi]
;    mov  eax, [rdi+8]
    clflushopt [rdi]          ; clflush vs. clflushopt doesn't make a difference here except in uops issued/executed
sfence            ; actually speeds things up
;mfence           ; the load after this definitely misses.
    mov  eax, [rdi+16]
;    mov  eax, [rdi+24]
    add rdi, 64        ; next cache line
    and rdi, -(1<<14)  ; wrap to 4 pages
    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)


section .bss
align 4096
buf:    resb 4096*4096

bufsrc:  resb 4096
resb 100
t=testloop; asm-link -dn "$t".asm && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r4 ./"$t"
+ nasm -felf64 -Worphan-labels testloop.asm
+ ld -o testloop testloop.o

...
0000000000401040 <_start.loop>:
  401040:       8b 07                   mov    eax,DWORD PTR [rdi]
  401042:       66 0f ae 3f             clflushopt BYTE PTR [rdi]
  401046:       0f ae f8                sfence 
  401049:       8b 47 10                mov    eax,DWORD PTR [rdi+0x10]
  40104c:       48 83 c7 40             add    rdi,0x40
  401050:       48 81 e7 00 c0 ff ff    and    rdi,0xffffffffffffc000
  401057:       ff cd                   dec    ebp
  401059:       75 e5                   jne    401040 <_start.loop>
...

 Performance counter stats for './testloop' (4 runs):

            334.27 msec task-clock                #    0.999 CPUs utilized            ( +-  7.62% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 3      page-faults               #    0.009 K/sec                  
     1,385,639,019      cycles                    #    4.145 GHz                      ( +-  7.68% )
        80,000,116      instructions              #    0.06  insn per cycle           ( +-  0.00% )
       100,271,634      uops_issued.any           #  299.968 M/sec                    ( +-  0.04% )
       100,257,154      uops_executed.thread      #  299.924 M/sec                    ( +-  0.04% )
            16,894      mem_load_retired.l1_hit   #    0.051 M/sec                    ( +- 17.24% )
         2,347,561      mem_load_retired.l1_miss  #    7.023 M/sec                    ( +- 14.76% )

            0.3346 +- 0.0255 seconds time elapsed  ( +-  7.62% )

That's with sfence, and it's surprisingly the fastest. The average run time is pretty variable. Using clflush instead of clflushopt doesn't change the timing much, but uses more uops: 150,185,359 uops_issued.any (fused domain) and 110,219,059 uops_executed.thread (unfused domain).

With mfence it's the slowest, and every clflush costs us two cache misses (one this iteration, right after the load, and another next iteration when we get back to that line).

## With MFENCE
     3,765,471,466      cycles                    #    4.129 GHz                      ( +-  1.26% )
        80,000,292      instructions              #    0.02  insn per cycle           ( +-  0.00% )
       160,386,634      uops_issued.any           #  175.881 M/sec                    ( +-  0.03% )
       100,533,848      uops_executed.thread      #  110.246 M/sec                    ( +-  0.06% )
             7,005      mem_load_retired.l1_hit   #    0.008 M/sec                    ( +- 21.58% )
         9,966,476      mem_load_retired.l1_miss  #   10.929 M/sec                    ( +-  0.05% )

With no fence it's still slower than with sfence. I don't know why. Perhaps sfence stops the clflush operations from executing as quickly, giving the loads in later iterations a chance to get ahead of them and both read the cache line before clflushopt evicts it?

     2,047,314,028      cycles                    #    4.125 GHz                      ( +-  2.58% )
        70,000,166      instructions              #    0.03  insn per cycle           ( +-  0.00% )
        80,619,482      uops_issued.any           #  162.427 M/sec                    ( +-  0.05% )
        80,584,719      uops_executed.thread      #  162.357 M/sec                    ( +-  0.04% )
            66,198      mem_load_retired.l1_hit   #    0.133 M/sec                    ( +-  6.61% )
         4,814,405      mem_load_retired.l1_miss  #    9.700 M/sec                    ( +-  4.59% )

These experimental results are from Intel Skylake, not AMD.

(And older or newer Intel CPUs may differ in how they allow loads to reorder with clflushopt.)

  • (I would have linked [The right way to use function \_mm\_clflush to flush a large struct](https://stackoverflow.com/q/66382562) in our discussion on an earlier question if I'd remembered that Q&A before now.) – Peter Cordes Jun 26 '21 at 03:57
  • Clflushopt is not ordered with respect to later accesses to the line, so if you want to be sure that a subsequent access misses the cache, clflushopt should be followed by an sfence. (And of course even that isn't necessarily sufficient, because the prefetcher could reload the line at any time.) – prl Jun 26 '21 at 05:52
  • Hi, I tried the suggested changes, and still getting unexpected results. I am adding the `sfence` suggested by @prl – onlycparra Jun 26 '21 at 07:53
  • @prl: The situation doesn't seem to change with the fence. – onlycparra Jun 26 '21 at 08:01
  • 1
    @prl: Hmm, If you want to be sure a subsequent load misses, wouldn't you need mfence to make sure the load doesn't happen before the sfence (and also before the clflush)? In practice on Skylake, alternating clflush / load, I get one cache miss per clflush with mfence (`mem_load_retired.l1_miss`) (like 9.987 M for 10M clflush) and a handful (~15k) of `l1_hit`. With no fence, I get ~4M +- 10% hits, and with sfence I surprisingly get 1.5 to 1.8M counts for l1d_hit, but faster performance. (~1.1G cycles vs. 1.8G with no fence vs. 3.7 with mfence. i7-6700k) https://godbolt.org/z/qeY4hv917 – Peter Cordes Jun 27 '21 at 20:35
  • @onlycparra: sfence isn't useful; you'd want `mfence` after if you wanted to make sure. Testing on Skylake confirms that some real CPUs do reorder later stores with earlier clflush, even across `sfence`. (Although on AMD, `sfence` I think has stronger documented guarantees like maybe also waiting for the store buffer to drain before later loads?) – Peter Cordes Jun 27 '21 at 20:58
  • The SDM says "Executions of the CLFLUSHOPT instruction are ... not ordered with respect to ... younger writes to the cache line being invalidated. Software can use the SFENCE instruction to order an execution of CLFLUSHOPT relative to one of those operations." It sounds like your tests show that this is incorrect? – prl Jun 27 '21 at 22:28
  • @prl: No, I'm doing later *loads*, not stores. Since this is for microbenchmarking purposes, and usually load misses are a bigger problem than store misses. (Also, the benchmark loop in the question is read-only after flushing, summing the array into a scalar.) So this only matters for performance, not correctness. (And BTW, the benchmark linked in my previous reply to you had the flush + fence in a less interesting place than in the update to my answer, in case anyone's looking at that Godbolt link.) – Peter Cordes Jun 28 '21 at 00:33
  • Perhaps turn off syntax highlighting? (```lang-none or similar) – Peter Mortensen Jul 28 '21 at 13:12
  • Won't `and rdi, -(1<<14)` cause the same cache line to be accessed in every iteration? This would generate a lot of hits in the LFBs. With `MFENCE`, you'd then get 10M load misses in the L1 and 10M hits in the LFB. When using these two events, it'd be generally useful to also measure `FB_HIT`. Wasn't your intention to access each cache line in 4 different pages in a cyclic manner 100M times? Not sure why you're only dirtying two pages then. It's fine to use the same cache line in every iteration, but you need to sandwich every `CLFLUSHOPT` with `LFENCE` or `MFENCE` on Intel... – Hadi Brais Nov 16 '21 at 06:25
  • 1
    ...(only `MFENCE` can order the flush on AMD) to deterministically show the effect of flushing. It looks like the OP didn't understand that hardware prefetching needs to be defeated for the demand loads following the flushes to miss all the way. There are multiple ways to do this. This answer shows one. Another way is to disable the prefetchers in the BIOS, if the options are available. Yet another way is to access one line in each page or to access the lines in an unpredictable order. – Hadi Brais Nov 16 '21 at 06:26
  • The point here is to check whether the lines are being flushed, so you need to use events at the L3, not L1. For this purpose, the event `MEM_LOAD_RETIRED.L3_MISS` can be used. (Note that on SKL, this event is buggy according to erratum SKL128 if counted in user mode or kernel mode, but not both. It's accurate if both user and kernel events are counted.) The events `cache-misses` and `cache-references` (used by the OP @onlycparra) can also be used on SKL, although they also count prefetches. Well, they're also buggy according to SKL057. – Hadi Brais Nov 16 '21 at 06:27
  • But on AMD, these perf events are mapped to native events at the L2, not L3, so they won't tell you whether the lines are being flushed all the way. One option on Zen2 is to measure the sum `ls_refills_from_sys.ls_mabresp_lcl_dram`+`ls_refills_from_sys.ls_mabresp_rmt_dram`, which represents the number of L1D demand fills sourced from main memory. Another issue is that the OP's system seems to had the paranoid level set to 2, which forces unqualified events to only count in user mode. This can be confusing or lead to incorrect interpretation of the results, especially with dirtying many pages. – Hadi Brais Nov 16 '21 at 06:27
  • @HadiBrais: Oh bother, yeah, `and rdi, -(1<<14)` is a brain fart; it always clears all the low bits! I wanted it to wrap the low bits at a 4-page boundary, i.e. block carry out of the low 14 bits, but that's not what that does. Probably just make sure my buffer is 8-page aligned and then clear bit #14 with `and rdi, ~(1<<14)` or something. I guess I'll have to at least re-run that benchmark, and maybe rewrite the whole answer if that invalidates a lot of my conclusions. Thanks for catching that! Re: dirtying only 2 pages to start; I probably rethought the wrap mask and didn't update earlier. – Peter Cordes Nov 16 '21 at 07:22
  • Cheers. Aligning on 8-page boundary with `rdi, ~(1<<14)` works, or you can just use one cache line. The most important thing is that the event counts should provide a satisfactory evidence of the flushes. If you want to be rigorous and provide a conclusive proof that the flushes are happening, use `perf record` and sample misses and hits. All the samples captured on the load that should always hit should be all hit events and all samples on the load that should always miss (most likely due to the demand flush) should be all miss events. – Hadi Brais Nov 18 '21 at 01:10
  • Heck, if you want to go this far then you might as well earn the complete respect of many millions of people around the world who care by checking first that `clflushopt` is supported and then obtain the flush size using `cpuid` and use increments of that size. (Yea millions is probably a slight exaggeration.) Since the OP hasn't followed up on this, they may no longer care about the answer, so you don't need to spend too much time on this anyway. – Hadi Brais Nov 18 '21 at 01:10