I am interested in flushing cache (L1, L2, and L3) only for a region of address space, for example all cache entries from address A to address B. Is there a mechanism to do so in Linux, either from user or kernel space?

- And what is your CPU? Do you want to run "flush" from user space or from kernel space? – osgx Mar 27 '14 at 23:45
- User space would be great, but kernel space is OK too. I am doing a study, so I need some info for both x86 and ARM. I'd suppose they don't have the same mechanism (at least the underlying implementation/instruction would not be the same). – aminfar Mar 28 '14 at 00:37
5 Answers
Check this page for the list of available flushing methods in the Linux kernel: https://www.kernel.org/doc/Documentation/cachetlb.txt
Cache and TLB Flushing Under Linux. David S. Miller
There is a set of range-flushing functions:
```
2) flush_cache_range(vma, start, end);
   change_range_of_page_tables(mm, start, end);
   flush_tlb_range(vma, start, end);

3) void flush_cache_range(struct vm_area_struct *vma,
                          unsigned long start, unsigned long end)

   Here we are flushing a specific range of (user) virtual
   addresses from the cache.  After running, there will be no
   entries in the cache for 'vma->vm_mm' for virtual addresses in
   the range 'start' to 'end-1'.
```
You can also check the implementation of the function - http://lxr.free-electrons.com/ident?a=sh;i=flush_cache_range
For example, on ARM - http://lxr.free-electrons.com/source/arch/arm/mm/flush.c?a=sh&v=3.13#L67
```c
 67 void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 68 {
 69         if (cache_is_vivt()) {
 70                 vivt_flush_cache_range(vma, start, end);
 71                 return;
 72         }
 73
 74         if (cache_is_vipt_aliasing()) {
 75                 asm(    "mcr    p15, 0, %0, c7, c14, 0\n"
 76                 "       mcr     p15, 0, %0, c7, c10, 4"
 77                     :
 78                     : "r" (0)
 79                     : "cc");
 80         }
 81
 82         if (vma->vm_flags & VM_EXEC)
 83                 __flush_icache_all();
 84 }
```

- Really great info. I appreciate it. I was wondering how I can estimate the exec time of flush_cache_range w/o running it on actual hardware. For example, a really rough estimate could be: (number_cache_lines_to_flush * time_to_flush_each_cache_line). I know it won't be that easy, but if you can shed some light, it would be great. – aminfar Mar 28 '14 at 15:59
- aminfar, this estimation depends on the exact CPU (its microarchitecture), and is hard for anyone who is not an ARM insider. Also, I fear that x86 has no partial cache flushing (only TLB flushing, but I don't know about partial TLB flush). – osgx Mar 28 '14 at 18:13
- @aminfar, On x86 you could probably use [clflush](http://x86.renejeschke.de/html/file_module_x86_id_30.html) in inline assembly and loop over the address range – Leeor Mar 28 '14 at 22:58
- @aminfar, it will be hard to estimate due to the activity of DMA and/or GPU. – rsaxvc Jan 14 '15 at 04:09
- (Personal research) Does `flush_tlb_range` work as advertised by the name, only flushing a small region of virtual memory when needed (instead of needing to flush the entire TLB)? Not exactly related to everything else in here, but more on hypothesizing higher-performance Meltdown workarounds :p – Paul Stelian Jan 19 '18 at 16:05
- @PaulStelian, the flush_tlb_range function flushes only TLB tables, and Meltdown does its data leak with the cache, which is not flushed by it. And `flush_cache_range` is a no-op for x86/i386/amd64 https://elixir.free-electrons.com/linux/v4.14/ident/flush_cache_range https://elixir.free-electrons.com/linux/v4.14/source/include/asm-generic/cacheflush.h#L15 and it probably can't flush L2 or L3 (or would take a long time to do so; variants of Meltdown can be tuned to check the L2 or L3 cache). The problem is the race between starting the actual memory op and checking permissions; it can't be fixed by flushes, so do KPTI. – osgx Jan 20 '18 at 18:57
- @PaulStelian, check also my new question about ignoring caches for some or all memory accesses. – osgx Jan 20 '18 at 20:23
- I was thinking about doing only local, targeted remaps of the kernel address space, so that exiting kernel mode is cheaper. Messing about with the cache is a tad too late. – Paul Stelian Jan 22 '18 at 14:44
- (And slightly higher memory costs, as parts of the paging tables are to be duplicated between the cores) – Paul Stelian Jan 22 '18 at 14:45
- @osgx I'm not sure if that allows you to essentially never cache the kernel memory without taking a performance hit in the user space of other processes. – Paul Stelian Jan 22 '18 at 14:54
This is for ARM.
GCC provides `__builtin___clear_cache`, which should perform the `cacheflush` syscall. However, it may have its caveats.
The important thing here is that Linux provides a system call (ARM-specific) to flush caches. You can check Android/Bionic flushcache for how to use this system call. However, I'm not sure what kind of guarantees Linux gives when you call it, or how it is implemented through its inner workings.
This blog post, Caches and Self-Modifying Code, may help further.

- The first link says it's only for the instruction cache; not sure it's what the OP needed – Leeor Mar 28 '14 at 22:54
- @Leeor Linux code doesn't explicitly say that, that's why I've linked it. – auselen Mar 28 '14 at 23:13
- If you want the behaviour of `cacheflush`, you should definitely call that directly. Calling a builtin with weaker behaviour guarantees because it currently happens to be implemented on top of the stronger function you want seems like a Bad Idea. – Peter Cordes Oct 23 '16 at 03:21
In the x86 version of Linux you can also find a function `void clflush_cache_range(void *vaddr, unsigned int size)`, which is used for the purpose of flushing a cache range. This function relies on the `CLFLUSH` or `CLFLUSHOPT` instructions. I would recommend checking that your processor actually supports them, because in theory they are optional.
`CLFLUSHOPT` is weakly ordered. `CLFLUSH` was originally specified as ordered only by `MFENCE`, but all CPUs that implement it do so with strong ordering wrt. writes and other `CLFLUSH` instructions. Intel decided to add a new instruction (`CLFLUSHOPT`) instead of changing the behaviour of `CLFLUSH`, and to update the manual to guarantee that future CPUs will implement `CLFLUSH` as strongly ordered. For this use, you should `MFENCE` after using either, to make sure that the flushing is done before any loads from your benchmark (not just stores).
Actually x86 provides one more instruction that could be useful: `CLWB`. `CLWB` flushes data from cache to memory without (necessarily) evicting it, leaving it clean but still cached. (`clwb` on SKX does evict like `clflushopt`, though.)
Note also that these instructions are cache coherent. Their execution will affect all caches of all processors (processor cores) in the system.
All three of these instructions are available in user mode. Thus, you can employ assembler (or intrinsics like `_mm_clflushopt`) and create your own `void clflush_cache_range(void *vaddr, unsigned int size)` in your user-space application (but do not forget to check their availability before actual use).
If I understand correctly, it is much more difficult to reason about ARM in this regard. The family of ARM processors is much less consistent than the family of IA-32 processors. You can have one ARM with full-featured caches, and another one completely without caches. Furthermore, many manufacturers can use customized MMUs and MPUs. So it is better to reason about some particular ARM processor model.
Unfortunately, it looks like it will be almost impossible to perform any reasonable estimation of the time required to flush some data. This time is affected by too many factors, including the number of cache lines flushed, out-of-order execution of instructions, the state of the TLB (because the instruction takes a virtual address as an argument, but caches use physical addresses), the number of CPUs in the system, the actual load in terms of memory operations on the other processors in the system, how many lines from the range are actually cached by processors, and finally the performance of the CPU, memory, memory controller, and memory bus. As a result, I think the execution time will vary significantly across different environments and loads. The only reasonable way is to measure the flush time on a system and under a load similar to the target.
And a final note: do not confuse memory caches and the TLB. They are both caches, but organized in different ways and serving different purposes. The TLB caches only the most recently used translations between virtual and physical addresses, not the data pointed to by those addresses.
And the TLB is not coherent, in contrast to memory caches. Be careful, because flushing TLB entries does not lead to the flushing of the corresponding data from the memory cache.

- CLFLUSH is now defined as strongly ordered. The version of the Intel manual on [felixcloutier.com](http://www.felixcloutier.com/x86/CLFLUSH.html) describes it the way you did (and is missing an entry for CLFLUSHOPT), but a newer version [on hjlebbink.github.io/x86doc/ matches Intel's official PDF](https://hjlebbink.github.io/x86doc/html/CLFLUSH.html), saying it's ordered wrt other CLFLUSHes, and writes, etc, with the footnote that *Earlier versions of this manual... All processors implementing the CLFLUSH instruction also order it relative to the other operations enumerated above.* – Peter Cordes Oct 23 '16 at 03:33
- This is why CLFLUSHOPT exists, and why Linux uses it when available. – Peter Cordes Oct 23 '16 at 03:34
Several people have expressed misgivings about `clear_cache`. Below is a manual process to evict the cache, which is inefficient but possible from any user-space task (in any OS).
PLD/LDR
It is possible to evict caches by mis-using the `pld` instruction. The `pld` will fetch a cache line. In order to evict a specific memory address, you need to know the structure of your caches. For instance, a Cortex-A9 has a 4-way data cache with 8 words per line. The cache size is configurable to 16KB, 32KB, or 64KB. So that is 512, 1024, or 2048 lines. The ways are always insignificant to the lower address bits (so sequential addresses don't conflict). So you will fill a new way by accessing memory at offset + cache size / ways. So that is every 4KB, 8KB, and 16KB for a Cortex-A9.
Using `ldr` in 'C' or 'C++' is simple. You just need to size an array appropriately and access it.
See: Programmatically get the cache line size?
For example, if you want to evict 0x12345, the line starts at 0x12340, and for a 16KB round-robin cache a `pld` on 0x13340, 0x14340, 0x15340, and 0x16340 would evict any value from that way. The same principle can be applied to evict the L2 (which is often unified). Iterating over the entire cache size will evict the whole cache. You need to allocate unused memory the size of the cache to evict the entire cache; this might be quite large for the L2. `pld` doesn't need to be used, just a full memory access (`ldr`/`ldm`). For multiple CPUs (threaded cache eviction) you need to run the eviction on each CPU. Usually the L2 is global to all CPUs, so it only needs to be run once.
NB: This method only works with LRU (least recently used) or round-robin caches. For pseudo-random replacement, you will have to write/read more data to ensure eviction, with the exact amount being highly CPU specific. The ARM random replacement is based on an LFSR that is 8-33 bits, depending on the CPU. For some CPUs it defaults to round-robin, and others default to the pseudo-random mode. For a few CPUs, a Linux kernel configuration will select the mode (ref: CPU_CACHE_ROUND_ROBIN). However, for newer CPUs, Linux will use the default from the boot loader and/or silicon. In other words, it is worth the effort to try to get the `clear_cache` OS calls to work (see other answers) if you need to be completely generic, or you will have to spend a lot of time clearing the caches reliably.
Context switch
It is possible to circumvent the cache by fooling an OS via the MMU, on some ARM CPUs and particular OSes. On a *nix system, you need multiple processes. You need to switch between processes, and the OS should flush the caches. Typically this will only work on older ARM CPUs (ones not supporting `pld`), where the OS should flush the caches to ensure no information leakage between processes. It is not portable and requires that you understand a lot about your OS.
Most explicit cache-flushing registers are restricted to system mode to prevent denial-of-service type attacks between processes. Some exploits can try to gain information by seeing which lines have been evicted by some other process (this can give information about what addresses another process is accessing). These attacks are more difficult with pseudo-random replacement.

In x86, to flush the entire cache hierarchy you can use `native_wbinvd()`, which is defined in arch/x86/include/asm/special_insns.h. If you look at its implementation, it simply calls the WBINVD instruction:

```c
static inline void native_wbinvd(void)
{
        asm volatile("wbinvd": : :"memory");
}
```

Note that you need to be in privileged mode to execute the WBINVD x86 instruction. This is in contrast to the CLFLUSH x86 instruction, which clears a single cache line and doesn't need the caller to be in privileged mode.
If you look at the x86 Linux kernel code, you will see only a handful of uses of this instruction (6 places as I write this). This is because it slows down all entities running on that system. Imagine running it on a server with a 100MB LLC: this instruction would mean moving the entire 100+ MB from cache to RAM. Furthermore, it was brought to my notice that this instruction is non-interruptible, so its usage could significantly impact the determinism of an RT system, for example.
(Though the original question asks how to clear a specific address range, I thought info on clearing the entire cache hierarchy would also be useful for some readers.)

- Even worse, `wbinvd` is not itself interruptible, so it's very bad for interrupt latency. This is almost always the wrong solution, except for performance experiments or other experimental or toy usage. Plus it flushes all caches on all cores. – Peter Cordes Jun 21 '19 at 01:46
- Great point @PeterCordes w.r.t. the non-interruptible nature of this instruction. I will update the answer to reflect this. – jithu83 Jun 24 '19 at 22:50