How to flush a range of address in CPU cache?

Question

I want to test the performance of a userspace program on Linux running on x86. To measure the performance, I need to flush specific cache lines to memory (make sure those lines are invalidated, so that the next access to them is a cache miss).

I've already seen suggestions to use cacheflush(2), which is supposed to be a system call, yet g++ complains that it is not declared. Also, I cannot use clflush_cache_range, which apparently can be invoked only from kernel code. What I have tried so far is the following code:

static inline void clflush(volatile void *__p)
{
    asm volatile("clflush %0" : "+m" (*(volatile char __force *)__p));
}

But this gives the following error upon compilation:

error: expected primary-expression before ‘volatile’

Then I changed it as follows:

static inline void clflush(volatile void *__p)
{
    asm volatile("clflush %0" :: "m" (__p));
}

It compiled successfully, but the timing results did not change. I suspect the compiler removed it as an optimization. Does anyone have any idea how I can solve this problem?

C++ *usually* operates at a much higher level than this. This is not something you are usually supposed to concern yourself with as far as the language is concerned. — Jesper Juhl, May 08 '19 at 18:17
*g++ complains about it is not being declared.* – `#include `? — Swordfish, May 08 '19 at 18:26
Use `_mm_clflush` from `emmintrin.h` and keep in mind the flush may take time. — Filip Dimitrovski, May 08 '19 at 18:50
@Jesper, the *language* isn't concerned about performance, but real implementations are, and this user is, so I'm not sure that is relevant. — prl, May 08 '19 at 19:17
@JesperJuhl, Thanks. I'm not concerned about the language. I'm measuring the performance of some computation on x86 vs ARM. In the ARM version the data is not cached, while it is cached in the x86 version. So, I need to flush those cache lines to have a fair comparison. — Hoda, May 08 '19 at 19:47
@Swordfish, Thanks! when I include this header file, I receive the following error: fatal error: asm/cachectl.h: No such file or directory — Hoda, May 08 '19 at 19:49
@FilipDimitrovski, Thanks! I tried _mm_clflush from x86intrin.h. It compiled correctly, but the timing results are the same. I'm starting to suspect that the x86 prefetcher does a pretty good job, although I cannot prove it. — Hoda, May 08 '19 at 19:53
It's impossible to answer this question without showing the whole code. — Hadi Brais, May 09 '19 at 16:22
We need also to know the processor on which you're running code and the exact command used to compile it. — Hadi Brais, May 09 '19 at 16:26

score 2 · Accepted Answer · answered May 08 '19 at 19:23

The second one flushes the memory containing the pointer __p, which is on the stack, which is why it doesn’t have the effect you want.

The problem with the first one is that it uses the macro __force, which is defined in the Linux kernel and is unneeded here. (What does the __attribute__((force)) do?)

If you remove __force, it will do what you want.

(You should also change it to avoid using the variable name __p, which is a reserved identifier.)
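As a rough sketch of the points above, the first attempt with `__force` removed and the parameter renamed might look like the following. The range helper `flush_range`, the name `clflush_line`, and the 64-byte `kCacheLine` constant are illustrative assumptions, not something from the question; the real line size is reported by CPUID, and the fences around the loop are a conservative choice rather than a requirement stated here.

```cpp
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2, x86 only)
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, which is typical for current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// The first attempt from the question with __force removed. The memory
// operand is the byte that p points to, not the pointer variable itself.
static inline void clflush_line(volatile void *p)
{
    asm volatile("clflush %0" : "+m" (*(volatile char *)p));
}

// Hypothetical helper: flush every cache line that overlaps
// [addr, addr + len) and return how many lines were flushed.
static inline std::size_t flush_range(const void *addr, std::size_t len)
{
    if (len == 0)
        return 0;

    // Round the start address down to a line boundary.
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(addr)
                       & ~static_cast<std::uintptr_t>(kCacheLine - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(addr) + len;

    std::size_t lines = 0;
    _mm_mfence();  // keep earlier loads/stores ahead of the flushes
    for (; p < end; p += kCacheLine) {
        _mm_clflush(reinterpret_cast<const void *>(p));
        ++lines;
    }
    _mm_mfence();  // keep the flushes ahead of the timed accesses
    return lines;
}
```

Calling `flush_range(array, sizeof array)` right before the timed region would then evict the whole buffer rather than the stack slot that holds the pointer.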

prl

If you are flushing a large number of cache lines, you may want to change it to use `clflushopt` (if your processor supports it), with an `sfence` following all of the flushes. – prl May 08 '19 at 19:28
Thanks. I removed the __force, and it compiled, but the results are the same. As I mentioned above, I'm starting to suspect that the x86 prefetcher does a pretty good job. – Hoda May 08 '19 at 19:58
This would be an even better answer if you pointed out that `_mm_clflush` should be used instead. And that `asm volatile` means it's definitely not optimized away. – Peter Cordes May 08 '19 at 22:39
@HodaAghaeikhouzani: HW prefetch does work very well for sequential access, but it can't keep up with touching 1 `int` per cache line or something, with a 64-byte stride. If your actual code goes a lot slower than that, like several clock cycles per cache line, then yes it might well keep up even from DRAM. – Peter Cordes May 08 '19 at 22:49
@PeterCordes, Thanks! The code makes sequential accesses to a big array, and that's why flushing the caches doesn't affect the latency. – Hoda May 09 '19 at 16:29
