
We are trying to use the Intel CLFLUSH instruction to flush the cached content of a process, from user space on Linux.

We wrote a very simple C program that first accesses a large array and then calls CLFLUSH to flush the virtual address range of the whole array. We measure the latency it takes CLFLUSH to flush the whole array. The size of the array is an input to the program, and we vary it from 1MB to 40MB in steps of 2MB.

In our understanding, CLFLUSH should flush content out of the cache. So we expect the latency of flushing the whole array to first increase linearly with the array size, and then stop increasing once the array is larger than 20MB, which is the size of the LLC of our processor.

However, the experimental result is quite surprising, as shown in the figure: the latency does not stop increasing once the array size exceeds 20MB.

We are wondering whether CLFLUSH could bring an address into the cache before flushing it out, in the case where the address is not in the cache yet. We also searched the Intel Software Developer's Manual and didn't find any explanation of what CLFLUSH does when an address is not in the cache.

[Figure: latency (seconds) to flush the whole array vs. array size, for the Read Only and Read and Write data below]

Below is the data we used to draw the figure. The first column is the size of the array in KB, and the second column is the latency of flushing the whole array in seconds.

Any suggestion/advice is more than appreciated.

[Modified]

The code posted previously is unnecessary: CLFLUSH can be issued much more easily in user space, with similar performance, so I deleted the messy code to avoid confusion.
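
For reference, below is a minimal sketch of the user-space form of the experiment. This is not the original program; the `_mm_clflush` intrinsic from `immintrin.h`, `clock_gettime` timing, and the 64-byte line size reported by /proc/cpuinfo are assumed details.

```c
/* Minimal sketch of the user-space experiment: touch a large array,
 * then time how long it takes to clflush every cache line of it. */
#include <immintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64          /* clflush size from /proc/cpuinfo */

int main(int argc, char **argv)
{
    size_t kb = argc > 1 ? strtoull(argv[1], NULL, 10) : 1024;
    size_t size = kb * 1024;
    char *buf = malloc(size);
    if (!buf)
        return 1;

    /* Touch every line first: a write gives the "Read and Write"
     * scenario; a plain read would give the "Read Only" case. */
    for (size_t i = 0; i < size; i += LINE)
        buf[i] = 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i += LINE)
        _mm_clflush(buf + i);
    _mm_mfence();        /* wait for the flushes to complete */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%zu,%.11f\n", kb, s);
    free(buf);
    return 0;
}
```

Running it with the array size in KB as the argument (e.g. `./flush 39936`) over the range of sizes produces data in the format shown below.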

SCENARIO=Read Only
1024,.00158601000000000000
3072,.00299244000000000000
5120,.00464945000000000000
7168,.00630479000000000000
9216,.00796194000000000000
11264,.00961576000000000000
13312,.01126760000000000000
15360,.01300500000000000000
17408,.01480760000000000000
19456,.01696180000000000000
21504,.01968410000000000000
23552,.02300760000000000000
25600,.02634970000000000000
27648,.02990350000000000000
29696,.03403090000000000000
31744,.03749210000000000000
33792,.04092470000000000000
35840,.04438390000000000000
37888,.04780050000000000000
39936,.05163220000000000000

SCENARIO=Read and Write
1024,.00200558000000000000
3072,.00488687000000000000
5120,.00775943000000000000
7168,.01064760000000000000
9216,.01352920000000000000
11264,.01641430000000000000
13312,.01929260000000000000
15360,.02217750000000000000
17408,.02516330000000000000
19456,.02837180000000000000
21504,.03183180000000000000
23552,.03509240000000000000
25600,.03845220000000000000
27648,.04178440000000000000
29696,.04519920000000000000
31744,.04858340000000000000
33792,.05197220000000000000
35840,.05526950000000000000
37888,.05865630000000000000
39936,.06202170000000000000
Mike
  • Unfortunately Agner Fog didn't test `clflush` for his instruction tables. Presumably it has a significant cost in uops or a limited throughput even when there's nothing to actually do. You should look at perf counters (with `perf`). ocperf.py is a nice wrapper around `perf`, which adds symbolic names for uop counters. – Peter Cordes Mar 09 '16 at 23:26
  • @PeterCordes, but why does the latency increase when there is nothing to do? I'm posting the code by editing the question; hopefully it may reveal some issue. – Mike Mar 10 '16 at 02:32
  • I don't have any ideas about the performance yet, but from looking at the code, you could have used `_mm_clflush(void const *p)` from `immintrin.h` to emit a clflush. Or used `volatile char*cp = p; asm volatile ("clflush %0" :: "m"(*cp));` [to let the compiler use whatever addressing mode it wants](http://goo.gl/0E2Y6c). That also avoids breakage if you compile with `-masm=intel`. Linux [does it this way, but with the operand as a read-write output operand](http://lxr.free-electrons.com/source/arch/x86/include/asm/special_insns.h#L196). (Both forms are sketched after these comments.) – Peter Cordes Mar 10 '16 at 03:51
  • I see Linux's in-kernel `clflush_cache_range` is optimized for Skylake, and [includes a memory barrier before/after the clflush loop](http://lxr.free-electrons.com/source/arch/x86/mm/pageattr.c#L130), because it uses a function which it hot-patches to use `clflushopt` instead of `clflush` if the CPU supports `clflushopt`. Memory barriers aren't free; perhaps some of the cost you're seeing is from this? I guess you got similar results with user-space, too, though. If so, the cost of memory barriers doesn't explain it, since you don't use `MFENCE` in your user-space version. – Peter Cordes Mar 10 '16 at 03:57
  • BTW, why do you want this? If you expect that most of cache should be flushed, you could use `wbinvd` (although that's a privileged instruction). It would be interesting to compare. I was going to test this with perf counters on my Sandybridge machine, but the source you posted doesn't compile. I stuck it on [godbolt with the user-space version uncommented, and -Werror to flag missing function definitions](http://goo.gl/SMRXMU), and got tons of compile errors. Since you have a working version, I'll just wait for you to post it. – Peter Cordes Mar 10 '16 at 04:09
  • Are you testing on a virtualized system, or is Linux running on the bare metal? And what hardware: Haswell Xeon? You should add that to the question, since right now it's just in a comment on Leeor's answer. – Peter Cordes Mar 10 '16 at 04:12
  • @PeterCordes, I'm running on bare metal. It's Haswell Xeon 2618L v3 processor. The CLFLUSH size is 64B according to /proc/cpuinfo. I'm creating a repo to hold all code to conduct the exp. – Mike Mar 10 '16 at 04:31
  • @PeterCordes, I created a repo to hold all of the code needed to run the experiment. The repo is at https://bitbucket.org/pennpanda/clflush . You may need to compile the kernel and boot into the customized kernel before you run the ca_spin.c userspace program. I really appreciate your help! Tomorrow, we will also try to use clflush only to flush the cache and avoid the MFENCE, to see if that solves the mystery... – Mike Mar 10 '16 at 05:00
  • I'll probably just test running `clflush` from userspace. It's easier to profile that way, and it's easy to do it on the whole array. It removes a whole layer of complication, if we're not trying to flush the process's code, too. I'm not sure how the TLB works in syscall context. Does the kernel use the same page tables as userspace to access user memory? My wild guess is "yes", but if not, doing it in-kernel will generate extra TLB misses. – Peter Cordes Mar 10 '16 at 05:08
  • Your repo isn't public. Even after signing up and agreeing to let bitbucket see my email address and profile info, it said: "You do not have access to this repository." github is the usual choice for free repos, because it doesn't require anyone to sign up before doing a checkout. – Peter Cordes Mar 10 '16 at 05:09
  • @PeterCordes, I'm really sorry: bitbucket.org by default marks a repo as private. :-( I should have checked this on another machine without my account. Now I have made it public and it should work (I double-checked). (BTW, I used bitbucket because it has an unlimited number of private repos, while github would charge me for that. ;-) ) – Mike Mar 10 '16 at 15:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/105924/discussion-between-mike-xu-and-peter-cordes). – Mike Mar 10 '16 at 15:42
  • @MikeXu, I'll confess that I'm absolutely fascinated by your cache questions ever since I answered [this one](http://stackoverflow.com/questions/21265785/enable-disable-cache-on-intel-64bit-machine-cd-bit-always-set) more than 2 years ago. Keep them coming! – Iwillnotexist Idonotexist Mar 19 '16 at 04:20
  • @IwillnotexistIdonotexist wow, how could you remember the question I asked two years ago! Amazing! – Mike Mar 19 '16 at 19:18
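
To put the two ways of emitting clflush from the comments in one place, here is a sketch (the helper names are hypothetical; the fence pair around the range loop mirrors what the kernel's `clflush_cache_range` does, and whether you need it depends on what you are measuring):

```c
#include <immintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>

/* Portable form: the intrinsic. */
static inline void flush_line(void const *p)
{
    _mm_clflush(p);
}

/* Inline-asm form from the comments: the "m" constraint lets the
 * compiler pick the addressing mode, and it survives -masm=intel. */
static inline void flush_line_asm(void const *p)
{
    volatile char const *cp = p;
    asm volatile ("clflush %0" : : "m"(*cp));
}

/* Range flush in the spirit of the kernel's clflush_cache_range():
 * barrier, flush each 64-byte line, barrier. The fences matter once
 * the weakly-ordered clflushopt is substituted for clflush. */
static void flush_range(void const *p, size_t len)
{
    char const *cp = p;
    _mm_mfence();
    for (size_t i = 0; i < len; i += 64)
        _mm_clflush(cp + i);
    _mm_mfence();
}
```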

2 Answers


You want to look at the new optimization guide for Skylake. Intel came out with another version of clflush, called CLFLUSHOPT, which is weakly ordered and would perform much better in your scenario.

See Section 7.5.7 here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

In general, CLFLUSHOPT throughput is higher than that of CLFLUSH, because CLFLUSHOPT orders itself with respect to a smaller set of memory traffic as described above and in Section 7.5.6. The throughput of CLFLUSHOPT will also vary. When using CLFLUSHOPT, flushing modified cache lines will experience a higher cost than flushing cache lines in non-modified states. CLFLUSHOPT will provide a performance benefit over CLFLUSH for cache lines in any coherence states. CLFLUSHOPT is more suitable to flush large buffers (e.g. greater than many KBytes), compared to CLFLUSH. In single-threaded applications, flushing buffers using CLFLUSHOPT may be up to 9X better than using CLFLUSH with Skylake microarchitecture.

The section also explains that flushing modified data is slower, which obviously comes from the writeback penalty.

As for the increasing latency, are you measuring the overall time it takes to go over the address range and clflush each line? In that case you're linearly dependent on the array size even when it passes the LLC size: whether or not a line is actually present, each clflush still has to be processed by the execution engine and the memory unit, and look up the entire cache hierarchy.
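
A rough back-of-the-envelope from the posted Read Only data points the same way (64B lines, so 1MB is 16,384 lines): between the 1MB and ~19MB points the latency grows by about 0.0154s over roughly 295,000 additional lines, i.e. about 52ns (~120 cycles at 2.3GHz) per line while the array still fits in the LLC; between ~19MB and ~39MB the slope is about 106ns (~240 cycles) per line, even though most of those lines cannot be in the cache. So the knee at the LLC size is a change of slope, not a plateau, and clflush pays a substantial per-line cost either way.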

Leeor
  • I agree that clflush will go through the execution engine and MMU. However, if we look at the Read Only line in the figure, when the array goes beyond the LLC size boundary, the latency increases faster than when the array is smaller. Does that mean clflush takes more time to "flush" an address that's not in the cache? This is quite surprising to me... – Mike Mar 09 '16 at 20:33
  • What CPU did you run on? Could this be a cross-socket/NUMA effect? Also, please post the code (or at least a simple version). – Leeor Mar 09 '16 at 22:53
  • @MikeXu: Maybe TLB misses? Unlikely, because you probably got anon hugepages from malloc. It still has to translate the virtual address to a physical one before the cache can tell whether the address is cached. Like I commented on the question, check perf counters. Do you `clflush` in the order you wrote the array, or in reverse order? In reverse order, the first ~20MiB would still hit in cache. – Peter Cordes Mar 09 '16 at 23:30
  • @Leeor, I'm running on an Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz. This machine does have a NUMA architecture, with two NUMA nodes, but I'm wondering what cross-socket/NUMA effect could cause this behavior? I'm adding the simple version of the code to the question now. – Mike Mar 10 '16 at 02:29
  • @PeterCordes, we probably didn't flush the cache in the order we wrote the array. We wrote the array in random order, but we flush the cache for the task in increasing order of the linear addresses in the vma of the task_struct inside the kernel. As for TLB misses, I found that Haswell processors (which mine is) have 1K L2 TLB entries, which can cover 1K * 4KB (page size) = 4MB. So if it were TLB misses, we should see the latency slope bump at a 4MB array size instead of 20MB. Am I right? – Mike Mar 10 '16 at 03:13

This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.


I didn't get around to testing locally to look into the difference between the hot and cold cache cases, but I did come across a performance number for clflush:

This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that's for the same address repeatedly, or what.

So clflush isn't anywhere near free, even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized, because it's usually not a big part of the CPU's workload.

Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), measured throughput was:

  • clflush: one per ~66.42 cycles
  • clflushopt: one per ~56.33 cycles

Perhaps clflushopt is more of a win when some of the lines are actually dirty cache that needs flushing, maybe when L3 is busy from other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It's ~15% faster in this case, which is not bad.
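
For the record, back-to-back throughput numbers like the ones above can be approximated from user space along these lines. This is a sketch, assuming `__rdtsc` from `x86intrin.h` and repeated flushes of one resident line; the TSC counts reference cycles and there's no serialization inside the loop, so treat the result as a ballpark figure.

```c
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { ITERS = 1000000 };
    char *line = aligned_alloc(64, 64);  /* one cache-line buffer (C11) */
    if (!line)
        return 1;
    *line = 1;                           /* make the line present */

    _mm_mfence();
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < ITERS; i++)
        _mm_clflush(line);               /* same address every time */
    _mm_mfence();
    unsigned long long t1 = __rdtsc();

    printf("~%.2f reference cycles per clflush\n",
           (double)(t1 - t0) / ITERS);
    free(line);
    return 0;
}
```

Pinning the process to one core and averaging over several runs would tighten the number, but it is enough to see the rough per-clflush cost.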

Peter Cordes
  • I confirmed from the data in the question that clflush for RW on the Xeon 2618L v3 takes 91ns to flush a cache line, which is consistent with your data. I guess the insn latency provided in the link above also measures clflush latency based on a mix of R and W requests. I guess you are right! clflush may take more work than we thought to flush a cache line... :-( – Mike Mar 13 '16 at 15:58
  • @MikeXu: Those are throughputs, *not* latencies. To measure latency, maybe load from the cache line after clflush? (A minimal version of that is sketched after these comments.) The other thing you could measure about `clflush`, which that benchmark didn't, is how much impact it has on surrounding code, i.e. does a `clflush` every 100 `add` instructions reduce the throughput of the `add`s? Or loads/stores instead of adds. This is probably mostly determined by how many uops `clflush` takes. It's probably quite a few. Most slow operations are multi-uop; it's pretty much only `divps` / `sqrtps` that's single-uop but not fully pipelined. – Peter Cordes Mar 13 '16 at 16:22
  • Well apparently `clflush` and `clflushopt` _can_ be nearly free (e.g., a cycle or two per line), as long as the size of the flushed area is quite small. See the graph in [this answer](https://stackoverflow.com/a/44430799/149138). So the behavior is really quite weird: cheap, and then skyrocketing costs after a few K. Your tests and the other tests finding > 50 cycles presumably used these larger buffers, or there was some other difference such as the cache line not being present in some level of the hierarchy. – BeeOnRope Jul 03 '17 at 20:43
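
Following up on the load-after-clflush idea from the comments above, here is a minimal sketch of that latency measurement. Assumed details: `__rdtsc` and `_mm_lfence` from `x86intrin.h`; the result includes timing overhead and counts reference cycles, so it is only a ballpark miss latency.

```c
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence, _mm_lfence */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    volatile char *line = aligned_alloc(64, 64);
    if (!line)
        return 1;
    *line = 1;                         /* bring the line into the cache */

    _mm_clflush((char const *)line);   /* evict it from the hierarchy */
    _mm_mfence();                      /* wait for the flush to finish */
    _mm_lfence();                      /* keep rdtsc after the fences */

    unsigned long long t0 = __rdtsc();
    (void)*line;                       /* the timed load: a miss to DRAM */
    _mm_lfence();                      /* don't let rdtsc pass the load */
    unsigned long long t1 = __rdtsc();

    printf("~%llu reference cycles for the reload\n", t1 - t0);
    free((void *)line);
    return 0;
}
```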