5

If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?

As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.

Alexis Wilke
Convery
  • Not on any x86 machines I know of. There is no way to turn off cache coherency either, and prior attempts at doing so have proven to be [quite destructive](https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/). Likewise, there is no way to manually control which data is cached where. – fuz Jul 28 '18 at 18:33
  • If you just need 64 bytes fast, 4 x XMM registers would hold them for you... It would probably be faster and much easier to write code reading from memory, assuming that 99% of the time it will be in the L1 cache anyway. – Alexis Wilke Jul 28 '18 at 22:14

2 Answers

6

No, x86 doesn't let you do this. You can force eviction with clflushopt, or (on upcoming CPUs) force a write-back without eviction with clwb, but you can't pin a line in cache or disable coherency.


You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode, which disables sync with the memory controller and disables ever writing back the data. See Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot, before the memory controllers are configured. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.

I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.


Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).

So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).

As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.


You normally don't need this.

If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.

All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.

However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.

I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just taking a contiguous range of address bits, which would be a power-of-2 modulo). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate evictions from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.

See also Which cache mapping technique is used in intel core i7 processor?


@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases. Using ymm registers as a "memory-like" storage location. You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (How to inform GCC to not use a particular register)

Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.

Peter Cordes
  • You're right that a frequently accessed line is unlikely to be evicted. But, as discussed in my answer, things like thread scheduling, SMT, and interrupts can still get the line evicted. I don't know why the OP wants to do that, but I think the question is interesting from a technical point of view. I'm not sure how useful "Cache-as-Ram" is in this case; I've not heard of it before. – Hadi Brais Jul 28 '18 at 22:00
  • This is a relatively recent [patent](https://patents.google.com/patent/WO2017105685A1) from Intel on an LRU policy shared by multiple cache levels. I also found other patents and research papers. – Hadi Brais Jul 28 '18 at 22:20
  • @HadiBrais: no-fill mode is almost certainly *not* useful here (because it's not a per-line thing), but it's one of the few ways to do weird things with cache on x86. I added a bit more about it in an update. – Peter Cordes Jul 28 '18 at 22:48
  • Unfortunately, I could not find any article that says which cache replacement policies are used at each cache level in Haswell or later processors. This [paper](https://arxiv.org/pdf/1507.06955.pdf) says on page 5 that Haswell and Skylake use the same policy as Ivy Bridge, but they cite a 2007 paper. So I don't think the authors are sure of that. – Hadi Brais Jul 28 '18 at 22:57
  • @PeterCordes hello, you mentioned that clwb writes back a cache line without eviction. Has this been tested somehow? In a lot of articles they say the same, that clwb will not evict the cache line after flushing, but the Intel documentation says: hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases may invalidate the line from the cache hierarchy. I have tested it a bit and it seems to me that it evicts all the cache lines after flushing, so now I wonder in what case it does not evict them. – Ana Khorguani Feb 26 '19 at 19:03
  • @AnaKhorguani: I don't have any HW that supports CLWB. Neither do you, unless you have an Ice Lake engineering sample. (Check the CPUID feature bit mentioned [in the manual](https://www.felixcloutier.com/x86/clwb)). Maybe on your (skylake?) hardware, `66 0F AE /6 CLWB` decodes the same as `NFx 66 0F AE /7 CLFLUSHOPT`? It's the same opcode except for a `/6` vs. `/7` in the modrm `/r` field. The CLWB manual says it's supposed to `#UD` if the CPUID feature bit isn't set, but that might not be accurate. Or maybe current implementations on whatever CPU you have choose to run it as clflushopt. – Peter Cordes Feb 26 '19 at 19:11
  • @PeterCordes yes sorry, of course I don't have it :) I am using one of the servers I have access for my project, with skylake, I checked with lscpu and in flags clwb was included, also it's quite new, that's why I assume it supports clwb. – Ana Khorguani Feb 26 '19 at 19:38
  • @AnaKhorguani: interesting. My earlier research suggested `clwb` would be new in IceLake. But yes, according to http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeX_InstLatX64.txt, SKX has CLWB too. So CLWB *allows* the CPU to keep the line around, but doesn't *require* it. Maybe an implementation that actually *does* take advantage of that hint will be new in IceLake? Forcing the line to make it to non-volatile DIMM is still the architecturally-guaranteed part of the behaviour, and not evicting it is an optional extra performance hint. – Peter Cordes Feb 26 '19 at 19:54
  • @PeterCordes ah, I get the point. however the main advantage of clwb in my opinion is that it should not evict the cache line. there is clflush and clflushopt that also force cache line to be written back to memory. so what's the point of adding clwb if there is no gain. But maybe these two work only with DRAM and not with NVM? does not sound sensible though. Also I think I checked and even if the cache line is not modified it's still evicted. well in any case thank you very much for answering, I see if I can figure out more but your suggestion about implementation, sadly, seems sensible. – Ana Khorguani Feb 26 '19 at 20:17
3

There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:

  • The line is shared and can therefore be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared; that is, it will stay in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
  • The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.

There are other reasons why the line may get evicted from the L1 as I will discuss shortly.

If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that would lead to faulty behavior, then the only thing you can do is set the affinity of all threads that share the line so that they run on the same physical core of a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted for other reasons.

Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).

Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. So if you play with the affinity and priority of the threads, performance can change in interesting ways; you'll have to measure it. Recent AMD processors also support SMT.

If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.

Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to the other cores, except the periodic timer interrupt, whose handler's behavior is OS-dependent and which may not be safe to play with. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. You should also avoid exceptions.

I can think of two reasons why a line might get invalidated:

  • A (potentially speculative) RFO (read-for-ownership, i.e., a read with intent to modify) from another core.
  • The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
    • The L1 cache placement policy.
    • The L1 cache replacement policy.
    • Whether lower level caches are inclusive or not.

The replacement policy is commonly not configurable, so you should strive to avoid conflict misses in the L1. That depends on the placement policy, which in turn depends on the microarchitecture. On Intel processors, the L1D is typically both virtually and physically indexed, because the index bits fall within the page offset and don't require translation. Since you know the virtual addresses of all memory accesses, you can determine which cache set each line would be allocated in. You need to make sure that the number of lines mapped to the same set (including the line you don't want evicted) does not exceed the associativity of the cache; otherwise, you'd be at the mercy of the replacement policy. Note also that the L1D prefetchers can change the contents of the cache; you can disable them on Intel processors and measure the impact either way. I cannot think of an easy way to deal with inclusive lower-level caches.

I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.

The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.

Hadi Brais
  • With Intel's inclusive L3, conflict evictions in L3 may force evictions in L1d. I'm not sure if/how L3 tracks LRU / MRU to avoid evicting lines that are very hot in a private L1d and never generate any L3 traffic from that core for that line. This is one downside to inclusive caches, and another reason why L3 has to be highly associative. (Since IvB, L3 has an adaptive replacement policy to help reduce evictions from touching lots of data that doesn't get reused: http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/, but IDK if that can help with this.) – Peter Cordes Jul 28 '18 at 21:37
  • @PeterCordes Excellent point. Even though the L2 is private like the L1, it has a different placement policy (different organization and physically indexed), and so an inclusive L2 may also force evictions in the L1 due to conflicts in the L2 but not the L1. – Hadi Brais Jul 28 '18 at 21:48
  • L2 is NINE, it's the shared L3 that's inclusive in Intel since Nehalem. So eviction could potentially be triggered by pressure from other cores. – Peter Cordes Jul 28 '18 at 21:49