2

Say I have the de facto standard x86 CPU with 3 levels of cache, L1/L2 private and L3 shared among cores. Is there a way to allocate shared memory whose data will not be cached in the private L1/L2 caches, but only in L3? I don't want to fetch data from memory (that's too costly), but I'd like to experiment with performance with and without bringing the shared data into private caches.

The assumption is that L3 is shared among the cores (presumably a physically indexed cache) and thus will not incur any false sharing or cache line invalidation for heavily used shared data.

Any solution (if it exists) would have to be done programmatically, using C and/or assembly, for Intel-based CPUs (relatively modern Xeon architectures such as Skylake or Broadwell) running a Linux-based OS.

Edit:

I have latency-sensitive code which uses a form of shared memory for synchronization. The data will be in L3, but when it is read or written it will go into L1/L2, depending on the cache inclusivity policy. By implication of the problem, the data will then have to be invalidated, adding an unnecessary (I think) performance hit. I'd like to see if it's possible to store the data only in L3, either through some page policy or special instructions.

I know it's possible to use the special memory-type registers to inhibit caching for security reasons, but that requires CPL0 privilege.

Edit2:

I'm dealing with parallel codes that run on high-performance systems for months at a time. The systems are high core-count systems (e.g. 40-160+ cores) that periodically perform synchronization which needs to execute in microseconds.

Peter Cordes
janjust
  • What kind of application (in user-land) justifies such a question? Please motivate it more and give additional context. – Basile Starynkevitch Nov 03 '17 at 14:49
  • Please see edit 2. – janjust Nov 03 '17 at 15:00
  • If you have an application with these kinds of real-time requirements, you shouldn't be using a Linux PC. – Lundin Nov 03 '17 at 15:07
  • Thanks for the input, it's not a Linux PC, I mentioned that in my edits I think, it's a high performance system. All the development is happening on that system. Linux based, sure, but the vast majority of these codes run some variant of linux. – janjust Nov 03 '17 at 15:15
  • Edit 2 doesn't explain what kind of application you are coding, and why nanoseconds of synchronization matter. For example, most HPC code running for months (simulation of galaxy collisions, or of nuclear warheads) doesn't need that, AFAIK. – Basile Starynkevitch Nov 03 '17 at 15:54
  • BTW, If you are running on some [TOP500](http://www.top500.org) super computers (I have colleagues at [CEA](http://www.cea.fr) using one) you probably have some support engineer knowing very well your precise hardware, and you'll better ask him. My blind guess is that you should not really care. – Basile Starynkevitch Nov 03 '17 at 16:03
  • Yes, all caches other than L1 (and the uop cache) are physically indexed / tagged in all Intel and AMD CPUs. (And in any normal CPU design.) L1I/D are VIPT, but associative enough that they behave as PIPT anyway (no homonym/synonym problems), but with index and TLB in parallel. – Peter Cordes Nov 03 '17 at 17:04

4 Answers

3

x86 has no way to do a store that bypasses or writes through L1D/L2 but stops at L3. There are NT stores which bypass all cache. Anything that forces a write-back to L3 also forces write-back all the way to memory. (e.g. a clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent DMA, where it's important to get data committed to actual RAM.
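As a quick reference, here is roughly what those options look like with intrinsics. This is just a sketch (the wrapper names are made up, and clwb needs a CPU and compiler with CLWB support, e.g. gcc -mclwb); as noted, both go all the way to DRAM rather than stopping at L3:

```c
#include <immintrin.h>

/* NT store: write 16 bytes (dst must be 16-byte aligned), bypassing L1D/L2/L3 entirely. */
static void nt_publish16(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_stream_si128((__m128i *)dst, v);
    _mm_sfence();            /* order the NT store before later "ready flag" stores */
}

/* clwb: force write-back of the line holding p (keeps a clean cached copy) -- all the way to DRAM. */
static void writeback_line(const void *p)
{
    _mm_clwb((void *)p);     /* requires CLWB; compile with -mclwb */
    _mm_sfence();            /* sfence orders clwb if you need visibility guarantees */
}
```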

(Update: Tremont / Sapphire Rapids have cldemote. Earlier hardware runs it as a NOP, so it's usable as a hint.)

There's also no way to do a load that bypasses L1D (except from WC memory like video RAM with SSE4.1 movntdqa, but it's not "special" on other memory types). prefetchNTA can bypass L2, according to Intel's optimization manual. (And bypass L3 on Xeons with non-inclusive L3 cache.)
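In C those look roughly like this (a sketch; `_mm_stream_load_si128` needs SSE4.1, and again it only actually avoids the caches on WC memory):

```c
#include <immintrin.h>

/* SSE4.1 NT load: only bypasses the caches on WC memory (e.g. video RAM);
   on normal WB memory it behaves like an ordinary load. p must be 16-byte aligned. */
static __m128i nt_load16(void *p)
{
    return _mm_stream_load_si128((__m128i *)p);
}

/* NTA prefetch hint: may skip L2 (and L3 on Xeons with non-inclusive L3), per the manuals cited above. */
static void prefetch_nta(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_NTA);
}
```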

Prefetch on the core doing the read should be useful to trigger write-back from the other core into L3, and transfer into your own L1D. But that's only useful if you have the address ready before you want to do the load. (It needs dozens to hundreds of cycles of lead time to be useful.)

Intel CPUs use a shared inclusive L3 cache as a backstop for on-chip cache coherency. 2-socket has to snoop the other socket, but Xeons that support more than 2P have snoop filters to track cache lines that move around. (This is describing up to Broadwell Xeon v4, not the redesign for Skylake and later Xeon Scalable.)

When you read a line that was recently written by another core, it's always Invalid in your L1D. L3 is tag-inclusive, and its tags have extra info to track which core has the line. (This is true even if the line is in M state in an L1D somewhere, which requires it to be Invalid in L3, according to normal MESI.) Thus, after your cache-miss checks L3 tags, it triggers a request to the L1 that has the line to write it back to L3 cache (and maybe to send it directly to the core that wants it).

Skylake-X (Skylake-AVX512) doesn't have an inclusive L3 (it has a bigger private L2 and a smaller L3), but it still has a tag-inclusive structure to track which core has a line. It also uses a mesh instead of a ring, and L3 latency seems to be significantly worse than Broadwell. (Especially in first-gen Skylake; I think less bad in Ice Lake and later Xeons.)


Possibly useful: map the latency-critical part of your shared memory region with a write-through cache policy. IDK if this patch ever made it into the mainline Linux kernel, but see this patch from HP: Support Write-Through mapping on x86. (The normal policy is WB.)

Also related: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, an in-depth look at latency and bandwidth on 2-socket SnB, for cache lines in different starting states.

For more about memory bandwidth on Intel CPUs, see Enhanced REP MOVSB for memcpy, especially the Latency Bound Platforms section. (Having only 10 LFBs limits single-core bandwidth).


Related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? has some experimental results for having one thread spam writes to a location while another thread reads it.

Note that the cache miss itself isn't the only effect. You also get a lot of machine_clears.memory_ordering from mis-speculation in the core doing the load. (x86's memory model is strongly ordered, but real CPUs speculatively load early and abort in the rare case where the cache line becomes invalid before the load was supposed to have "happened".)
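If you want to measure that effect, perf can count the machine clears directly. A sketch of the command (event names here are from Skylake-era event lists and can differ by microarchitecture; `./your_benchmark` is a placeholder for your own binary):

```sh
perf stat -e machine_clears.memory_ordering,machine_clears.count ./your_benchmark
```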



Peter Cordes
  • Thank you very much for the answer, very very useful info. – janjust Nov 03 '17 at 20:09
  • In the comment above you said that intel CPU L1 caches are physically indexed, if that's the case, how come a read/write to shared memory will cause an invalidation? Maybe I just answered my own question. Invalidation happens when 2 cores have the same physical address being written to in the private cache, in skylake that's L1/L2. – janjust Nov 03 '17 at 20:11
  • @janjust: For the two logical cores sharing a physical core, writes don't cause invalidations, just a potential memory-order mis-speculation on the other logical core. See also https://stackoverflow.com/questions/32979067/what-will-be-used-for-data-exchange-between-threads-are-executing-on-one-core-wi for some CPU-architecture stuff about store buffer vs. L1D, and how out-of-order execution makes this less simple than you might naively expect. – Peter Cordes Nov 03 '17 at 20:15
3

You won't find good ways to disable use of L1 or L2 for Intel CPUs: indeed, outside of a few specific scenarios such as UC memory areas covered in Peter's answer (which will kill your performance since they don't use L3 either), the L1 in particular is fundamentally involved in reads and writes.

What you can do, however, is to use the fairly well-defined cache behavior of L1 and L2 to force evictions of data you only want to live in L3. On recent Intel architectures, both the L1 and L2 behave as pseudo-LRU "standard associative" caches. By "standard associative" I mean the cache structure you'd read about on wikipedia or in your hardware 101 course where a cache is divided into 2^N sets which have M entries (for an M-way associative cache) and N consecutive bits from the address are used to look up the set.

This means you can predict exactly which cache lines will end up in the same set. For example, Skylake has an 8-way 32K L1D and a 4-way 256K L2. This means cache lines 64K apart will fall into the same set on the L1 and L2. Normally having heavily used values fall into the same cache set is a problem (cache set contention may make your cache appear much smaller than it actually is) - but here you can use it to your advantage!
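To make the "64K apart" claim concrete, here is the set-index arithmetic (a sketch using client-Skylake cache parameters; the helper functions are only illustrative):

```c
/* L1D: 32 KiB / 8 ways / 64 B lines = 64 sets   -> index = address bits 6..11
   L2:  256 KiB / 4 ways / 64 B lines = 1024 sets -> index = address bits 6..15
   Two addresses that differ only in bits >= 16 (i.e. a multiple of 64 KiB apart)
   therefore map to the same set in both caches. */
static unsigned l1d_set(unsigned long addr) { return (addr >> 6) & 0x3f;  }
static unsigned l2_set (unsigned long addr) { return (addr >> 6) & 0x3ff; }
```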

When you want to evict a line from the L1 and L2, just read or write 8 or more values to other lines spaced 64K away from your target line. Depending on the structure of your benchmark (or underlying application) you may not even need the dummy writes: in your inner loop you could simply use, say, 16 values all spaced out by 64K and not return to the first value until you've visited the other 15. In this way each line would "naturally" be evicted before you use it.

Note that the dummy writes don't have to be the same on each core: each core can write to "private" dummy lines so you don't add contention for the dummy writes.
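Here's a minimal sketch of that eviction loop (the names are made up; `dummies` is a per-core private block whose address is congruent to the target line modulo 64K, and the physical-address caveats in the complications list below still apply):

```c
#include <stddef.h>

#define EVICT_STRIDE  (64 * 1024)  /* 32K/8-way L1D and 256K/4-way L2: same set every 64 KiB */
#define N_DUMMY       16           /* comfortably more than the associativity, for pseudo-LRU */

/* Touch N_DUMMY lines that map to the same L1/L2 set as the target line,
   pushing the target out of the private caches and down into L3. */
static void evict_target_to_l3(volatile char *dummies)
{
    for (int i = 0; i < N_DUMMY; i++)
        dummies[(size_t)i * EVICT_STRIDE] = 0;   /* a dummy read works here too */
}
```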

Some complications:

  • The addresses we discuss here (when we say things like "64K away from the target address") are physical addresses. If you're using 4K pages, you can evict from the L1 by writing at offsets of 4K, but to make it work for L2 you need 64K physical offsets - but you can't get that reliably since every time you cross a 4K page boundary you are writing to some arbitrary physical page. You can solve this by ensuring you are using 2MB huge pages for the involved cache lines (see the mmap sketch after this list).
  • I said "8 or more" cache lines need to be read/written. That's because the caches are likely to use some kind of pseudo-LRU rather than exact LRU. You'll have to test: you might find that the pseudo-LRU works just like exact LRU for the pattern you are using, or you might find that you need more than 8 writes to evict reliably.

Some other notes:

  • You can use performance counters exposed by perf to determine how often you are actually hitting in L1 vs L2 vs L3 to ensure your trick is working (see the example command after this list).
  • The L3 is usually not a "standard associative cache": rather, the set is looked up by hashing more bits of the address than a typical cache. The hashing means that you won't end up using only a few lines in L3: your target and dummy lines should be spread nicely around L3. If you find you are using an unhashed L3, it should still work (because the L3 is larger you'll still be spreading out among cache sets) - but you'll have to be more careful about possible evictions from L3 as well.
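For example, something along these lines (the event names are the Skylake-era ones and vary by microarchitecture; `./your_benchmark` is a placeholder):

```sh
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l2_hit,mem_load_retired.l3_hit,mem_load_retired.l3_miss ./your_benchmark
```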
BeeOnRope
  • You could also use dummy `volatile` reads (or even `prefetcht0`) to get L1D/L2 evictions maybe more cheaply than writes, and so all threads could share the same set of dummy lines that cheaply stay hot in L3. Unless L1D favours evicting clean lines over dirty. Also a demand-load could stall OOO execution worse than a store. Anyway, worth trying if you can't use the obviously-better idea of making the conflicting writes all useful. – Peter Cordes Nov 04 '17 at 02:32
  • Yeah, that's why I said "read or write". Prefetch is a good idea, although you run a higher risk of the various heuristics dropping it and it may stop at L2 or L3 sometimes. Can a demand read stall OoO worse than a store to the same location if there are no instructions that depend on it? I guess because the load will sit in the scheduler until it is satisfied, but the store just has to sit in the store buffer? I admit to not really being clear on how load misses, the scheduler and the load buffer all interact. – BeeOnRope Nov 04 '17 at 02:40
  • Hmm, I'd always assumed that a load uop can't retire until the data actually arrives. I guess maybe if OoO sees that the physical register is "dead" then it could let it go and not even finish doing the load. I'm in the same boat; I'm not sure what work the load uop itself actually does in the load unit. I know it has to write the address into the memory-order buffer (aka the store buffer). Maybe it sets up forwarding and write-back (to a physical register) to point to its load-buffer entry? Probably OoO has to hold onto it in case of rollback for memory-order mis-speculation. – Peter Cordes Nov 04 '17 at 02:50
  • @peter - can a store retire before the actual store commits to L1 from the store buffer? Also something that I wasn't clear on but never made a question (yet) to ask. – BeeOnRope Nov 04 '17 at 02:54
  • In a few places you only said write, but now I see that in others you said read/write. Re: prefetch: if the memory pipeline is full of real work, it might be a *good* thing for prefetches to be dropped. It's not like this mechanism needs to work 100%, because the regular mechanism already works fairly well. A clean L3 hit isn't *that* much faster than reading a line that's still dirty in another core, is it? – Peter Cordes Nov 04 '17 at 02:54
  • It took me a long time to grok stores, too, but my current understanding is that a store *can't* commit to L1D until the store uop retires, i.e. becomes non-speculative. – Peter Cordes Nov 04 '17 at 02:56
  • @peter - the document you linked somewhere else shows the difference can be huge: 15 ns for a line in L3 and not in L2/L1 of another core vs 41 and 38 if the line is in the L1 or L2, respectively, of another core, on Sandy Bridge. So the idea of optimizing this is real in the vanishingly small number of places where that could matter. – BeeOnRope Nov 04 '17 at 02:57
  • Hmm, worse than a factor of 2. That's worse than I was thinking. I was tired and got bogged down reading/skimming that paper I linked in my answer (and in a comment on your ERMSB answer); I hadn't found those numbers. Still, if your producer thread is bottlenecked on L3 bandwidth, it might not help overall to do dummy reads or writes. Although in the best case your "write amplification" is only a factor of 2, not 8, if most of your "dummy" lines don't get evicted between writes. – Peter Cordes Nov 04 '17 at 03:02
  • @peter - about store retirement that would put stores in the same boat as loads with no dependents then, wouldn't it? Because you said "I'd always assumed that a load uop can't retire until the data actually arrives" I had interpreted that as perhaps being in contrast to stores. Conceptually it seems like store ops could retire before the store buffer commits: the retired but not committed part of the store buffer would be non-speculative and would have to commit eventually (ie on an interrupt it would have to be preserved or flushed). – BeeOnRope Nov 04 '17 at 03:06
  • @peter - my impression is that this kind of optimization is all about latency: passing values quickly from one core to another. Eg the OP wants a way for the producing core to signal that it doesn't want a line anymore, so put it "somewhere" it can best be consumed by another core - which is the L3 (well the L1 of the eventual consuming core is the "true" best place, but you might not know what core that was and htf are you supposed to pull that off if you did?). – BeeOnRope Nov 04 '17 at 03:12
  • Of course, at this level latency and throughput are tightly coupled: because tput is often ultimately limited by LFB occupancy time. So even in a throughput scenario (eg producer-consumer queue across threads) it could be useful. It also raises the interesting question of how prefetch interacts with caching/MESI states. Mostly we think of prefetch as working on lines in memory or in some higher cache level but not owned by another core. If the line is owned by another core will prefetch invalidate it or change its state? – BeeOnRope Nov 04 '17 at 03:14
  • My assumption about prefetch was that it would end up sending a request to the core that owns the line, resulting in both cores having the line in Shared state. (Or for `prefetchw`, doing a RFO to get Exclusive and Invalidate the other copy). I'm not *sure* it works this way; HW prefetch that interfered with other cores would be worse than the usual extra-work penalty. – Peter Cordes Nov 04 '17 at 03:34
  • re: store retirement: I think you may have misread what I wrote. I think load uops can't retire until the data arrives. I think store uops *have to* retire before the data can commit. So they're opposite. Intel's optimization manual (2.3.5.2 SnB L1D cache) says "*Completion phase.* After the store retires, *the L1 DCache moves its data from the store buffers to the DCU, up to 16 bytes per cycle.*" – Peter Cordes Nov 04 '17 at 03:39
  • @PeterCordes - correct, I misread it. So we have the kind of interesting case where the internal CPU micro-architectural "commit point" (that is, retirement) is earlier and different from the externally visible one. Doesn't this make interrupt latency suck? The store buffer might be clogged up with a ton of missed stores, and then you take an interrupt, which has to wait for the whole store buffer to drain? – BeeOnRope Nov 04 '17 at 07:37
  • Interrupts are expensive anyway; the store buffer has some time to drain while the CPU is switching itself over to ring0 / IRQ context. Are interrupts a full barrier that makes loads wait for stores to commit? If not, then there's no need to flush before doing IRQ handler work. The IRQ handler's stores can be backed up behind the other work, yes, so if the work includes port I/O or other full-barrier stuff then it has to wait then. But that's only after any I$ / D$ misses triggered by the IRQ handler to get to the point. Maybe retired stores can only use a % of the buffer to limit this? – Peter Cordes Nov 04 '17 at 08:35
  • Miss handling is pipelined by the 10 LFBs, so that limits how bad the worst case can be. – Peter Cordes Nov 04 '17 at 08:40
  • Yeah IRQs are expensive, but 60 cache misses is worse :). I guess a really bad case would be split stores, which have less resources or need two LFBs or something like that, or page-splitting stores. Perhaps they block at retirement though, dunno. I also considering that perhaps the IRQ can start with stores still outstanding and they just keep draining while the IRQ is doing its thing, but this [comment](https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/537987#comment-1809755) indicates that Intel documents that IRQs do flush the buffer. – BeeOnRope Nov 04 '17 at 09:02
  • I also found a patent about mechanisms to adjust the tradeoff between store buffer size and maximum IRQ latency, so I guess it's a real thing. – BeeOnRope Nov 04 '17 at 09:03
  • Ok, yeah I thought IRQs probably were full barriers or even serializing. I think I'd seen that list before. I wonder how much happens in parallel with the store buffer draining. Probably interrupt-handler code fetch/decode happens (but not issue / execution), i.e. front-end fills up while the out-of-order core flushes. But possibly more work could happen speculatively with shootdown to preserve the illusion of serializing while reducing IRQ latency, if that even makes sense. – Peter Cordes Nov 04 '17 at 09:27
  • About prefetch & MESI - yeah I was thinking mostly of hardware prefetch. It's usual to pack structures on 64 or 128 byte boundaries or whatever to avoid false sharing, but imagine if HW prefetch (which I think activates after as little as the second access) caused you to access the next line every time you accessed some elements in your 128B structure: you'd have false sharing even though you followed the rules to avoid it. That pattern is so common I figure they must have a way to avoid it. SW prefetch is different because if you do something stupid like that it's your own fault. @PeterCord – BeeOnRope Nov 04 '17 at 21:01
2

Intel has recently announced a new instruction that seems to be relevant to this question. The instruction is called CLDEMOTE. It moves data from higher level caches to a lower level cache. (Probably from L1 or L2 to L3, although the spec isn't precise on the details.) "This may accelerate subsequent accesses to the line by other cores ...."

https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
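With a new enough compiler the intrinsic looks roughly like this (a sketch; needs e.g. gcc -mcldemote, and on CPUs that don't support CLDEMOTE the instruction executes as a NOP, so it degrades to a harmless hint):

```c
#include <immintrin.h>

/* Hint: push the line holding p out of this core's private caches,
   presumably toward the shared L3, so another core can grab it more cheaply. */
static void demote_line(const void *p)
{
    _cldemote((void *)p);
}
```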

prl
  • If it speeds up access from other cores, it'll be to L3. So it's a bit like CLWB but stopping at L3 instead of DRAM, and probably evicting the data from L1/L2 so an invalidate message doesn't need to be sent if another core wants exclusive access, not just shared read. – Peter Cordes Apr 17 '18 at 08:15
  • [Intel's patent on this guy](https://patents.google.com/patent/WO2016106128A1/en). – BeeOnRope Apr 17 '18 at 14:31
  • Interesting, this is worth experimenting with. Peter's prefetching mechanism in order to initiate the invalidation process helps performance, but some data will be left out - and requires an additional loop. – janjust Apr 17 '18 at 14:58
-2

I believe you should not (and probably cannot) care, and hope that the shared memory ends up in L3. BTW, user-space C code runs in a virtual address space, and your other cores might (and often do) run some other unrelated process.

The hardware and the MMU (which is configured by the kernel) will ensure that L3 is properly shared.

but I'd like to experiment with performance with and without bringing the shared data into private caches.

As far as I understand (quite poorly) recent Intel hardware, this is not possible (at least not in user-land).

Maybe you might consider the PREFETCH machine instruction and the __builtin_prefetch GCC builtin (which does the opposite of what you want: it brings data into closer caches). See this and that.
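For reference, a minimal use of the builtin (again, this pulls the line *toward* the core's private caches, the opposite direction from what the question asks for; the wrapper name is made up):

```c
/* __builtin_prefetch(addr, rw, locality): rw=0 for read, locality=3 means
   "keep in all cache levels", i.e. warm L1/L2 as well as L3. */
static void warm_line(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}
```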

BTW, the kernel does preemptive scheduling, so context switches can happen at any moment (often several hundred times each second). When (at context switch time) another process is scheduled on the same core, the MMU needs to be reconfigured (because each process has its own virtual address space, and the caches are "cold" again).

You might be interested in processor affinity. See sched_setaffinity(2). Read about Real-Time Linux. See sched(7). And see numa(7).
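A minimal affinity sketch (the helper name is made up; pid 0 means the calling thread):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one CPU so the scheduler doesn't migrate it between cores. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```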

I am not sure at all that the performance hit you are afraid of is noticeable (and I believe it is not avoidable in user-space).

Perhaps you might consider moving your sensitive code into kernel space (so with CPL0 privilege), but that probably requires months of work and is probably not worth the effort. I won't even try.

Have you considered other, completely different approaches (e.g. rewriting it in OpenCL for your GPGPU) to your latency-sensitive code?

Basile Starynkevitch
  • Thanks, I expanded my question; prefetch is not ideal here because, as you said, it will do the opposite, adding even more data which will likely need to be invalidated. – janjust Nov 03 '17 at 14:29
  • I don't understand what scheduling has to do with it, or context switching. I don't think it's relevant here. – janjust Nov 03 '17 at 14:31
  • I explained why I believe context switching is highly relevant – Basile Starynkevitch Nov 03 '17 at 14:34
  • Thanks, I'm very familiar with process affinity scheduling and architecture implications of context switching etc. The latency in question is nanoseconds for a set of repeatable tests. 1 synchronization call executed trillions+ times adds up. At any rate I appreciate your input. – janjust Nov 03 '17 at 14:46
  • BTW without understanding the context and what motivates such a question, I guess you might not get helpful answers. You really should consider explaining what kind of application you are coding. – Basile Starynkevitch Nov 03 '17 at 14:55
  • @janjust: Prefetch from the reading CPU should help, if you can generate the address many cycles earlier than you're ready to do a demand-load. It should get the cache line in Shared state, triggering write-back from the Modified cache line in the private L1/L2 of another core. – Peter Cordes Nov 03 '17 at 17:37
  • @BasileStarynkevitch I'm gonna guess that the application here is HFT-related. And if I'm spot on as to exactly what the OP is trying to do, this is an XY problem. And while there are ways to do it, they are, unfortunately, secrets of the trade. – Mysticial Nov 03 '17 at 18:31
  • @PeterCordes, that's very insightful, I'm just starting to play with prefetching with the way you're suggesting - thanks! – janjust Nov 03 '17 at 20:13