
Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC).

Are there other methods/pathways for this communication to happen?

The reason it seems there must be other pathways is that Intel nearly halved the amount of L3 cache per core in its newest processor lineup (1.375 MiB per core in SKL-X) vs. previous generations (2.5 MiB per core in Broadwell-EP).

Per-core private L2 increased from 256 KiB to 1 MiB, though.

  • Memory and IPIs. L3 is just a side effect. One could stretch it and include devices too (e.g. a disk). – Margaret Bloom Sep 09 '17 at 14:26
  • Bandwidth and capacity aren't necessarily related. L3 is still fairly large overall in SKL-X. (And note that the desktop Skylake/Kaby Lake CPUs still have the same amount of L3 as previous generations, and probably still a ring bus like Haswell, instead of the mesh that SKL-X probably uses.) – Peter Cordes Sep 10 '17 at 22:06
  • @PeterCordes My understanding was that SKL-X had enough L3 that every core's L2 could be held in it (and therefore information easily exchanged between cores without going to DRAM). Wouldn't shrinking L3 (to a level that cannot hold every core's L2) increase the probability of data being fetched from RAM in order to exchange it between cores? Am I missing something? – Greg Sep 11 '17 at 00:28
  • 1
  • Sure, it means you have to design your software to reuse data sooner if you want it to still be hot in L3 when a consumer thread gets to it. It's unlikely that the only data in L3 is data that was written by one core and will next be read by another, though; most multi-threaded workloads involve plenty of private data, too. Also, SKL-X (and later?) L3 is not inclusive, so evicting shared read-only data from L3 doesn't force it out of the L2 of cores still using it. – Peter Cordes Sep 11 '17 at 00:34
  • @PeterCordes L3 is still bigger, but the newest Xeons have reduced L3 as a proportion of L2: https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Xeon_Gold_.28quad_processor.29 – Greg Sep 11 '17 at 01:01
  • 1
  • Neat, the low-core-count parts still have large L3, e.g. a 6-core Gold Xeon with 19.25 MiB L3. That's 14 slices of 1.375 MiB L3, but hopefully they don't actually have 14 mesh or ring nodes for just 6 active cores (that would be bad for latency). The smallest L3 cache is in the 10-core part, which may be using [the High-Core-Count (HCC) die](http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/6), with only 13.75 MiB of L3 to go with its 10x1 MiB of L2. (One slice per active core, while the lower core counts are special.) – Peter Cordes Sep 11 '17 at 17:43

1 Answer


There are inter-processor interrupts (IPIs), but those aren't new, and not used directly by normal multi-threaded software. The kernel might use an IPI to wake another core from low-power sleep, or to notify it that a high-priority task became runnable after a task on this CPU released an OS-assisted lock / mutex that other tasks were waiting for.

So really no, there are no other pathways.

Reduced size means you have to design your software to reuse data sooner if you want it to still be hot in L3 when a consumer thread gets to it. But note that it's unlikely that the only data in L3 is data that was written by one core and will next be read by another; most multi-threaded workloads involve plenty of private data, too. Also note that SKX L3 is not inclusive, so shared read-only data can stay hot in L2 of the core(s) using it even when it's been evicted from L3.
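
To make that "reuse data sooner" advice concrete, here's a minimal sketch (all sizes, names, and the fill/process loops are invented for illustration, not from the answer) of a producer/consumer pipeline that hands data over in cache-sized chunks, so each chunk is likely still hot when the consumer reaches it, rather than producing one huge buffer first:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hand data between threads in cache-sized chunks instead of one huge buffer.
// CHUNK_BYTES is a guess; tune it to a fraction of your L2 (1 MiB on SKX).
constexpr std::size_t CHUNK_BYTES = 64 * 1024;
constexpr std::size_t CHUNK_ELEMS = CHUNK_BYTES / sizeof(float);
constexpr std::size_t NCHUNKS     = 256;

std::vector<float> buf(NCHUNKS * CHUNK_ELEMS);
std::atomic<std::size_t> produced{0};   // progress counter: chunks finished

void producer() {
    for (std::size_t c = 0; c < NCHUNKS; ++c) {
        float *chunk = &buf[c * CHUNK_ELEMS];
        for (std::size_t i = 0; i < CHUNK_ELEMS; ++i)
            chunk[i] = static_cast<float>(i);             // placeholder "fill"
        produced.store(c + 1, std::memory_order_release); // publish chunk c
    }
}

void consumer() {
    float sum = 0;
    for (std::size_t c = 0; c < NCHUNKS; ++c) {
        while (produced.load(std::memory_order_acquire) <= c)
            ;                               // spin; real code might back off
        const float *chunk = &buf[c * CHUNK_ELEMS];
        for (std::size_t i = 0; i < CHUNK_ELEMS; ++i)
            sum += chunk[i];                         // placeholder "process"
    }
    volatile float sink = sum;              // keep the result alive
    (void)sink;
}

int main() {
    std::thread p(producer), q(consumer);
    p.join(); q.join();
}
```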

It would be really nice for developers if L3 were gigantic and fast, but it isn't. Besides the reduced size, L3 bandwidth and latency are also significantly worse in SKX than in BDW. See @Mysticial's comments about y-cruncher performance:

> The L3 cache mesh on Skylake X only has about half the bandwidth of the L3 cache on the previous generation Haswell/Broadwell-EP processors. The Skylake X L3 cache is so slow that it's barely faster than main memory in terms of bandwidth. So for all practical purposes, it's as good as non-existent.

He's not talking about communication between threads, just the amount of useful cache per core for independent threads. But AFAIK, a producer/consumer model should be pretty similar.

> From the software optimization standpoint, the cache bottleneck brings a new set of difficulties. The L2 cache is fine. It is 4x larger than before and has doubled in bandwidth to keep up with AVX512. But the L3 is useless. The net effect is that the usable cache per core is halved compared to the previous Haswell/Broadwell generations. Furthermore, doubling of the SIMD size with AVX512 makes the usable cache 4x smaller than before in terms of # of SIMD words that fit in cache.

Given all that, it may not make a huge difference whether producer/consumer threads hit in L3 or go to main memory. Fortunately, DRAM is pretty fast with high aggregate bandwidth if many threads are active. Single-thread max bandwidth is still lower than in Broadwell.


Inter-thread bandwidth benchmark numbers:

SiSoft has an inter-core bandwidth and latency benchmark. Description here.

For a 10-core (20-thread) SKX (i9-7900X CPU @ nominal 3.30 GHz), the highest result comes from a submission overclocked to 4.82 GHz cores with 3.2 GHz memory, achieving an aggregate(?) bandwidth of 105.84 GB/s and latency of 54.9 ns.

One of the lowest results is with 4GHz/4.5GHz cores, and 2.4GHz IMC: 66.11GB/s bandwidth, 76.6ns latency. (Scroll to the bottom of the page to see other submissions for the same CPU).

By comparison, a desktop Skylake i7-6700k (4C 8T 4.21GHz, 4.1GHz IMC) scores 35.51GB/s and 40.5ns. Some more overclocked results are 42.72GB/s and 36.3ns.

For a single pair of threads, I think SKL-desktop is faster than SKX. I think this benchmark is measuring aggregate bandwidth between 20 threads on the 10C/20T CPU.

This single-threaded benchmark shows only about 20GB/s for SKL-X for block sizes from 2MB to 8MB, pretty much exactly the same as main memory bandwidth. The Kaby Lake quad-core i7-7700k on the graph looks like maybe 60GB/s. It's not plausible that inter-thread bandwidth is higher than single-thread bandwidth for the SKX, unless SiSoft Sandra is counting loads + stores for the inter-thread case. (Single-thread bandwidth tends to suck on Intel many-core CPUs: see the "latency-bound platform" section of this answer. Higher L3 latency means bandwidth is limited by the number of outstanding L1 or L2 misses / prefetch requests.)
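
To put rough (assumed, not measured) numbers on that "outstanding misses" limit: bandwidth ≈ concurrency × line size / latency. With on the order of 10-12 line-fill buffers of 64 B each and ~80 ns load latency, a single core's demand misses alone top out around 10 × 64 B / 80 ns ≈ 8 GB/s; it's hardware prefetch into L2 that lifts real single-thread bandwidth above that, and higher L3/memory latency on SKX pushes this ceiling down.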

Another complication is that when running with hyperthreading enabled, some inter-thread communication may happen through L1D / L2 if the block size is small enough. See What will be used for data exchange between threads are executing on one Core with HT?, and also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.

I don't know how that benchmark pins threads to logical cores, and whether they try to avoid or maximize communication between logical cores of the same physical core.


When designing a multi-threaded application, aim for memory locality within each thread. Try to avoid passing huge blocks of memory between threads, because that's less efficient even in previous CPUs. SKL-AVX512 aka SKL-SP aka SKL-X aka SKX just makes it worse than before.

Synchronize between threads with flag variables or progress counters.
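
For a single handoff, a release/acquire flag is the whole mechanism. A minimal sketch (the names and the 42 payload are invented):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload;                        // plain data, published via the flag
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                   // write the data first...
    ready.store(true, std::memory_order_release);   // ...then publish the flag
}

void reader() {
    while (!ready.load(std::memory_order_acquire))
        ;                           // spin until the writer publishes
    assert(payload == 42);          // acquire pairs with release: data visible
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
}
```

The release/acquire pairing keeps this cheap: only the flag's cache line ping-pongs between cores, and no full memory barrier is needed. A progress counter works the same way, with a monotonically increasing index instead of a bool (as in the chunked sketch above).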

If memory bandwidth between threads is your biggest bottleneck, consider just doing the work in the producer thread (especially on the fly as the data is being written, instead of in separate passes) rather than using a separate thread at all; i.e. maybe one of the boundaries between threads is not in an ideal place in your design.
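
As a hypothetical before/after of that advice: if thread B only reduces what thread A wrote, fold the reduction into A's write loop so each element is consumed while still in registers/L1, and drop the second thread entirely:

```cpp
#include <cstddef>
#include <vector>

// Fused version: produce and reduce in one pass, in one thread, so nothing
// has to travel between cores at all. (The separate-thread version would
// stream the whole vector through L3 or DRAM to the consumer.)
double produce_and_sum(std::vector<double> &out) {
    double sum = 0.0;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<double>(i) * 0.5;  // placeholder "produce" step
        sum += out[i];                          // placeholder "consume" step
    }
    return sum;
}

int main() {
    std::vector<double> v(1 << 20);
    return produce_and_sum(v) > 0.0 ? 0 : 1;    // trivial use of the result
}
```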

Real life software design is complicated, and sometimes you end up having to choose between poor options.

Hardware design is complicated, too, with lots of tradeoffs. It appears that SKX's L3 cache + mesh do worse than the old ring-bus setup for medium-core-count chips; presumably it's a win for the biggest chips for some kinds of workloads. Hopefully future generations will have better single-core latency / bandwidth.

  • I have to admit I didn't really understand Mysticial's comment about the bandwidth; how did he deduce that from the frequency experiment? – Leeor Sep 14 '17 at 17:41
  • @Leeor: Overclocking the cache+DRAM while keeping CPU core frequency constant shows you how much speedup you get *just* from making that part faster. There's a much bigger gain on SKX than on BDW, so the cache is a bigger bottleneck on SKX than on BDW. – Peter Cordes Sep 14 '17 at 22:07
  • Sure, this benchmark is more sensitive to mesh/mem (mostly mem) freq, but the overall performance is also much better, so it's possible that the working point shifted. One simple explanation is that we were bounded on execution in the core on BDW, but with AVX3 we get higher exec BW and therefore become bottlenecked on something else (memory), even if the memory capabilities had remained the same (and I assume it was even improved). I don't understand how this leads him to claim the SKX has half the BW of BDW. Higher sensitivity does not indicate a relative slowdown. – Leeor Sep 18 '17 at 16:59
  • @Leeor: Oh, I see what you mean. SKX L3 seems to have about half the absolute bandwidth IIRC (from separate tests, not the frequency experiment). Then the part you're talking about is Mysticial making the same point you are: with the higher execution throughput, cache is a much bigger bottleneck. Cache bandwidth compared to execution bandwidth due to wider vectors means that L3 is no longer useful, so you have to cache-block for L2. – Peter Cordes Sep 19 '17 at 02:22