
I have the following problem: I have a low-latency application running on core 0, and a regular application running on core 1. I want to make sure the core 0 app gets as much cache as possible; therefore, I want to make core 1 bypass the L3 cache (not use it at all) and go directly to memory for its data.

Are there any other ways to ensure that the core 0 app gets priority in using the L3 cache?

asked by Bogi; edited by Peter Cordes

2 Answers


Some Intel CPUs support partitioning the L3 cache between different workloads or VMs: Cache Allocation Technology (CAT). It's been supported since Haswell Xeon (v3), and apparently on 11th-gen desktop/laptop CPUs as well.

Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.


More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux, and to some extent FreeBSD) for managing that and the other parts of what Intel now calls "Resource Director Technology" (RDT): monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
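
For concreteness, here's a minimal sketch of setting up CAT through the Linux kernel's resctrl filesystem (the same mechanism the intel-cmt-cat `pqos` tool can drive). The way masks below assume a hypothetical 12-way L3; check `/sys/fs/resctrl/info/L3/cbm_mask` for what your CPU actually supports. This needs root and a kernel with resctrl support.

```c
/* Sketch, not a drop-in solution: give the background app on core 1 only
 * 2 of an assumed 12 L3 ways, leaving the other 10 to the default group
 * (where the core 0 app stays). Requires root and resctrl mounted at
 * /sys/fs/resctrl (mount -t resctrl resctrl /sys/fs/resctrl). */
#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f || fputs(text, f) == EOF) { perror(path); exit(1); }
    fclose(f);
}

int main(void)
{
    /* A new resource group for the regular (background) workload. */
    if (system("mkdir -p /sys/fs/resctrl/background") != 0) return 1;

    /* Capacity bitmasks must be contiguous. "L3:0=..." targets cache
     * domain 0; a multi-socket system has more domains. */
    write_file("/sys/fs/resctrl/schemata",            "L3:0=ffc\n"); /* 10 ways */
    write_file("/sys/fs/resctrl/background/schemata", "L3:0=003\n"); /*  2 ways */

    /* Anything scheduled on CPU 1 is now limited to those 2 ways. */
    write_file("/sys/fs/resctrl/background/cpus_list", "1\n");
    return 0;
}
```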

What you describe would be literally impossible on a desktop Intel CPU (or a Xeon before Skylake), as they use inclusive L3 cache: a line can only be in L2/L1 if it's in L3 (at least the tags, though not necessarily up-to-date data if a core has the line in Modified or Exclusive state). Skylake-X and later Xeons have non-inclusive L3, so it would be possible in theory; IDK if CAT lets you give one set of cores zero L3.


I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.

Peter Cordes
  • OK, another question: can I do it by rewriting the app on core 1, e.g. to use non-temporal loads instead of regular loads? Would this in theory decrease the consumption of L3 cache by the app on core 1? – Bogi Jan 20 '23 at 15:39
  • @Bogi: On x86, barely plausible in theory. You'd have to use WC (uncacheable write-combining) memory, otherwise the NT hint on SSE4.1 `movntdqa` will be ignored by existing CPUs. And there are no other NT load instructions, only NT stores (which do bypass all levels of cache and force eviction if hot, even on WB memory). So you'd have to get a compiler to never use normal loads, bouncing all data through XMM or YMM registers. Maybe you'd be OK with some regular loads for scalar local vars and return addresses in stack memory, but this would still absolutely destroy performance. (A sketch of such a loop follows these comments.) – Peter Cordes Jan 20 '23 at 15:59
  • @Bogi: I was assuming that a slowdown of maybe a factor of hundreds, and probably higher memory-bandwidth usage, would be unacceptable for the non-realtime application. NT *prefetch* can avoid L3 pollution entirely on CPUs with non-inclusive L3 cache, for arrays you're looping over, if you tune the prefetch distance correctly so data is ready in L1d before a demand load. (Maybe a demand load that hits an in-flight NT prefetch avoids promoting it from NT to regular.) Related: [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) – Peter Cordes Jan 20 '23 at 16:02
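
To make the comment thread above concrete, here's a hedged sketch of what an NT-load loop looks like with intrinsics. Per the caveat above, on ordinary write-back (WB) memory current CPUs ignore the NT hint on `movntdqa`, so the `prefetchnta` is the part that can actually limit pollution (and only bypasses L3 entirely on CPUs with non-inclusive L3). The prefetch distance is a made-up starting point, not a tuned value.

```c
/* Sketch: summing an array with NT loads + NT prefetch (SSE4.1).
 * Assumes src is 16-byte aligned and n is a multiple of 4, for brevity.
 * Build with e.g.: gcc -O2 -msse4.1 -c ntload.c */
#include <immintrin.h>
#include <stddef.h>

float sum_nt(const float *src, size_t n)
{
    const size_t PD = 512;           /* prefetch distance in floats: tune this */
    __m128 acc = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 4) {
        if (i + PD < n)              /* prefetchnta: hint to minimize cache pollution */
            _mm_prefetch((const char *)(src + i + PD), _MM_HINT_NTA);

        /* movntdqa: truly non-temporal only on WC memory; on WB memory
         * it behaves like a normal 16-byte load on existing CPUs. */
        __m128i v = _mm_stream_load_si128((__m128i *)(src + i));
        acc = _mm_add_ps(acc, _mm_castsi128_ps(v));
    }

    /* Horizontal sum of the 4 lanes of acc. */
    __m128 hi = _mm_movehl_ps(acc, acc);
    __m128 s  = _mm_add_ps(acc, hi);
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}
```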

On AMD Epyc: it would be good if you could move your low-latency application onto an isolated core complex (CCX).

In Zen 2, 4 cores share a 16 MB slice of L3. In Zen 3, 8 cores share a 32 MB slice. Make sure your low-latency application is the only one that can touch the L3 slice of the core it's running on.
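
A sketch of the pinning side, assuming core 0 happens to sit in a CCX you've otherwise kept idle (check your actual topology with `lstopo` or under `/sys/devices/system/cpu/`):

```c
/* Sketch: pin this process to a single core so it stays on one CCX's
 * L3 slice. Core 0 is an assumption; pick a core in an otherwise-idle CCX. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* assumed: core 0 is in the isolated core complex */

    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... latency-critical work runs here, confined to that CCX ... */
    return 0;
}
```

Keeping the rest of that CCX idle is the other half of the job: e.g. boot with `isolcpus=` or use cpuset cgroups so no other task lands on those cores.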

northwindow