I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers explaining that there are other variants, like Memory bound, Cache bound, etc.

I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.

However, the difference between CPU bound and Cache bound programs doesn't seem as clear. Here is what I gathered:

Cache bound - The speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most-visited part of a program is a small chunk of code inside a loop, and that chunk is small enough to be contained within the cache, then the program may be cache bound.

CPU bound - The speed at which the CPU executes instructions is an important factor in deciding the speed at which the program gets executed.

But how can processes be CPU bound? I mean, instructions need to be fetched before execution (from cache or main memory) every time, so no matter how fast the CPU is, it will have to wait for the cache to finish the data transfer, and thus will be at least Cache bound or Memory bound, since memory access is slower than instruction execution.

So is CPU bound the same as cache bound?
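
To make the distinction concrete, here is a minimal sketch in C (an editorial illustration of the scenario above, not code from any answer; iteration counts and buffer sizes are arbitrary). The first loop works on a single register-resident value, so only the CPU's execution speed matters; the second streams through a buffer far larger than any cache, so it mostly waits on memory. Compile with something like `gcc -O2`; on some machines the second loop may need more unrolling (or `-O3`) before DRAM, rather than FP-add latency, is truly the limit.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64L * 1024 * 1024)   /* 64M doubles = 512 MB, far bigger than any cache */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* CPU bound: a dependent chain of FP operations on one register value.
       The working set is a few bytes, so caches are irrelevant; the limit
       is the CPU's floating-point latency/throughput. */
    double t = now(), x = 1.0;
    for (long i = 0; i < 200000000L; i++)
        x = x * 1.0000001 + 1e-9;
    printf("compute loop: %.2f s (x=%f)\n", now() - t, x);

    /* Memory bound: stream through 512 MB doing one add per element.
       Independent accumulators keep FP-add latency from being the limit,
       so the loop spends its time waiting on DRAM instead. */
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;   /* touch every page first */
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    t = now();
    for (long i = 0; i < N; i += 4) {
        s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3];
    }
    printf("memory loop:  %.2f s (sum=%.0f)\n", now() - t, s0 + s1 + s2 + s3);

    free(a);
    return 0;
}
```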

Karthiksrndrn
  • *instructions need to be fetched before execution (from cache/ Main Memory)*. Almost all CPUs use a split L1 cache, so instruction-fetch doesn't compete with data load/stores (among other reasons). When code is hot in L1 cache, the cache itself is not the bottleneck. Fetch/decode bottlenecks are called "front-end" bottlenecks. – Peter Cordes Dec 11 '16 at 06:31
  • Also, it's not even true that instructions need to be fetched from L1 I-cache every time they run: Intel SnB-family CPUs have a decoded-uop cache, and also a loop buffer, so they can run medium to small loops without re-decoding the instructions. – Peter Cordes Dec 11 '16 at 06:33
  • I haven't heard the term cache-bound, but I assume it means that the working set fits in L2 or L3 cache, but not L1 cache. So the code bottlenecks on bandwidth and/or latency to a larger and slower cache than L1D. Code-cache bottlenecks would probably be mentioned specifically, because that's relatively unusual. – Peter Cordes Dec 11 '16 at 06:34
  • If you want to be really specific, there are different kinds of CPU-bound ([front-end, latency, or throughput of a specific execution port](http://stackoverflow.com/questions/40878534/latency-vs-throughput-in-intel-intrinsics/40879258#40879258), and also branch-mispredicts). These distinctions can make the difference between Hyperthreading (or any other kind of SMT) being useful or not. Code with lots of branch mispredicts or latency bottlenecks will probably scale well with HT, since each thread doesn't fully use the execution throughput of a core. – Peter Cordes Dec 11 '16 at 06:38
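
To make the latency-vs-throughput distinction from the last comment concrete, here is a small C sketch (an editorial illustration with arbitrary iteration counts, not from the comments). Both loops perform the same number of additions, but the first is one dependent chain, so it is latency bound and leaves execution ports idle (the case where SMT helps), while the second uses four independent accumulators and is bound by FP-add port throughput. Compile with `gcc -O2`; the exact speed ratio depends on the core's FP-add latency and port count.

```c
#include <stdio.h>

int main(void) {
    const long iters = 400000000L;

    /* Latency bound: every add depends on the previous one, so the loop
       runs at one add per FP-add latency (several cycles). Most of the
       core's execution capacity sits idle; an SMT sibling could use it. */
    double a = 0.0;
    for (long i = 0; i < iters; i++)
        a += 1e-9;

    /* Throughput bound: four independent chains can keep the FP-add
       ports busy nearly every cycle, so this finishes several times
       faster despite doing the same number of additions. */
    double b0 = 0, b1 = 0, b2 = 0, b3 = 0;
    for (long i = 0; i < iters; i += 4) {
        b0 += 1e-9; b1 += 1e-9; b2 += 1e-9; b3 += 1e-9;
    }

    printf("%f %f\n", a, b0 + b1 + b2 + b3);   /* keep results live */
    return 0;
}
```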

1 Answer

CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow while others remain empty; both cases are bad utilization, and you need to find the jam to release everything. Similarly, in a CPU you have multiple systems that need to work in unison to make the program progress. Each of these systems has an upper limit on the bandwidth at which it can work, and when that limit is reached it becomes the bottleneck, leaving the other systems underutilized or even stalled.

The main memory, for example, depends on the number of channels and the type of DRAM (and of course the frequency), but let's say it commonly peaks at ~25 GB/s in client CPUs. That means that any workload that tries to consume data beyond this rate will become blocked by the memory BW (i.e. memory bound), and the rest of the systems will be underutilized.

Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.

According to section 2.1.3 of Intel's optimization guide, Skylake for example provides 2 × 32B loads + 1 × 32B store per cycle from the L1 (though the actual sustained utilization they quote is a little lower, probably due to collisions or writeback interference); the L2 is effectively good for about 1/2 a cache line per cycle, and the L3 a little less than 1/3. This means that if your data set is contained in one of these levels, you can reach that level's peak BW before being capped by that cache.
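
As a rough way to see those plateaus yourself, here is a sketch (an editorial illustration, not from the guide) that re-reads buffers of growing size and reports the achieved rate. The printed GB/s should step down each time the working set spills out of L1, then L2, then L3, finally settling near the DRAM figure from the previous paragraph. Compile with something like `gcc -O3 -ffast-math` so the reduction vectorizes; otherwise FP-add latency caps every size at the same rate.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long long traffic = 4LL << 30;            /* ~4 GB read per size */

    /* Sweep working-set sizes from L1-sized up to DRAM-sized. */
    for (long long size = 16 << 10; size <= 256LL << 20; size *= 4) {
        long long n = size / (long long)sizeof(double);
        double *buf = malloc(size);
        for (long long i = 0; i < n; i++) buf[i] = 1.0;   /* warm it up */

        double sum = 0.0, t = now();
        for (long long done = 0; done < traffic; done += size)
            for (long long i = 0; i < n; i++)
                sum += buf[i];
        double secs = now() - t;

        printf("%9lld KB: %6.1f GB/s (sum=%.0f)\n",
               size >> 10, traffic / secs / 1e9, sum);
        free(buf);
    }
    return 0;
}
```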

On the other hand, let's say you don't reach the peak cache bandwidth, and instead consume data from the L1 at a lower rate, but each element of data requires many complicated mathematical operations. In that case, you may be bound by your execution bandwidth instead; even more so if these operations are limited to only part of the execution ports (as is the case with some esoteric operations).
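
Here is a sketch of that case (again an editorial illustration): the whole array fits easily in L1, but each element feeds a dependent FP divide, and on typical x86 cores the divider lives on a single execution port and takes many cycles per result, so execution, not the cache, sets the pace. Compile with `gcc -O2`.

```c
#include <stdio.h>

#define N 1024   /* 8 KB of doubles: fits comfortably in L1 */

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++)
        a[i] = i + 1.0;

    /* Every load hits L1, yet the loop is execution bound: each iteration
       is a dependent FP divide (~13-20 cycles on a single port), far slower
       than the one-load-per-element traffic the L1 could easily supply. */
    double x = 1.0;
    for (int rep = 0; rep < 20000; rep++)
        for (int i = 0; i < N; i++)
            x = 1.0 / (x + a[i]);

    printf("%f\n", x);   /* keep the result live */
    return 0;
}
```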

There are useful tools to determine what you're bound by; look up TopDown analysis, for example.

Leeor
  • I thought the actual reason for the HSW/BDW/SKL sustained L1 throughput quoted by Intel's optimization manual (~83B/c IIRC) being lower than peak (96B/cycle) is imperfect uop scheduling. Every time a store-address uop is scheduled to p23 instead of p7, it prevents that port from executing a load. – Peter Cordes Dec 13 '16 at 14:12
  • No, I think they added a special port for STAs; see the port diagram here - http://wccftech.com/idf-2013-intel-details-haswell-microarchitecture-overclocking-features-4th-generation-hd-graphics-core/ - it's even stated that the intention is to reduce contention with loads. – Leeor Dec 13 '16 at 22:16
  • Right, that's port 7. It can handle simple addressing modes only (non-indexed IIRC, but don't quote me on this). Store-address uops can still be allocated to ports 2 and 3. This does happen in practice even for simple addressing modes that could have run on port 7. uop -> port allocation happens at issue time, based on counters that give a heuristic picture of contention for each port. Apparently the logic doesn't special-case port7 and force STA uops to use it when possible. – Peter Cordes Dec 13 '16 at 22:24
  • See also http://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly for uop-scheduling details, although it doesn't focus on port 7. Anyway, having port7 does significantly reduce contention for ports 2 and 3, which is why even the sustained throughput is higher than 64B/c. – Peter Cordes Dec 13 '16 at 22:28
  • Link is dead and not on archive.org Can you verify https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html roughly covers what the link should cover instead? – worldsmithhelper Aug 10 '22 at 18:31