
I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).

For example, let's consider:

_mm_cmpestrc

This has a latency of 11 and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get continuous per-cycle output after 11 cycles? Since that would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?

I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.

Jimbo
  • throughput = 7 means one can start every 7 cycles. Latency = 11 means that a single result takes 11 cycles. So on average ~1.5 are in flight at any given time, and never more than 2. (Although it's a multi-uop instruction, so the scheduler might end up interleaving uops from more instructions for some reason.) And BTW, Agner Fog's numbers for PCMPESTRI on Haswell don't match up with Intel's. – Peter Cordes Nov 30 '16 at 04:34

1 Answer


For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the tag wiki, especially Intel's optimization manual.

Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.

You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).

Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism: e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (with -funroll-loops -ffast-math) will often reuse the same accumulator register, which only reduces loop overhead, and loop overhead wasn't the problem here.
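
Here's a minimal sketch of that manual unroll (my code, not from the question; it assumes AVX and that n is a multiple of 64 floats, with the cleanup loop omitted):

```c
#include <immintrin.h>
#include <stddef.h>

float sum_unrolled(const float *a, size_t n) {
    // 8 independent accumulators = 8 independent dependency chains: enough to hide
    // Skylake's 4-cycle ADDPS latency at its 0.5-cycle throughput (4 / 0.5 = 8 in flight).
    __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
    __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
    __m256 s4 = _mm256_setzero_ps(), s5 = _mm256_setzero_ps();
    __m256 s6 = _mm256_setzero_ps(), s7 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 64) {
        s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i));
        s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8));
        s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16));
        s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24));
        s4 = _mm256_add_ps(s4, _mm256_loadu_ps(a + i + 32));
        s5 = _mm256_add_ps(s5, _mm256_loadu_ps(a + i + 40));
        s6 = _mm256_add_ps(s6, _mm256_loadu_ps(a + i + 48));
        s7 = _mm256_add_ps(s7, _mm256_loadu_ps(a + i + 56));
    }
    // Combine the accumulators only once, at the end, where their latency barely matters.
    __m256 s = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3)),
                             _mm256_add_ps(_mm256_add_ps(s4, s5), _mm256_add_ps(s6, s7)));
    float buf[8];
    _mm256_storeu_ps(buf, s);
    return buf[0] + buf[1] + buf[2] + buf[3] + buf[4] + buf[5] + buf[6] + buf[7];
}
```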


Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since loads and stores use different execution ports than ALU instructions, that front-end limit can be the bottleneck when data is hot in L1 cache.
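
As a rough illustration of counting against that 4-per-clock issue limit (my example; the uop counts are approximate assumptions, not measurements):

```c
#include <immintrin.h>
#include <stddef.h>

// dst[i] = a[i] + b[i]; assumes AVX and that n is a multiple of 8 (cleanup loop omitted).
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        // A typical compiled inner loop here is roughly: a load, a load folded into
        // VADDPS, a store, pointer/counter updates, and a macro-fused cmp/branch --
        // on the order of 5 to 7 fused-domain uops per iteration. At 4 fused-domain
        // uops per clock from the frontend, that caps the loop around 1.5 to 2 cycles
        // per iteration even if no single execution port is saturated.
        __m256 v = _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        _mm256_storeu_ps(dst + i, v);
    }
}
```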

And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result (i.e. a destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
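
For example (a hypothetical helper of mine, not something from the question), a value that stays live past a legacy-SSE instruction that would overwrite it costs a register copy:

```c
#include <emmintrin.h>

__m128i mix(__m128i a) {
    __m128i t = _mm_srli_epi32(a, 4); // legacy-SSE PSRLD overwrites its source register,
                                      // but 'a' is needed again below, so without AVX the
                                      // compiler has to MOVDQA-copy 'a' before shifting.
    return _mm_xor_si128(t, a);       // built with -mavx, VPSRLD can write a fresh
                                      // register and the extra copy disappears.
}
```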


I think I have a decent understanding of the difference between latency and throughput

Your guesses don't seem to make sense, so you're definitely missing something.

CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock).

  • (reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.

  • Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.

    If the next iteration of a loop operates independently of the previous one, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.

See also Latency bounds and throughput bounds for processors for operations that must occur in sequence for an example of a practice problem from CS:APP with a diagram of two dep chains, one also depending on results from the other.
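
To make the latency case concrete, here's the naive single-accumulator version of the array sum (my sketch; n assumed to be a multiple of 4, cleanup omitted):

```c
#include <xmmintrin.h>
#include <stddef.h>

float sum_naive(const float *a, size_t n) {
    // Loop-carried dependency chain: each _mm_add_ps needs the previous iteration's
    // 'sum', so iterations can't overlap and the loop runs at roughly one vector per
    // ADDPS *latency* (3-4 cycles), nowhere near the throughput limit (0.5-1 cycle).
    // The multi-accumulator unroll shown earlier in this answer is the fix.
    __m128 sum = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        sum = _mm_add_ps(sum, _mm_loadu_ps(a + i));
    float buf[4];
    _mm_storeu_ps(buf, sum);
    return buf[0] + buf[1] + buf[2] + buf[3];
}
```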

Peter Cordes
  • At a simple level, this confirms my suspicion that these numbers are only really straightforward when the intrinsics are used in isolation. What I still don't understand from your answer is what resources limit multiple instructions (generally of the same type) from executing sequentially. As you mention, the number of execution units is one limitation. What about maxing out the number of SIMD registers? Agner's documents, particularly the microarchitecture guide, seem particularly interesting and relevant in understanding the implications of various design approaches. – Jimbo Dec 01 '16 at 15:20
  • 1
  • Yes, the main throughput resource they compete for is execution ports. e.g. on Haswell and later, all shuffles run on port 5, so they all compete with each other. PADD* (`_mm_add_epi8/16/32/64`) can run on p1 or p5, so shuffles reduce the maximum add throughput. (And due to imperfect out-of-order scheduling, some PADDB instructions will steal port 5 even when the shuffle is on the critical path and the add isn't. Extra latency because uops have to wait for an execution port after their operands are ready is called a "resource conflict".) – Peter Cordes Dec 01 '16 at 15:41
  • @Jimbo: If the compiler runs out of vector regs, it has to use some extra load instructions. (And maybe stores, too, if it has to spill temporaries instead of just re-loading stuff that already needs to go to memory at some point (or was read-only in the first place).) Extra instructions = extra fused-domain uops. BTW, thanks for the feedback on exactly what this answer left unclear. That will help if/when I get back to improving it after posting in a hurry. – Peter Cordes Dec 01 '16 at 15:43
  • 3
  • I can't emphasize enough reading the guides that Peter links at the top, especially the [optimizing in assembly](http://www.agner.org/optimize/optimizing_assembly.pdf) guide - it goes over several worked examples that show *exactly* how this works - and answers the questions you don't even know you have yet. Don't be fooled - you might be "writing in C/C++", but when using intrinsics it's closer to assembly than C (and you should know assembly anyway to check that the compiler hasn't done anything terrible - it often does). – BeeOnRope Dec 01 '16 at 16:47
  • 1
  • @Jimbo: Totally agree with BeeOnRope here. For really high performance, you need to check the compiler output. And you need to think of C + intrinsics as a "portable assembly language", so you write your code similar to how optimal asm would look (including for the code around the intrinsics). Although that's not really true, because clang will often optimize your intrinsics (more so than gcc or icc). E.g. it has its own internal representation for shuffles, so it just knows what goes where and forgets which intrinsic you used when choosing what instruction to emit. – Peter Cordes Dec 01 '16 at 16:54