
The Intel Intrinsics Guide lists a value for both latency and throughput for most instructions. Example:

__m128i _mm_min_epi32

Performance
Architecture Latency Throughput
Haswell      1       0.5
Ivy Bridge   1       0.5
Sandy Bridge 1       0.5
Westmere     1       1
Nehalem      1       1

What exactly do these numbers mean? I guess a higher latency means the instruction takes longer to execute, but does a throughput of 1 for Nehalem versus 0.5 for Ivy Bridge mean the instruction is faster on Nehalem?

Peter Cordes
Alexandros
  • Modern cores have *two* execution units that can execute the instruction at the same time. So if the sun is shining and you've got the wind in your back and your program has two of these close together then they both complete in a single cycle. Making it look to your profiler that they took half a cycle. – Hans Passant Feb 15 '15 at 23:22

2 Answers


The "latency" for an instruction is how many clock cycles it takes the perform one instruction (how long does it take for the result to be ready for a dependent instruction to use it as an input). If you have a loop-carried dependency chain, you can add up the latency of the operations to find the length of the critical path.

If you have independent work in each loop iteration, out-of-order exec can overlap it. The length of a dependency chain (in latency cycles) tells you how hard OoO exec has to work to overlap multiple instances of that chain.


Normally throughput is the number of instructions per clock cycle, but this is actually *reciprocal* throughput: the number of clock cycles per independent instruction start. So 0.5 clock cycles means that 2 such instructions can start in one clock cycle, with each result ready on the next clock cycle.

Note that execution units are pipelined, all but the divider being fully pipelined (able to start a new instruction every clock cycle). Latency is separate from throughput (how often an independent operation can start). Many instructions are single-uop, so their reciprocal throughput is usually 1/n where n is a small integer (the number of ports with an execution unit that can run that instruction).

Intel documents that here: https://software.intel.com/en-us/articles/measuring-instruction-latency-and-throughput


To find out whether two different instructions compete with each other for the same throughput resource, you need to consult a more detailed guide. For example, https://agner.org/optimize/ has instruction tables and a microarch guide. These go into detail about execution ports, and break down instructions into the three dimensions that matter: front-end cost in uops, which back-end ports, and latency.

For example, _mm_shuffle_epi8 and _mm_cvtsi32_si128 both run on port 5 on most Intel CPUs, so compete for the same 1/clock throughput. But _mm_add_epi32 runs on port 1 or port 5 on Haswell, so its 0.5c throughput only partially competes with shuffles.

https://uops.info/ has very detailed instruction tables from automated testing, including latency from each input separately to the output.

Agner Fog's tables are nice (compact and readable) but sometimes have typos or mistakes, and give only a single latency number, so you don't always know which input formed the dep chain.

See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Peter Cordes
Mats Petersson
  • This is described here: https://software.intel.com/en-us/articles/measuring-instruction-latency-and-throughput – Phil Miller Feb 15 '15 at 23:12
  • @Novelocrat: So it is. – Mats Petersson Feb 15 '15 at 23:14
  • No, throughput is the number of instructions per clock cycle. Intel's is quoting the reciprocal throughput and calling it throughput. – Z boson Feb 16 '15 at 08:01
  • Based on Intel's definition in the other answer here their use of throughput is consistent. I'm just used to the one used by Agner Fog. But in all his tables he uses the reciprocal throughput anyway so perhaps Intel's definition is more practical. His definition is useful when calculating FLOPS but in any case it's just a question of inverse. – Z boson Feb 16 '15 at 15:22
  • Yes, it's just a case of "whichever way around you look at it" - even for many other things, there is often a "regular" and "inverse" measurement, for example resistance is measured in ohms, but you can also measure conductivity, which is typically measured in "mho" or 1/Ohm. – Mats Petersson Feb 16 '15 at 20:07
  • @Zboson IMHO Fog's definition is much more consistent than Intel's with respect to what throughput actually means in *real life*, i.e. some quantity per amount of time. I claim that what Intel calls "throughput" **should** be called "reciprocal throughput". – hdl Jan 15 '16 at 13:03
  • @Mats: I edited in a significant amount of new text. I could post that as a separate answer if you don't want it in your answer, but I think having it in the accepted answer is good. – Peter Cordes Nov 19 '19 at 02:55

The following is a quote from Intel's page Measuring Instruction Latency and Throughput.

Latency and Throughput

Latency is the number of processor clocks it takes for an instruction to have its data available for use by another instruction. Therefore, an instruction which has a latency of 6 clocks will have its data available for another instruction that many clocks after it starts its execution.

Throughput is the number of processor clocks it takes for an instruction to execute or perform its calculations. An instruction with a throughput of 2 clocks would tie up its execution unit for that many cycles which prevents an instruction needing that execution unit from being executed. Only after the instruction is done with the execution unit can the next instruction enter.