Relation between CPI and number of execution units when looking at SIMD intrinsics

Question

I understand that the term Cycles Per Instruction (CPI) closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.

Now, let's get concrete by taking the example of _mm256_shuffle_ps, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.

Questions:

Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
How can a programmer know which separate instructions involve the same execution units?
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?

Thanks in advance for the transfer of knowledge.

Rather than two identical EUs, I'd say two EUs that can perform vector shuffles. If I'm not wrong Alder Lake is based on Sunny Cove, see [this](https://en.wikichip.org/wiki/intel/microarchitectures/sunny_cove#Block_diagram). The "wayness" is probably the maximum number of dispatches possible per clock. — Margaret Bloom, Feb 09 '23 at 11:40
@MargaretBloom Looking at the image you've linked in your comment, and seeing only a handful of familiar sounding EUs, is it safe to say that operations similar to the shuffle such as e.g. `_mm256_permutevar8x32_ps` will also be implemented using the same 3 or so shuffle EUs? — Nitin Malapally, Feb 09 '23 at 12:52
That's `vpermps` and on Alder Lake it can use (the EU behind) port 5. There's this awesome site called uops.info that has detailed information about each instruction, including [`vpermps`](https://www.uops.info/html-instr/VPERMPS_YMM_YMM_YMM.html). — Margaret Bloom, Feb 09 '23 at 14:14
Hrm, there are multiple questions here. I picked duplicates mostly based on the part about knowing whether two different instructions will compete with each other for throughput resources (specifically back-end execution ports). As for what 4-wide means: it refers to the narrowest part of the pipeline, the issue/rename stage. For more detail, read Agner Fog's microarch guide (at least the entry for Sandybridge) and then the linked duplicates. — Peter Cordes, Feb 10 '23 at 04:50

score 0 · Answer 1 · answered Feb 10 '23 at 04:34

Superscalar is usually a term you'd apply to CPUs of old, e.g. the original Pentium. Back in those days, you'd have two separate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core (although that came with caveats, e.g. pipeline bubbles could be an issue).

These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPUs have 12 execution units, however those execution units tend to be specialised to some extent, e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.

Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?

No, you can't assume that. You can assume that there are two execution units which are capable of executing _mm256_shuffle_ps, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256-bit YMM registers, and we can see from the instruction timings that all 3 can perform _mm256_add_epi32. However, only 2 can perform _mm256_shuffle_ps, and only 1 can perform _mm256_div_ps, so they are clearly not the same.

How can a programmer know which separate instructions involve the same execution units?

Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (as is the case for, e.g., the Apple M1).
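Once per-instruction port data is available (for example from uops.info, which the comments point to), checking whether two single-uop instructions can contend comes down to intersecting their port sets. The following is only a minimal sketch: the names `port_mask`, `may_compete` and `port_count` are made up for illustration, and the port sets are assumptions taken from the figures quoted in this thread (`vshufps` on two ports, consistent with the 0.5 CPI in the question; `vpermps` and `vpermilps` on port 5 only; the FP divider on port 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per execution port: bit n set means "can run on port n". */
typedef uint16_t port_mask;
#define PORT(n) ((port_mask)(1u << (n)))

/* Port sets as quoted in this thread; treat them as assumptions,
   not as authoritative data for any particular chip. */
static const port_mask VSHUFPS_YMM   = PORT(1) | PORT(5);
static const port_mask VPERMPS_YMM   = PORT(5);
static const port_mask VPERMILPS_YMM = PORT(5);
static const port_mask VDIVPS_YMM    = PORT(0);

/* Two single-uop instructions can contend for an execution port
   only if their port sets overlap. */
static bool may_compete(port_mask a, port_mask b) {
    assert(a != 0 && b != 0);
    return (a & b) != 0;
}

/* Number of ports an instruction can use. For a single-uop
   instruction, the best-case reciprocal throughput is 1 / this. */
static unsigned port_count(port_mask m) {
    unsigned n = 0;
    for (; m != 0; m &= (port_mask)(m - 1)) {
        n++; /* each iteration clears the lowest set bit */
    }
    return n;
}
```

Under these assumptions, `may_compete(VSHUFPS_YMM, VPERMPS_YMM)` is true (both can use port 5), while `may_compete(VPERMPS_YMM, VDIVPS_YMM)` is false. Multi-uop instructions would need one mask per uop, which this sketch does not model.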

If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?

Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.

Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order-Execution. The types of instruction the execution units can handle are typically chosen to cover a range of common cases. For example, consider this code:

#include <immintrin.h>

// r, a and b must be 16-byte aligned and hold at least 512 floats each
void func(float* r, float* a, float* b) {

   // basic integer ops: increment and less-than
   for(int i = 0; i < 128; ++i) {

      // 2 address manipulation instructions
      float* addr_a = a + i * 4;
      float* addr_b = b + i * 4;

      // 2 load instructions
      __m128 A = _mm_load_ps(addr_a);
      __m128 B = _mm_load_ps(addr_b);

      // an addition
      __m128 R = _mm_add_ps(A, B);

      // another address manipulation op
      float* addr_r = r + i * 4;

      // a store instruction
      _mm_store_ps(addr_r, R);
   }
}

Providing 12 execution units that are all capable of executing an _mm_add_ps instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).

More relevantly, all shuffles can run on port 5, but only *some* common ones can also run on port 1 on Ice Lake and later. The FP divider is of course a separate execution unit, and it's on port 0, not even on the same port as the extra shuffle unit. The question of whether there are two identical shuffle units on different ports has to be answered by looking for any shuffle uops that can't run on port 1 for YMM or narrower, and the answer is yes, there are, such as `vpermilps`, port 5 only. (https://uops.info/). Seems arbitrary, not based on being 1-input or variable control (`vpshufb`) — Peter Cordes, Feb 10 '23 at 04:47
*you're pretty much limited to making educated guesses* - or using hardware performance counters like `uops_dispatched_port.port_0` in a microbenchmark that runs that instruction many times. https://uops.info/ goes a step farther: for multi-uop instructions, mixing them with other instructions that can only run on a specific port, to see if any of their uops still have to get scheduled to a busy port. That lets them figure out whether an instruction is really 2p0156 vs. p15 + p06, for example. — Peter Cordes, Feb 10 '23 at 04:52
And yeah, Intel's optimization guide has good info, and summary versions of that kind of thing show up in technical presentations that CPU review sites get, and presentations at chip conferences. So people can make diagrams like in https://www.realworldtech.com/haswell-cpu/4/ and https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Individual_Core — Peter Cordes, Feb 10 '23 at 04:55
*Modern Intel processors are not superscalar* - What? Superscalar means the pipeline is wider than 1 instruction wide. All Intel CPUs since P5 Pentium have been superscalar. IDK what you think it means, perhaps in-order execution? There aren't any modern in-order Intel/AMD cores, not since first gen Xeon Phi (Knights Corner) and pre-Silvermont Atom. But all modern x86 CPUs are superscalar, able to sustain an IPC greater than 1 on some code. — Peter Cordes, Feb 10 '23 at 04:57
*Alder Lake is able to dispatch 12 instructions per clock* - that's the number of back-end execution ports. The issue/rename stage is "only" 6 uops wide, so that's the most it can *sustain*. Having lots of different execution units means it can do a lot of whatever the code is currently doing, e.g. loads+stores *or* ALU, as well as catch up quickly after something slow finished and lots of instructions have operands ready, so it can free space in the RS and make room to issue new instructions. But we say it's 6-wide because that's the narrowest point (not counting 5-wide legacy decode). — Peter Cordes, Feb 10 '23 at 05:01
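The two limits discussed in this thread (how many ports can run a given uop, and how wide the issue/rename stage is) can be combined into a rough lower bound on cycle count. This is a back-of-the-envelope sketch, not a simulator: `min_cycles` and `ceil_div` are hypothetical helper names, the uops are assumed independent and single-port-class, and latency, dependencies and front-end effects are ignored.

```c
#include <assert.h>

/* Integer division rounding up. */
static unsigned ceil_div(unsigned a, unsigned b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Lower bound on the cycles needed to run `uops` independent uops when
   only `ports` execution ports accept them and the issue/rename stage
   handles at most `issue_width` uops per cycle. The narrower of the two
   limits decides. */
static unsigned min_cycles(unsigned uops, unsigned ports, unsigned issue_width) {
    unsigned port_bound  = ceil_div(uops, ports);
    unsigned issue_bound = ceil_div(uops, issue_width);
    return port_bound > issue_bound ? port_bound : issue_bound;
}
```

With the numbers from this thread, `min_cycles(100, 2, 6)` gives 50 (two shuffle-capable ports, matching a CPI of 0.5), while `min_cycles(120, 12, 6)` gives 20 rather than 10: even if all 12 ports could take the work, the 6-wide issue stage remains the bottleneck.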
