
As is well known, Intel x86_64 processors are not only pipelined but also superscalar.

This means that the CPU can:

  1. Pipelining - in one clock cycle, execute different stages of several instances of the same operation. For example, two ADDs with their stages offset:

    • ADD(stage1) -> ADD(stage2) -> nothing
    • nothing -> ADD(stage1) -> ADD(stage2)
  2. Superscalar execution - in one clock cycle, execute several different operations. For example, an ADD and a MUL in parallel, in the same stages:

    • ADD(stage1) -> ADD(stage2)
    • MUL(stage1) -> MUL(stage2)

[Image: block diagram of an Intel Core microarchitecture pipeline, showing the decoders and the execution ports]

This is possible due to the fact that the processor has several instruction schedulers (Intel Core has 4 simple decoders).

But is it only the schedulers (the 4 simple decoders) that are duplicated, or are the arithmetic units duplicated as well?

I.e. can we execute, for example, two ADDs in the same stages, but on independent arithmetic units (for example, the ALU on Port 0 and the ALU on Port 1) of the same CPU core?

  • ADD1(stage1) -> ADD1(stage2)
  • ADD2(stage1) -> ADD2(stage2)

Are any of the execution units duplicated, making it possible to execute two of the same instruction in the same clock cycle?
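
For example, two data-independent ADD chains in C would look like this (a minimal sketch; which port each add actually goes to is decided by the hardware scheduler at run time, so the port numbers in the comments are only an assumption):

```c
#include <stdio.h>

int main(void)
{
    unsigned a = 0, b = 0;

    /* The two += chains are data-independent: neither result feeds the
     * other, so a superscalar core is free to issue one add from each
     * chain to a different ALU (e.g. Port 0 and Port 1) in the same
     * clock cycle. (An optimizing compiler may fold this loop away
     * entirely; it is only meant to show the dependency shape.) */
    for (int i = 0; i < 1000000; ++i) {
        a += 1;
        b += 2;
    }

    printf("a = %u, b = %u\n", a, b);
    return 0;
}
```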

Alex
  • Provided that two operations are data-independent, there is no restriction. Some interesting details here (slide 9): https://www.cis.upenn.edu/~milom/cis371-Spring09/lectures/06_superscalar.pdf – Jean-Baptiste Yunès Jan 27 '15 at 15:35
  • @Jean-Baptiste Yunès Thank you! Can you please clarify: does the simple ILP (instruction-level parallelism) example at the link gain its performance from the **pipeline**, from the **superscalar** architecture, or from both? http://stackoverflow.com/a/27748348/1558037 – Alex Jan 27 '15 at 15:47
  • `optimized` is optimized essentially by superscalar execution, because there are 4 flows of independent computations. The pipeline can optimize one flow of instructions by decomposing each instruction into elementary operations. Remember that executing an instruction can be decomposed (roughly) as: fetching, decoding, loading data, computing, storing results. So you can do the arithmetic of instruction *i* while decoding instruction *i+1* while fetching instruction *i+2*, for example... This accelerates throughput, not individual instructions. – Jean-Baptiste Yunès Jan 27 '15 at 16:13
  • @Jean-Baptiste Yunès Thank you very much! I.e. the example `for(i = 0; i < arr_size; ++i) result += arr[i];` is already optimal for the pipeline and can't be optimized any further for it, right? (While processing iteration `i`, the pipeline, using branch prediction, already loads iteration `i+1`, and has no downtime even though everything is added to the same register `result`.) (See the multi-accumulator sketch after this comment thread.) – Alex Jan 27 '15 at 16:24
  • Decoders are not schedulers. They just decode the data and send it to the out-of-order machine (starting around the RAT/ROB). Scheduling is the action done by the Reservation Station (RS) in order to pick the ready operations and send them to execution. This is completely decoupled from the decode (i.e. - it's not done after a fixed amount of cycles) - it's not a single long pipeline. – Leeor Jan 27 '15 at 16:38
  • @Leeor Thank you for the correction. But how many operations can the processor execute at the same time (in the same clock cycle)? 4, equal to the number of decoders? – Alex Jan 27 '15 at 17:12
  • No, that depends on the type of operations, and on the exact processor microarchitecture. A SandyBridge is different than a Haswell, and so on. The diagram you show (which looks like a Merom?) shows ports 0, 1 and 5 support ALU operations, hence you can execute 3 per cycle. Perhaps with some cleverness you can also use the memory address ports for simple arithmetic operations with a `LEA`, which adds some extra bandwidth. – Leeor Jan 27 '15 at 18:27
  • @Leeor Thank you very much! I.e. I can execute 3 ALU operations (Ports 0/1/5) + 3 memory operations (Ports 2/3/4) together per cycle? And the fact that there are 4 decoders would not limit me to using only 4 ports per cycle? – Alex Jan 27 '15 at 18:47
  • The memory ports are reserved for specific operations - you have one for loads and effectively one for stores (since the store needs both address and data ports for every single store). Also note that the decoder width will limit your sustainable bandwidth to the 4 instructions per cycle it can emit (in the best case; no one guarantees all 4 can work every cycle), so it's not easy to saturate your ALU ports' potential. Modern CPUs don't have to feed off the decoders though (Sandy Bridge added a decoded-uop cache). – Leeor Jan 27 '15 at 19:01
  • See [this table](http://www.agner.org/optimize/instruction_tables.pdf) for details such as which ports every instruction can go to on specific microarchitectures. Btw LEA goes to an ALU port. – harold Jan 27 '15 at 19:37
  • @Leeor, due to micro-op and macro-op fusion more than 4 instructions can be executed per clock cycle. I'm not sure that disagrees with what you said. I got five instructions (and six micro-ops) per clock cycle in this [example](http://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62). I think it's more precise to say that only 4 fused micro-ops can execute per clock cycle. – Z boson Jan 28 '15 at 03:19
  • @Zboson, I didn't say they can't, only that 4 instructions can be *decoded* per cycle at best; even if you execute more due to fusion, you'll just run out of instructions (unless they come from other sources, e.g. the microcode ROM). The point is - you can have code that is bottlenecked on decode BW, and code that is bottlenecked on execution BW (and of course other possible bottlenecks) - the designs need to be generic enough to support either case, but it's quite hard to find an example that actually saturates both. – Leeor Jan 28 '15 at 07:01
  • @Jean-Baptiste Yunès You said earlier that "`optimized` is optimized essentially by superscalar execution, because there are 4 flows of independent computations.", but it doesn't demonstrate superscalar execution, because there is only one floating-point adder, does it? http://stackoverflow.com/questions/27748020/is-there-a-really-working-example-which-showing-the-benefits-of-ilpinstruction/27748348#27748348 – Alex Jan 28 '15 at 19:47
  • Well, it can be tricky. If there is only one adder, you can still use its full pipeline, since there is no dependency; each operation can be shifted by one stage. This is why he claimed that an ADD takes 1 cycle. – Jean-Baptiste Yunès Jan 28 '15 at 21:15
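
To illustrate the loop discussed in the comments above: a minimal C contrast between the single-accumulator reduction, whose loop-carried dependency chain serializes the adds, and a hypothetical four-accumulator version that exposes four independent chains to pipelined/superscalar FP hardware. Note that reassociating floating-point adds can change the rounded result slightly, and `sum4` assumes `n` is a multiple of 4 to keep the sketch short:

```c
#include <stddef.h>

/* One accumulator: each add needs the previous add's result, so
 * throughput is limited to one add per FP-add latency, no matter
 * how many adders or pipeline stages the core has. */
float sum1(const float *arr, size_t n)
{
    float result = 0.0f;
    for (size_t i = 0; i < n; ++i)
        result += arr[i];
    return result;
}

/* Four accumulators: four independent dependency chains can be in
 * flight at once, so even a single, fully pipelined FP adder can
 * start a new add every cycle. */
float sum4(const float *arr, size_t n)
{
    float r0 = 0.0f, r1 = 0.0f, r2 = 0.0f, r3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        r0 += arr[i];
        r1 += arr[i + 1];
        r2 += arr[i + 2];
        r3 += arr[i + 3];
    }
    return (r0 + r1) + (r2 + r3);
}
```

This also shows why the `optimized` version in the linked question can benefit even with only one floating-point adder: pipelining overlaps the independent chains, as the last comment explains.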

1 Answer


Yes. The question already contained the answer, as the comments explained. :P

(Just posting an answer to get this question off the unanswered list.)

I will add that Sandy Bridge and later Intel CPUs, with their uop cache, can come close to sustaining 4 uops per cycle in loops more often than previous CPUs could (when the frontend is the bottleneck, rather than data dependencies (latency) or execution-port contention (throughput)). This is especially helpful with longer-to-encode vector instructions, because the decoders can only handle 16B per cycle, which is often less than 4 uops.

See http://agner.org/optimize/, esp. the microarch doc, for details about instruction throughput out of the uop cache, and how uop cache-line boundaries can interfere with delivering the constant 4 uops per cycle the pipeline can handle. Small loops that fit in the loop buffer don't suffer from this potential bottleneck.

In reply to one of the comments: micro-fusion doesn't let you get more than 4 instructions per cycle to run. Only macro-fusion combines multiple instructions into a single uop. (micro-fusion does make it cheaper to use instructions with memory operands, but apparently only works with one-register addressing modes. This does increase IPC, and could bring the average above 4.)
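
As a sketch of how fusion changes the numbers (the assembly in the comment is hypothetical, typical compiler output, not taken from this answer): the loop below is four instructions per iteration but only about three fused-domain uops, because the memory-source add can micro-fuse and the compare-and-branch pair can macro-fuse:

```c
#include <stddef.h>

/* Hypothetical codegen for the loop body:
 *     add  eax, DWORD PTR [rdi]   ; load+add micro-fuse into one
 *                                 ; fused-domain uop (one-register
 *                                 ; addressing mode, so it stays fused)
 *     add  rdi, 4
 *     cmp  rdi, rsi
 *     jne  .loop                  ; cmp+jne macro-fuse into one uop
 * 4 instructions per iteration, ~3 fused-domain uops for the frontend. */
int sum(const int *arr, const int *end)
{
    int result = 0;
    while (arr != end)
        result += *arr++;
    return result;
}
```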

Peter Cordes