
There is an issue I'm having with understanding a "feature" a super-pipelined CPU has. We learned that it is possible to make programs run "faster" by reordering instructions.

One example: add the value 1000 to register 1, then subtract the value of register 1 from register 2, then increment register 9.

By placing the inc between the add and sub instructions, the CPU is able to execute it while it is still working on the sub instruction (in parallel).

How is that possible if all of these instructions use the ALU? I could understand it for MOV (or any instruction that doesn't use the ALU).

Thanks in advance.

Peter Cordes
Lostdawn
    The CPU must be superscalar, it must be able to execute more than one instruction per clock cycle. Either by having more than one Execution Unit (the ALU is just one of these EU) or by pipelining the ALU itself so that while an instruction is being executed another one can start. By executing out of order it's possible to start executing later instructions earlier thereby reducing the total latency. – Margaret Bloom Feb 02 '23 at 08:33
  • So using a super-pipeline, the instructions I wrote above will not be executed "in parallel"? I can run instructions like MOV while queuing INC instructions, but nothing that uses the ALU. – Lostdawn Feb 02 '23 at 08:54
  • Since register 9 has nothing to do with the other two it should be executed in parallel, I would imagine. My understanding of this concept is a bit limited, but from what I gather, the CPU is able to execute instructions in a different order than written as long as doing so won't affect the outcome. – puppydrum64 Feb 02 '23 at 10:45
  • Suppose a division takes 20 cycles. If you have two independent divisions back to back, a non-superscalar CPU or a CPU with non-pipelined ALU/division will need 20+20=40 cycles. Even if the CPU sees that the second division could start in "parallel" with the first, there's no hardware to execute it. If the CPU has a fully pipelined division, then executing both divisions will take 20+1 = 21 cycles (usually a bit more). If the CPU has two division units, it will take 20 cycles to execute both (granted it can execute out-of-order). – Margaret Bloom Feb 02 '23 at 12:49
  • Why is this tagged x86? There hasn't been an in-order x86 microarchitecture for about a decade (e.g. Xeon Phi or the last in-order Atom). Out-of-order exec CPUs will find instruction-level parallelism for you (e.g. across loop iterations) without needing to schedule instructions on a small scale, just worry about the dependency chains. [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) / [What considerations go into predicting latency for operations on modern superscalar processors](https://stackoverflow.com/q/51607391) – Peter Cordes Feb 02 '23 at 15:28
  • See [Modern Microprocessors A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/) for the basics, like superscalar in-order where it matters what order you put instructions in on a small scale. Such CPUs of course have multiple ALUs. Since you're talking about x86, P5 Pentium is dual-issue in-order, with a U pipe and V pipe; some instructions can run in either pipe, and thus can pair with each other. See https://agner.org/optimize/ – Peter Cordes Feb 02 '23 at 15:30

1 Answer


Let's take your example

I1: add r1, r1, 1000
I2: sub r2, r2, r1
I3: add r9, r9, 1

and suppose we have a single pipelined ALU, say, with x > 1 stages. Each instruction in the above sequence takes x cycles from issue to writeback, since they all use the ALU.

First consider how this code would run without reordering. I1 is issued at cycle 1. There is a read-after-write (RAW) dependence of I2 on I1 via r1, so I2 has to wait for I1 to complete execution before it can even issue, which happens at cycle x + 1. Since there is no dependence of I3 on the previous instructions, it can issue the cycle after I2 is issued, at cycle x + 2. I3 completes after cycle 2x + 1.

Now consider what would happen if the CPU can execute out-of-order. The CPU issues any instruction whose inputs and output are all different from the outputs of all executing instructions (the restriction that the output must differ from outputs of executing instructions can be relaxed via techniques such as register renaming). Here, the CPU begins by issuing I1 on cycle 1. On cycle 2, it attempts to issue I2, but cannot do so because of the RAW dependence, but it can issue I3, which it does. I1 is completed after cycle x, so I2 can be issued at cycle x + 1. I3 completes after cycle x + 1. I2 completes after cycle 2x.
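The two timelines above can be checked with a small issue-cycle simulator. This is only a sketch of the answer's simplified model (a single pipelined ALU, at most one issue per cycle, latency x, no register renaming, no operand forwarding); the function name and register encoding are made up for illustration:

```python
def simulate(program, x, out_of_order=False):
    """Return (issue, complete) cycle lists for each instruction.

    program: list of (dest, srcs) register tuples.
    Model: single x-stage pipelined ALU, at most one issue per cycle.
    RAW rule: an instruction may issue at cycle c only if every source's
    producing instruction completed at cycle c - 1 or earlier.
    """
    n = len(program)
    # For each instruction, find the latest earlier writer of each source.
    deps = []
    for i, (_, srcs) in enumerate(program):
        d = set()
        for r in srcs:
            for j in range(i - 1, -1, -1):
                if program[j][0] == r:
                    d.add(j)
                    break
        deps.append(d)

    issue, complete = [0] * n, [0] * n
    pending = list(range(n))
    cycle = 0
    while pending:
        cycle += 1
        # In-order: only the oldest pending instruction may issue.
        window = pending if out_of_order else pending[:1]
        for i in window:
            if all(complete[j] and complete[j] < cycle for j in deps[i]):
                issue[i], complete[i] = cycle, cycle + x - 1
                pending.remove(i)
                break  # single ALU: at most one issue per cycle
    return issue, complete

PROG = [("r1", ["r1"]),        # I1: add r1, r1, 1000
        ("r2", ["r2", "r1"]),  # I2: sub r2, r2, r1
        ("r9", ["r9"])]        # I3: add r9, r9, 1

iss, done = simulate(PROG, x=3)
# issue cycles [1, 4, 5]; last completion 7 = 2x + 1
iss, done = simulate(PROG, x=3, out_of_order=True)
# issue cycles [1, 4, 2]; last completion 6 = 2x
```

For x = 3 the simulator reproduces the cycle counts derived above: 2x + 1 = 7 in order, 2x = 6 out of order.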

So, 1 cycle was saved by executing I3 ahead of I2, which serves to hide the latency due to the RAW dependence. If x is large, then this savings is not very much. But this is really because our program is small. Imagine there were another instruction after I3, as in

I1: add r1, r1, 1000
I2: sub r2, r2, r1
I3: add r9, r9, 1
I4: add r4, r4, 4

You can show that without reordering, the CPU as described would take 2x + 2 cycles to run this code. With reordering, it would still take just 2x cycles. We now save 2 cycles, since we were able to continue issuing independent instructions to hide the latency between I1 and I2. Obviously, the savings grow the more independent instructions we have.
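Under the same simplified model, the completion time for this pattern (I1, a RAW-dependent I2, then k independent ALU instructions) can be written in closed form; the helper name is made up for illustration:

```python
def total_cycles(x, k, out_of_order):
    """Completion cycle of the last instruction for the pattern
    I1; I2 (RAW-dependent on I1); then k independent ALU instructions,
    on a single x-stage pipelined ALU issuing one instruction per cycle."""
    if out_of_order:
        # The k independents must fit in the stall bubble between I1 and I2.
        assert k <= x - 1, "formula assumes the independents fit in the bubble"
        # I1 issues at cycle 1; the independents fill cycles 2..k+1;
        # I2 issues at cycle x + 1 and finishes at 2x.
        i2_done = 2 * x
        last_indep_done = (k + 1) + x - 1
        return max(i2_done, last_indep_done)
    # In order: I1 at cycle 1, I2 stalls until x + 1,
    # the independents follow at x + 2, ..., x + 1 + k.
    return (x + 1 + k) + x - 1

for k in (1, 2, 3):
    print(k, total_cycles(4, k, False), total_cycles(4, k, True))
```

For k = 1 this gives 2x + 1 versus 2x (the first example); for k = 2 it gives 2x + 2 versus 2x (this example); in general the out-of-order machine saves k cycles as long as the independent instructions fit in the stall bubble.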

Note

These arguments do not require that the CPU implement register renaming or be superscalar (defined as the ability to issue multiple instructions per cycle). We just require that the execution units are pipelined or are duplicated, which can be shown to be organizationally equivalent.

K. Jiang
  • Some sources define Superscalar as the ability to execute multiple instructions per cycle. This can be achieved with a single pipelined execution unit. Note that a pipelined architecture does not imply a pipelined EU/ALU. Other sources define Superscalar as the property of having multiple EU and the ability to dispatch two or more instructions per cycle. The first definition is more general and I think it captures the concept best. – Margaret Bloom Feb 02 '23 at 16:20
  • Interesting, which sources define superscalar that way (only requiring multiple instructions executing in each cycle)? Under that definition, any machine with a pipelined ALU would be considered superscalar, but to my knowledge, pipelining and superscalar are two independent techniques to exploit instruction-level parallelism. See, e.g. Smith and Sohi IEEE 1995 83(12) – K. Jiang Feb 02 '23 at 16:50
  • On Google there are a lot of different definitions of superscalar, hard to say who is authoritative. I won't think too much about it, definitions used in processor design are not always universal (eg: issue, dispatch, throughput) – Margaret Bloom Feb 02 '23 at 17:16
  • `x` (latency of simple ALU instructions) is usually 1 cycle, since real-world CPUs normally implement bypass forwarding. (https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Data_hazards / https://en.wikipedia.org/wiki/Operand_forwarding). Without that, CPUs would spend a lot of time stalling on typical code, and be very hard to schedule for: examples like [Placing NOPs in order to ensure no RAW Data hazard in MIPS assembly](https://stackoverflow.com/q/74912794) show how bad things are without bypass forwarding. – Peter Cordes Feb 02 '23 at 17:35