Why do some instructions take fewer cycles when all go through the same pipeline stages?

Question

In a processor's instruction manual, some instructions are listed as taking fewer cycles than others. The processor has an n-stage pipeline and all instructions go through the same pipeline, so shouldn't they all take n cycles, since each stage takes 1 cycle to complete? Is it because some instructions start in the middle of the pipeline and/or can skip a few stages?

Are you talking about a modern out-of-order CPU like Skylake or Ryzen? Or an in-order MIPS or ARM with a multiply instruction? Or high-latency loads? Instructions start in order, but after dispatching to different execution units, they can take different numbers of cycles to complete. — Peter Cordes, May 19 '18 at 05:54
Also, if you're asking about latency, then the time you list for an instruction is from when its inputs are ready to when a dependent instruction can use its outputs. That's not the full length of the pipeline, just the relevant execution unit. — Peter Cordes, May 19 '18 at 05:55
@Peter. For your first comment, let's leave out-of-order execution aside to keep it simple. I agree that each instruction starts at the FETCH stage, but how does it skip non-relevant stages in the pipeline? — sanjivgupta, May 19 '18 at 06:04
@Peter. For your second comment, what do the later stages do for an instruction when its result was already available for consumption earlier? — sanjivgupta, May 19 '18 at 06:06
From the Intel manual: "An instruction with a throughput of 2 clocks would tie up its execution unit for that many cycles which prevents an instruction needing that execution unit from being executed." That means not every stage of an instruction finishes in one cycle. — sanjivgupta, May 19 '18 at 06:11
If you're looking at Intel's optimization manual, then you *must* consider out-of-order execution for anything to make sense. Instructions (or uops) don't simply move through the pipeline at a fixed pace, they sit in the scheduler (RS) until dispatched to an execution unit, and then the result comes out the other end of the execution unit N cycles later, where N = latency for that operation. See http://agner.org/optimize/. See also [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](https://stackoverflow.com/q/45113527) for more about register renaming. — Peter Cordes, May 19 '18 at 06:18
Re: stages after execute: in a classic in-order RISC pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline), the write-back stage comes after execute: results are written back to the register file. This is separate from bypass forwarding them to a later instruction that needs them right away. Memory stores may also go beyond exec, although normally you'd have a store buffer to decouple stores from the rest of the pipeline. In an OoO CPU, you'd have retirement (when instructions that are known to be non-speculative can commit). — Peter Cordes, May 19 '18 at 06:23
Thanks Peter. I guess I lacked a basic understanding of pipeline execution. Now that things are a little clearer, I guess this is how it works: every instruction will probably go through a fetch and a decode stage. At the decode stage it becomes clear which other units (or stages) of the pipeline this instruction will use, and since each instruction uses a different number of units (and maybe some units repeatedly, as mentioned in the definition of throughput), the number of cycles required for each instruction may vary. — sanjivgupta, May 19 '18 at 15:57
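
The distinction drawn in the comments between latency (cycles from issue until a dependent instruction can use the result) and throughput (how long the execution unit stays tied up) can be illustrated with a toy model. The sketch below is a simplified in-order, single-issue scheduler with result forwarding. It is not a model of any real CPU: the `Op` type, the unit names, and all latency and occupancy numbers are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    dest: str        # destination register
    srcs: tuple      # source registers
    unit: str        # execution unit this op needs (hypothetical names)
    latency: int     # cycles from issue until a dependent op can use the result
    occupancy: int   # cycles the unit is tied up (reciprocal throughput)


def schedule(ops):
    """Toy in-order, single-issue model with forwarding.

    Returns a list of (name, issue_cycle, result_ready_cycle).
    """
    reg_ready = {}   # register -> cycle at which its value can be forwarded
    unit_free = {}   # unit -> first cycle at which it accepts a new op
    next_issue = 0   # in-order: an op cannot issue before its predecessor
    timeline = []
    for op in ops:
        start = next_issue
        for src in op.srcs:
            # Stall until every input is available.
            start = max(start, reg_ready.get(src, 0))
        # Stall while the required execution unit is still busy.
        start = max(start, unit_free.get(op.unit, 0))
        ready = start + op.latency
        reg_ready[op.dest] = ready
        unit_free[op.unit] = start + op.occupancy
        next_issue = start + 1
        timeline.append((op.name, start, ready))
    return timeline


# Hypothetical numbers: a 1-cycle ALU, a pipelined 3-cycle multiplier,
# and a non-pipelined 10-cycle divider.
program = [
    Op("add1", "r1", ("r0",), "alu", latency=1, occupancy=1),
    Op("mul1", "r2", ("r1",), "mul", latency=3, occupancy=1),
    Op("add2", "r3", ("r2",), "alu", latency=1, occupancy=1),
    Op("div1", "r4", ("r0",), "div", latency=10, occupancy=10),
    Op("div2", "r5", ("r0",), "div", latency=10, occupancy=10),
]

timeline = schedule(program)
for name, issue, ready in timeline:
    print(f"{name}: issued at cycle {issue}, result ready at cycle {ready}")
```

In this model `add2` waits for the multiplier's 3-cycle latency because it depends on `r2`, while `div2` has no data dependency on `div1` but still waits because the divider is occupied for 10 cycles. Fetch, decode, and write-back are deliberately left out, since they add the same fixed cost to every instruction and do not affect the per-instruction numbers a manual lists.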

0 Answers