The linked question makes a distinction between throughput and latency: e.g. after a `dec eax`, how soon can another `dec eax` execute? It only needs the EAX value to be ready when it reaches the EXEC stage of a simple in-order pipeline. Keeping the latency of the execution unit itself down to 1 cycle is what enables back-to-back execution of dependent instructions.
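That throughput/latency distinction can be made concrete with a first-order timing model. This is a hedged sketch with illustrative numbers, not measurements of any real CPU; since `dec` itself has 1-cycle latency, the gap only shows up for a longer-latency op such as a multiply:

```python
import math

def dependent_cycles(n, latency):
    """Each op consumes the previous result, so nothing overlaps:
    total time is the length of the latency chain, n * latency."""
    return n * latency

def independent_cycles(n, latency, throughput=1):
    """Independent ops overlap in the pipeline: `throughput` ops can
    start per cycle, and the last one finishes `latency` cycles after
    it issues."""
    return math.ceil(n / throughput) - 1 + latency

# Illustrative op: 3-cycle latency, 1-per-cycle throughput.
print(dependent_cycles(8, 3))    # 24: a dependent chain is latency-bound
print(independent_cycles(8, 3))  # 10: independent ops are throughput-bound
```

With 1-cycle latency (the `dec eax` case) the two formulas coincide, which is exactly why back-to-back dependent `dec`s cost no more than independent ones.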
Total latency of the pipeline from fetch to exec only matters for mispredicted branches.
Having multiple instructions in the pipeline is the entire point of pipelining; you wouldn't call it a pipeline if you were going to require one instruction to make it all the way through the pipeline before you started fetching another one.
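The point above can be sketched with a toy stage-occupancy model of a classic 5-stage pipeline (IF ID EX MEM WB). This is a hypothetical model, not any real CPU: it assumes full forwarding and no stalls, so one new instruction enters IF per cycle and a dependent instruction enters EX one cycle after its producer:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_insns):
    """Cycle at which each instruction occupies each stage;
    instruction i enters IF at cycle i (one fetch per cycle)."""
    return [{stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_insns)]

for i, t in enumerate(schedule(3)):   # three dependent `dec eax`s
    print(f"insn {i}: " + "  ".join(f"{st}@{c}" for st, c in t.items()))

# Fetch-to-exec latency is 3 cycles (IF@0 .. EX@2 for insn 0), yet the
# three EX cycles are 2, 3, 4 -- back to back, because forwarding hands
# EAX from insn i's EX result straight into insn i+1's EX.
```

Note how the total depth of the pipeline never appears in the EX-to-EX spacing; it only shows up when a mispredicted branch forces a refill from IF.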
See also https://en.wikipedia.org/wiki/Classic_RISC_pipeline and *Modern Microprocessors: A 90-Minute Guide!*.
Or keep reading your CS:APP textbook.
Also related, for modern CPUs like current x86 and high-end ARM (superscalar out-of-order):