The linked question makes a distinction between throughput and latency: e.g. after a `dec eax`, how soon can another `dec eax` execute? It only needs the EAX value to be ready when it reaches the EXEC stage of a simple in-order pipeline. Keeping the latency of the execution unit itself down to 1 cycle is what enables back-to-back execution of dependent instructions.
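That throughput/latency distinction can be made concrete with a first-order timing model. This is a hedged sketch with illustrative numbers, not measurements of any real CPU; since `dec` itself has 1-cycle latency, the gap only shows up for a longer-latency op such as a multiply:

```python
import math

def dependent_cycles(n, latency):
    """Each op consumes the previous result, so nothing overlaps:
    total time is the length of the latency chain, n * latency."""
    return n * latency

def independent_cycles(n, latency, throughput=1):
    """Independent ops overlap in the pipeline: `throughput` ops can
    start per cycle, and the last one finishes `latency` cycles after
    it issues."""
    return math.ceil(n / throughput) - 1 + latency

# Illustrative op: 3-cycle latency, 1-per-cycle throughput.
print(dependent_cycles(8, 3))    # 24: a dependent chain is latency-bound
print(independent_cycles(8, 3))  # 10: independent ops are throughput-bound
```

With 1-cycle latency (the `dec eax` case) the two formulas coincide, which is exactly why back-to-back dependent `dec`s cost no more than independent ones.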
Total latency of the pipeline from fetch to exec only matters for mispredicted branches.
Having multiple instructions in the pipeline is the entire point of pipelining; you wouldn't call it a pipeline if you were going to require one instruction to make it all the way through the pipeline before you started fetching another one.
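The point above can be sketched with a toy stage-occupancy model of a classic 5-stage pipeline (IF ID EX MEM WB). This is a hypothetical model, not any real CPU: it assumes full forwarding and no stalls, so one new instruction enters IF per cycle and a dependent instruction enters EX one cycle after its producer:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_insns):
    """Cycle at which each instruction occupies each stage;
    instruction i enters IF at cycle i (one fetch per cycle)."""
    return [{stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_insns)]

for i, t in enumerate(schedule(3)):   # three dependent `dec eax`s
    print(f"insn {i}: " + "  ".join(f"{st}@{c}" for st, c in t.items()))

# Fetch-to-exec latency is 3 cycles (IF@0 .. EX@2 for insn 0), yet the
# three EX cycles are 2, 3, 4 -- back to back, because forwarding hands
# EAX from insn i's EX result straight into insn i+1's EX.
```

Note how the total depth of the pipeline never appears in the EX-to-EX spacing; it only shows up when a mispredicted branch forces a refill from IF.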
See also https://en.wikipedia.org/wiki/Classic_RISC_pipeline and *Modern Microprocessors: A 90-Minute Guide!*.
Or keep reading your CS:APP textbook.
Also related, for modern CPUs like current x86 and high-end ARM (superscalar out-of-order):