The performance of the pipeline depends on instruction latency. Which microprocessor has the best performance in the pipeline? (Intel) How will pipelining technology evolve in the future?
1 Answer
Most CPUs have 1-cycle latency for common instructions like add. Intel CPUs have 1-cycle latency for integer SIMD instructions like paddd xmm,xmm. Latency for FP math is higher, like 4 cycles for mul/add/FMA on Skylake.
Integer multiply and other things (e.g. popcnt, lzcnt) are 3-cycle latency on Intel, and run on the execution unit on port 1. (The only GP-integer ALU that can run multi-cycle latency uops.) Sandybridge-family standardized latencies to simplify the scheduler (easier to avoid write-back conflicts) so there are no 2-cycle latency uops. (Some 2-uop instructions have 2c latency, 1 per uop.)
Intel since IvyBridge can run mov reg,reg and movdqa xmm,xmm instructions with 0 cycle latency, handling them in the register-rename stage without needing a back-end uop. See Can x86's MOV really be "free"? Why can't I reproduce this at all? AMD since Bulldozer can do the same thing for XMM regs, and AMD since Zen can do that for integer registers, too.
See also https://agner.org/optimize/ for instruction tables (latency, front-end uop cost, and back-end ports), and a microarch guide to understand what those numbers mean.
Other than zero latency, Intel Pentium 4 (before Prescott) with its double-pumped ALUs is the only x86 CPU ever to have instructions with less than 1 cycle latency. It can do two dependent add instructions in the same clock cycle; the ALU latency is 0.5 cycles. I don't know if any non-x86 microarchitectures have ever done that; I know some have used narrower ALUs, but usually those were not high-performance designs.
64-bit P4 (Prescott / Nocona) dropped that; the ALUs are still double-pumped for throughput, but can't do 2 dependent adds in the same cycle. (Was there a P4 model with double-pumped 64-bit operations?) Agner Fog shows add latency as 1 cycle.
Unfortunately the rest of P4 is full of bottlenecks and performance pitfalls / "glass jaw" effects, so real-world performance is much lower than modern CPUs like Sandybridge-family or Zen. IDK how much benefit half-cycle integer ALU latency would give in a modern CPU. (It would probably be problematic to implement for 64-bit integers; even P4 Nocona didn't do it. But it would be interesting to consider.)
Often out-of-order execution can hide latency by overlapping independent work. Compilers that make code which tries to keep critical paths short can help.
Which microprocessor has the best performance in the pipeline?
That's a very different question and much broader. Consult benchmarks like SPECint and SPECfp for performance on real-world workloads (although those measure the whole system, including memory, not just the pipeline).
How will Pipelining technology evolve in the future?
Wider and more heavily out-of-order to extract ILP over a larger window.
It's unlikely that pipelines will get a lot longer; P4 went down that road to the point where branch mispredict cost was way too high.
But it's also unlikely that multi-cycle latency instructions will get much lower latency. A multiply is more complex than an add, and floating-point is complex. Making those ALUs lower latency in clock cycles would limit clock speed, because one of the stages in those ALUs would be on the critical path for propagation delay.
(The longest single pipeline stage anywhere in the CPU, measured in gate-delays or nanoseconds, sets your max clock speed => minimum cycle time.)
Some software can take advantage of thread-level parallelism. CPUs can and do exploit that with hyperthreading (SMT) to keep execution units fed with work when running multiple latency-bound threads on the same physical core. Notably Xeon Phi (KNL) has higher latencies for vector instructions than Skylake, and depends on 4-wide SMT for good performance in code that doesn't have enough ILP in a single instruction stream.
