It's a store, of course it doesn't have single-cycle latency for the data. It is a single uop for the front-end, but uops.info unfortunately shows back-end uop count, not fused-domain, in their table.
The numbers for `push` are very similar to the numbers for `mov` (m64, r64), including latency and uops, e.g. latency listed as `[≤2;≤10]` for SKX.
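Concretely, a `push` does the same architectural work as a stack-pointer update plus a store. A rough sketch (an illustration of the equivalence, not a literal uop-by-uop breakdown):

```asm
; `push rbx` is architecturally equivalent to:
sub rsp, 8          ; RSP update: tracked by the stack engine, effectively free
mov [rsp], rbx      ; the store: store-address + store-data uops,
                    ; micro-fused into one front-end slot on Intel
```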
> the CYCLE of push is only 1
This doesn't even make sense. The cost model for superscalar out-of-order CPUs isn't one-dimensional: you can't get a single number for each instruction and add them up to find a total cost. See @BeeOnRope's answer to *How many CPU cycles are needed for each assembly instruction?*
The common bottlenecks (other than memory and branch misses) are front-end throughput, back-end ports, and latency.
> and there is some special process in processors that are designed for push which makes it special
The effective latency for modifying the stack pointer is zero, thanks to the stack engine.
It's so special that https://uops.info/ doesn't even try to measure RSP->RSP latency the way they measure other instructions. Stack-sync uops would complicate that.
e.g. from the SKX latency results test details, you can see that they only tested latency from register input to reloading memory, never anything for the RSP operand itself except as part of chaining a reload of `[rsp]` back into a dependency chain for RSP for the next `push`.
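A latency test in that style presumably chains iterations together with something like the following (my reconstruction of the idea, not uops.info's actual test code):

```asm
; create a loop-carried dependency: the next push's store address
; depends on the value the previous push stored
push rsp            ; store the old RSP value to [new RSP]
mov  rsp, [rsp]     ; reload it back into RSP (operand 3 -> operand 2 chain),
                    ; so RSP is unchanged net per iteration
```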
> Operand 1 (r): Register (RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8, R9, R10, R11, R12, R13, R14, R15)
> Operand 2 (r/w, suppressed): Register (RSP)
> Operand 3 (w, suppressed): Memory
> Latency operand 1 → 3: ≤2
> Latency operand 3 → 3 (address, base register): ≤11
I already explained how to look at what's being measured on your last question, *What do multiple values or ranges means as the latency for a single instruction?*
And if you're looking at the uop counts for Intel CPUs, unfortunately https://uops.info/ shows unfused-domain uop counts in the table; you have to dig into the measurements page (e.g. throughput for SKX: https://www.uops.info/html-tp/SKX/PUSH_R64-Measurements.html) to see `RETIRE_SLOTS: 1.04`. In the front-end it's a single-uop micro-fused store, just like `mov [rsp], rbx`, which is also 2 back-end uops.
But `push` measures at just over 1 when tested with just a big block of `push r8` instructions. The `.04` is the amortized cost of stack-sync uops when the stack engine's offset counter overflows. See *What is the stack engine in the Sandybridge microarchitecture?* (this is the "special mechanism" you referred to).
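The throughput test behind that 1.04 is presumably just a long unrolled run of the instruction under test, something like this sketch (assuming RSP points into a large scratch buffer and gets reset outside the timed region):

```asm
; unrolled block of independent pushes; the stack engine absorbs every
; RSP update until its internal offset counter saturates, at which point
; a stack-sync uop is inserted -- the amortized .04 extra retire slots per push
push r8
push r8
push r8
push r8
; ... many more repetitions
```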
Read Agner Fog's microarch guide to get some background details that will help you make sense of the tables.
AMD CPUs don't call it "micro-fusion", they just always keep the store-address and store-data part of a store together as 1 uop in the front-end. That's why uops.info lists it as 1 uop for AMD, even though it's not really different from how Intel handles `push`.