It's a store, of course it doesn't have single-cycle latency for the data. It is a single uop for the front-end, but uops.info unfortunately shows back-end uop count, not fused-domain, in their table.
The numbers for `push` are very similar to the numbers for `mov` (m64, r64), including latency and uops, e.g. latency listed as `[≤2;≤10]` for SKX.
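Concretely, a `push` does the same architectural work as a stack-pointer update plus a store. A rough sketch (an illustration of the equivalence, not a literal uop-by-uop breakdown):

```asm
; `push rbx` is architecturally equivalent to:
sub rsp, 8          ; RSP update: tracked by the stack engine, effectively free
mov [rsp], rbx      ; the store: store-address + store-data uops,
                    ; micro-fused into one front-end slot on Intel
```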
> the CYCLE of push is only 1
This doesn't even make sense. The cost model for superscalar out-of-order CPUs isn't one-dimensional: you can't get a single number for each instruction and add them up to find a total cost. See @BeeOnRope's answer to *How many CPU cycles are needed for each assembly instruction?*
The common bottlenecks (other than memory and branch misses) are front-end throughput, back-end ports, and latency.
> and there is some special process in processors that are designed for push which makes it special
The effective latency for modifying the stack pointer is zero, thanks to the stack engine.
It's so special that https://uops.info/ doesn't even try to measure RSP->RSP latency the way they measure other instructions. Stack-sync uops would complicate that.
e.g. from the SKX latency results test details, you can see that they only tested latency from register input to reloading memory, never anything for the RSP operand itself except as part of chaining a reload of `[rsp]` back into a dependency chain for RSP for the next `push`.
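A latency test in that style presumably chains iterations together with something like the following (my reconstruction of the idea, not uops.info's actual test code):

```asm
; create a loop-carried dependency: the next push's store address
; depends on the value the previous push stored
push rsp            ; store the old RSP value to [new RSP]
mov  rsp, [rsp]     ; reload it back into RSP (operand 3 -> operand 2 chain),
                    ; so RSP is unchanged net per iteration
```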
> Operand 1 (r): Register (RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8, R9, R10, R11, R12, R13, R14, R15)
> Operand 2 (r/w, suppressed): Register (RSP)
> Operand 3 (w, suppressed): Memory
> Latency operand 1 → 3: ≤2
> Latency operand 3 → 3 (address, base register): ≤11
I already explained how to look at what's being measured on your last question, *What do multiple values or ranges means as the latency for a single instruction?*
And if you're looking at the uop counts for Intel CPUs, unfortunately https://uops.info/ shows unfused-domain uop counts in the table; you have to dig into the measurements page (e.g. throughput for SKX: https://www.uops.info/html-tp/SKX/PUSH_R64-Measurements.html) to see `RETIRE_SLOTS: 1.04`. In the front-end it's a single-uop micro-fused store, just like `mov [rsp], rbx`, which is also 2 back-end uops.
But `push` measures at just over 1 when tested with just a big block of `push r8` instructions. The `.04` is the amortized cost of stack-sync uops when the stack engine's offset counter overflows. See *What is the stack engine in the Sandybridge microarchitecture?* (this is the "special mechanism" you referred to).
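The throughput test behind that 1.04 is presumably just a long unrolled run of the instruction under test, something like this sketch (assuming RSP points into a large scratch buffer and gets reset outside the timed region):

```asm
; unrolled block of independent pushes; the stack engine absorbs every
; RSP update until its internal offset counter saturates, at which point
; a stack-sync uop is inserted -- the amortized .04 extra retire slots per push
push r8
push r8
push r8
push r8
; ... many more repetitions
```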
Read Agner Fog's microarch guide to get some background details that will help you make sense of the tables.
AMD CPUs don't call it "micro-fusion", they just always keep the store-address and store-data part of a store together as 1 uop in the front-end. That's why uops.info lists it as 1 uop for AMD, even though it's not really different from how Intel handles `push`.