
I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.

In the Intel optimization guide, there's a picture describing the "super-scalar ports" of Intel cores.

Here's the PDF. The picture is on page 40.

Here's a picture of the relevant graphic.

Here's another picture from page 78, this picture describes "Store Address" and "Store Data":

  1. Prepares the store forwarding and store retirement logic with the address of the data being stored.

  2. Prepares the store forwarding and store retirement logic with the data being stored.

Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.

It seems "natural" to me that store-forwarding would be done on the address of the data. But I can't understand when store-forwarding on the data (aka STD / Port 4) would ever be done. Are there any assembly / optimization experts out there who can help me understand exactly what the difference between STD and STA is?

Peter Cordes
Dragontamer5788
  • I think I would at least ask this question on the electronics stack exchange site (http://electronics.stackexchange.com/) because I think programmers are not that deep into CPU internals... – Martin Rosenau Dec 07 '17 at 20:41
  • ... however I **guess** that in the case of the following instructions: `add eax, [rbp]`, `mov ebx, [rsi]`, `mov [rdi], ecx` the values of `rbp` and `rsi` would be written to ports 2 and 3 and the value of `rdi` would be written to port 7. The value of `ecx` would be written to port 4. The memory logic would return the memory content at `[rbp]` and `[rsi]` in ports 0 and 1. – Martin Rosenau Dec 07 '17 at 20:47
  • If I understand correctly ANY kind of memory access is done using ports 0 to 7. So any kind of memory access must be done using one of these ports - including the data to be written to the memory. ... however as I already said: I'm only guessing! – Martin Rosenau Dec 07 '17 at 20:50
  • I guess STA is the step that computes SIB addresses and checks whether any store forwarding or retirement needs to be done. STD would be the step that actually stores the data and forwards it if needed. – fuz Dec 07 '17 at 20:50
  • @fuz I think you're right. I will give this post a few more days to see if an expert can chime in. I might "self-answer" this question with essentially your comment if nothing comes about in a few days. – Dragontamer5788 Dec 07 '17 at 23:32

2 Answers


Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.

But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the tag wiki)
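For example (a sketch of the behavior described above; exact uop counts are per Agner Fog's tables and can vary by microarchitecture):

```asm
mov [rdi], eax        ; store-address + store-data micro-fuse: 1 fused-domain uop
mov [rdi+rsi*4], eax  ; indexed store: un-laminated to 2 fused-domain uops on
                      ; Sandybridge/IvyBridge, stays micro-fused on Haswell and later
```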


Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle

Ports 2 and 3 can also run load uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.

Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
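For example (a sketch of the port assignments described above):

```asm
mov [rdi+8], eax      ; simple addressing mode: store-address uop can run on p2, p3, or p7
mov [rdi+rcx*4], eax  ; indexed addressing mode: store-address uop only on p2/p3,
                      ; competing with load uops for those AGUs
```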


On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.

On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.

Haswell added the dedicated store-AGU on port7 and widened the load/store execution units to 256b, because there aren't spare cycles when the load ports don't need their AGUs if there's a steady supply of loads.


A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready, lets later loads (in program order) detect whether they overlap the store or not.

Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).

As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.

Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.

That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
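A loop shaped like the following illustrates where the uops go when that peak is achievable; this is a hypothetical sketch of the port assignments, not a measured benchmark:

```asm
; sum two arrays into a third: 2 loads + 1 store per iteration,
; all with simple [reg] addressing modes
.loop:
    vmovups ymm0, [rsi]          ; load: p2 or p3
    vpaddd  ymm0, ymm0, [rdx]    ; load micro-fused with the ALU uop
    vmovups [rdi], ymm0          ; store-data on p4; store-address can use p7
    add     rsi, 32
    add     rdx, 32
    add     rdi, 32
    dec     ecx
    jnz     .loop
```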

One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.

I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.

A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.

Store/reload can happen when a function spills some registers before calling a function, or as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or passing something by reference to a non-inline function. Or in a histogram: if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
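The histogram case as a sketch (hypothetical code; register assignments are arbitrary):

```asm
; histogram inner loop: a memory-destination increment per input byte
.loop:
    movzx  eax, byte [rsi]      ; load the next input byte
    inc    dword [rdi+rax*4]    ; load + add + store of the bin count; if the same
                                ; bin repeats, the reload is served by store
                                ; forwarding from the store buffer, not L1D
    inc    rsi
    dec    ecx
    jnz    .loop
```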

Peter Cordes
  • Related: [Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?](https://stackoverflow.com/q/54084992) : stores and loads to the same cache line can conflict on CPUs before Skylake, leading to the scheduler thinking loads will be ready sooner than they are and having to replay the load uops from the RS. – Peter Cordes Jan 21 '19 at 19:03
  • There is one thing that confuses me: AGU **only** generates the memory access address without doing absolutely anything else? Or also works as a load / store address functional unit? In your answer I have understood that also works as a Load/store address functional unit but in the Haswell individual core diagram we can see that we have an AGU which is not connected to MOB: https://en.wikichip.org/wiki/File:haswell_block_diagram.svg – isma Jul 21 '20 at 19:56
  • And I want to add to my previous comment: What's the difference between the LEA execution unit and the AGU execution unit? Don't both of them do the same thing? I've thought that the only difference is that the AGU unit uses the result as the address for a memory load/store operation in the next pipeline stage, instead of storing the result in a register, and the LEA unit stores the result in a register and nothing more. Am I correct? And looking at modern Intel microarchitectures I have not seen the "store address" and "load address" anymore. Have they disappeared? – isma Jul 21 '20 at 20:34
  • @isma: That looks like a bug or simplification in the diagram, with it showing only *data* connections, not address. Address-generation (doing `segment_base + base + index*scale + displacement`) is *part of* running a store-address uop. It's the part that the AGU does: just addressing-mode calculations. The other parts are translating virtual to physical and permission checking, and writing both virt and phys addresses into the store-buffer entry for that store. – Peter Cordes Jul 22 '20 at 01:27
  • @isma: LEA execution units do the same base + idx * scale + displacement calculations as AGUs (but without the segment_base part). Their outputs are connected back into the data-results part of the CPU, i.e. the bypass forwarding network and write-back to the physical register file, not to memory addressing hardware. They're part of the execution units on ALU execution ports, so LEA competes for ALU time, not against actual load/store. (Some in-order CPUs like Atom (not Silvermont) run LEA on the same actual AGUs that memory ops use, but OoO exec CPUs separate mem from ALU for scheduling.) – Peter Cordes Jul 22 '20 at 01:32
  • Thank you, I think I understood everything except a few little things: 1- What is that "segment_base" in the AGU? I mean, I thought there only existed something like "addl %eax, offset(base, index, scale)". 2- You talked about the AGU being part of a store-address uop. And what about the load-address uop? Doesn't it need an AGU? 3- When you mentioned "the other parts are translating virt to phys and permission checking": are those executed in the same cycle as the AGU? Or are they executed in another port in another pipeline stage before going to the load/store data port? 4- (continued in the following comment) – isma Jul 22 '20 at 07:45
  • @isma: Yes, of course address generation is part of a load uop, too. (Note that on Intel, a single load uop does both address and data, with the address as an input and the data as an output. Unlike stores where store-address and store-data are separate *inputs* to a store instruction). – Peter Cordes Jul 22 '20 at 07:47
  • @isma: No, I said the other parts of a store-address execution unit include checking the (linear AGU-result) address for permissions, and translating to phys. This should be obvious if you think about what a store-address operation has to do: write the physical address into the store buffer (and also virtual for fast-path store forwarding probes), and check that the address was valid. This is explained in my answer. Also, we know that a store instruction decodes to a micro-fused pair of store-address and store-data uops, not 3 or more, so you could have ruled out that guess by testing. – Peter Cordes Jul 22 '20 at 07:52
  • (read the previous comment first, please) 4- The AGU result is not copied back into any register (neither those of x86 nor those of that microarchitecture), right? And, when the store-address is executed, will the following step be to go to the load/store data port? Or, since the uop to generate the address and the uop to write the data are different uops, is it possible to do the store-address long before the store-data (and the same for load)? – isma Jul 22 '20 at 07:53
  • @isma: Every x86 addressing mode implies a segment if no segment override prefix was used. Normally DS, but SS if the base register is R/ESP or R/EBP. The segment base is fixed at `0` in 64-bit mode (except for FS and GS), but not in 32 and 16-bit mode. So `16(%esp)` is really `ss:16(%esp)`. Translating from seg:off to linear by adding the segment base to the offset, as well as calculating the "offset" value, is a necessary part of address generation. (But in practice CPUs are optimized for the segbase=0 case which is common even in 32-bit mode, and take an extra cycle otherwise). – Peter Cordes Jul 22 '20 at 07:55
  • @isma: right, load/store units in modern uarches like Sandybridge-family and Zen have dedicated AGUs in their load/store ports that don't connect back to the registers. It's just a couple 64-bit adders and a variable-count left shifter; this is very cheap to replicate and put close to where it's needed, instead of sending data far across the chip to that hardware in the normal ALU. (Or vice versa if you want LEA to use an AGU). In-order Atom runs LEA on an AGU in an earlier pipeline stage. – Peter Cordes Jul 22 '20 at 08:00
  • Re: multiple uops: yes, store addr or data can execute independently in either order (or the same clock cycle); the store buffer entry was reserved at issue/rename time. I explained that in [Size of store buffers on Intel hardware? What exactly is a store buffer?](https://stackoverflow.com/q/54876208), go read it. See also discussion on [Are two store buffer entries needed for split line/page stores on recent Intel?](https://stackoverflow.com/q/61180829) – Peter Cordes Jul 22 '20 at 08:01
  • Thank you! It's true, the load/store-addr is only 1 uop, which means that it is executed in only one execution unit of one port (in the case of the Haswell diagram (link in my first comment) this is in ports 2, 3 or 7). And I already read that answer! But there is still only 1 thing that I didn't understand in that answer: Does a store-address take away bandwidth from a store-data execution? I mean, is it blocking any write port or something? And about your last comment: can store-data really be executed at the same time as, or even before, the store-addr of that block? – isma Jul 22 '20 at 08:13
  • @isma: store-data has a dedicated execution port, so no, it's 100% independent of store-address execution. For the store instruction to retire (and the store-buffer entry to "graduate" = become eligible to commit), both uops have to be individually complete, but there's no other ordering requirement before retirement. Store-forwarding may result in a load uop *waiting* for a store-data uop to execute (or retrying until it does), if the store-address uop executed earlier so the load can tell for sure that this load is reloading an in-flight store. – Peter Cordes Jul 22 '20 at 08:19
  • Nice! Thank you so much! rereading your comments (I always do it to be sure that I have not missed any detail) I remembered a question that I forgot to ask you, perhaps the fastest to answer of all: You said: "Note that on Intel, a single load uop does both address and data", does it mean that an intel load is NOT decoded to a micro-fused pair of addr and data? Is it only decoded in 1 uop that will do both things (addr and data) "at the same time"? – isma Jul 22 '20 at 08:28
  • @isma: Yes, and I already explained that and the reason for it in an earlier comment! This is something you could have tested yourself, and found in lots of other places. That's why loads can (sometimes) micro-fuse with ALU uops. [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634). Note that it's not "at the same time", though: for a load, address goes in, data comes out (a few cycles later). It's from the same uop, but it has non-zero latency. As opposed to stores, where 2 different things are inputs to the store instruction, and get written to the MOB. – Peter Cordes Jul 22 '20 at 08:35

It's been a few days without a response, so here's my best guess at "answering my own question".

The raw x86 instruction set isn't executed directly by modern processors. Instead, the x86 instruction set is "compiled" down into Micro-ops (uOps) before being executed by the Intel core. This shouldn't be too surprising, because some x86 instructions can be complex. An example taken from the optimization guide is as follows:

Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components.

MOV [ESP+ECX*4+12345678], AL

This is currently found on page 50 of the optimization manual (2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)).

In this case, the address of the store operation is complex, so it is its own uOp. So at the very least, this single x86 instruction gets converted into two uOps internally. The names of these two uOps are "Store Address" and "Store Data". The manual doesn't describe the internal uOps at all, so it may take even more than two uOps to accomplish.
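So, roughly (my own annotation of how the split presumably works):

```asm
mov [esp+ecx*4+12345678], al
; -> "store address" uop: computes esp + ecx*4 + 12345678 and records the
;    target address for the store
; -> "store data" uop: supplies the value of AL to be written at that address
```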

Since there's only one "store data" port on Skylake systems, that means Skylake can modify at most one memory location per cycle. The three "Store Address" ports mean that Skylake can calculate the effective address of many instructions simultaneously (possibly because some very complicated addresses may take more than one uOp to execute??).

Dragontamer5788
  • I must have missed this question when you originally posted it :( p2/p3 also run load uops. A single store only ever needs one store-address uop. (Except for AVX512 scatters on SKX, of course.) – Peter Cordes Dec 11 '17 at 21:47