Actual 8086 isn't pipelined at all (except for prefetch); it's microcoded. It finishes write-back of one instruction before starting decode of the next; the only hazard effect is discarding the prefetch buffer after branches.
x86 instructions can be hard to pipeline (especially memory-destination instructions); it wasn't until 486 / Pentium that it was really done, and even then complex instructions would stall the in-order pipeline (basically a hazard within one instruction, like `add [edx], eax` or `pop eax`). It wasn't until Pentium Pro (the P6 microarchitecture) that even instructions like that could be handled efficiently (by decoding to 1 or more uops and handling those via out-of-order exec). See Agner Fog's microarch guide: https://agner.org/optimize/
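As a rough sketch of that decode split (the uop breakdown shown in comments is illustrative, not Intel's actual internal encoding; see Agner Fog's instruction tables for real uop counts):

```asm
; On a P6-style core, a memory-destination instruction splits into
; separate uops, so the load, ALU op, and store can each flow through
; the out-of-order machinery independently:
;   uop 1: load   tmp   <- [edx]       ; memory read
;   uop 2: add    tmp   <- tmp + eax   ; ALU
;   uop 3: store  [edx] <- tmp         ; memory write
add [edx], eax
```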
(Real P6-family and other out-of-order exec x86 microarchitectures hide WAW and WAR hazards by register renaming. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) and Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs.)
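A minimal sketch of the WAW and WAR hazards that renaming removes (register and addressing choices here are arbitrary):

```asm
mov eax, [esi]   ; write EAX: start of the 1st dependency chain
add ebx, eax     ; read EAX: true RAW dependency, must wait for the load
mov eax, [edi]   ; write EAX again: WAW vs. the first mov, WAR vs. the add.
                 ; Renaming gives this mov a fresh physical register,
                 ; so both chains can execute in parallel.
add ecx, eax     ; RAW only on the renamed (second) EAX
```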
The code you've shown isn't strictly x86: there's no `LOAD` mnemonic; x86's pure load instruction is called `mov`. Also, `LOAD DL, BL` makes no sense: neither operand can address memory; they're both 8-bit registers. If you meant a copy between registers, that's also `mov dl, bl`.
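For reference, the valid `mov` forms that code was presumably reaching for (16-bit addressing, where `bx` can be a base register but `bl` / `dl` can't):

```asm
mov dl, bl       ; register-to-register copy
mov dl, [bx]     ; actual load: byte from DS:BX into DL
mov [bx], dl     ; store: DL to memory at DS:BX
```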
> I would like to see an example of both hazards (control hazard + data hazard) in the same code
A simple example would be an indirect branch (control hazard) whose target was recently written (true RAW data dependency).
e.g. if we assume 16-bit mode (since you mentioned 8086):
```asm
push offset target  ; modifies SP (the stack pointer), then stores to memory at SS:SP
ret                 ; ordinary near return = pop ip
target:
push 123
ret
```
RET has 2 inputs:
- the SP register (just written by `push`: a RAW hazard)
- the memory pointed to by SP (also just written by `push`: a RAW memory hazard).
RET writes SP (WAR hazard, although RET itself was the last reader). Also WAW if we consider that push and ret both write SP.
RET does an indirect jump (basically `pop ip`) using the address loaded from memory (a control hazard for the pipeline, if any). All current CPUs will mispredict that `ret` because they have a special call/ret predictor stack that assumes `ret` will jump to the return address of a matching `call`, like normal code does. (http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/)
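To contrast with the push/ret trick above, this is the matched pairing the return-address-stack predictor is built for (a sketch; the labels are made up):

```asm
    call func      ; CPU pushes the return address onto its internal
                   ; return-address stack (RAS) predictor
after_call:        ; ...and execution resumes here, correctly predicted
    hlt
func:
    ret            ; predicted from the RAS: it matches the call above.
                   ; A ret reached via push+ret has no matching call
                   ; on the RAS, so it's essentially always mispredicted.
```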
The `push 123` at the `ret` target address:

- reads and writes SP (RAW and WAR hazards)
- writes to memory that the previous push wrote (a WAW memory hazard), and which RET just read (a WAR memory hazard).
I put a `push` after the `ret` in case you want to look at just the ret/push pair, with the push in the "shadow" of a possibly mispredicted branch.
Of course a store buffer with store forwarding hides / handles the memory data hazards, effectively renaming the memory / cache location. (x86's memory ordering model is basically program-order + a store buffer with store-forwarding: cores are allowed to reload their own stores before they become globally visible.)
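A minimal store-forwarding sketch (16-bit, to match the earlier example):

```asm
mov [bx], ax     ; store sits in the store buffer until it commits to cache
mov cx, [bx]     ; this core's own reload can be forwarded from the
                 ; store buffer before the store is globally visible
                 ; to other cores
```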
Modern x86 CPUs handle the RAW data dependency chain through the stack pointer with a "stack engine" that tracks an offset from the stack pointer and can absorb multiple stack-pointer updates per clock cycle. (And equally importantly, it removes the need for an extra uop to actually do the addition to E/RSP in the back-end, so `push` / `pop` can be single-uop.) So it's effectively an alternate zero-latency mechanism for executing the stack-pointer-modification part of stack instructions. Reading E/RSP directly (e.g. `mov bp, sp`) leads to a stack-sync uop (on Intel CPUs) that zeros the saved offset and applies it to the back-end's value, if the offset was non-zero.
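For example (Intel-CPU behavior as described above; the uop accounting is the point, not the particular values):

```asm
push ax          ; stack engine: offset -= 2, no back-end ALU uop needed
push bx          ; offset -= 4 total, still tracked by the stack engine
mov bp, sp       ; reads SP directly: a stack-sync uop is inserted first,
                 ; applying the -4 offset to the back-end's SP value
                 ; and zeroing the saved offset
pop bx           ; offset tracking resumes
pop ax
```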