How does the JMP instruction affect the 8086's working?

Question

I get that the 8086 has a BIU and an EU, and that this split helps pipeline the processor. The BIU has a 6-byte prefetch queue, which fetches the bytes that follow the address the Instruction Pointer points to. Now, when the instruction to be executed turns out to be a jump to another location, what happens to the 6 bytes that were prefetched? Do they get flushed and then reloaded? (That destroys the pipelining efficiency of the processor, doesn't it?)

Peter Cordes · Answer 1 · 2019-07-19T16:12:37.057

Yes, jumps discard the instruction prefetch queue in 8086 and later microarchitectures that work similarly. Instruction fetch after any control transfer starts with an empty buffer.

For JIT / self-modifying code, this means any jump is sufficient to avoid stale instruction fetch.
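As a rough illustration of why a jump is enough, here is a toy Python model of a prefetch queue. It is a sketch, not a description of the real hardware: `ToyBIU` and its methods are hypothetical names, the model fetches one byte at a time (the real 8086 fetches 16-bit words), and it ignores timing entirely. It only shows that bytes already sitting in the queue do not see a later store, and that discarding the queue on a jump forces a fresh fetch from memory.

```python
from collections import deque

QUEUE_SIZE = 6  # bytes in the 8086 prefetch queue


class ToyBIU:
    """Toy model of a prefetch queue (hypothetical, not cycle-accurate)."""

    def __init__(self, memory, ip=0):
        self.memory = memory      # bytearray standing in for RAM
        self.fetch_addr = ip      # next address the model will prefetch
        self.queue = deque()      # prefetched bytes, oldest first

    def prefetch(self):
        """Fill the queue from memory until it holds QUEUE_SIZE bytes."""
        while len(self.queue) < QUEUE_SIZE and self.fetch_addr < len(self.memory):
            self.queue.append(self.memory[self.fetch_addr])
            self.fetch_addr += 1

    def next_byte(self):
        """Hand the oldest prefetched byte to the execution unit."""
        if not self.queue:
            self.prefetch()
        return self.queue.popleft()

    def jump(self, target):
        """A control transfer discards the queue and restarts fetch at target."""
        self.queue.clear()
        self.fetch_addr = target


memory = bytearray([0x90] * 16)   # 16 NOP opcodes
biu = ToyBIU(memory)
biu.prefetch()                    # queue now holds a copy of bytes 0..5

memory[3] = 0x42                  # a store overwrites an already-prefetched byte

# Without a jump, the execution unit still receives the old copy of byte 3.
stale = [biu.next_byte() for _ in range(4)]

# After a jump to address 3, the queue is empty and byte 3 is re-read from memory.
biu.jump(3)
fresh = biu.next_byte()

print([hex(b) for b in stale], hex(fresh))
```

In this model the last element of `stale` is still `0x90`, while `fresh` is `0x42`: the only thing that made the new byte visible was the queue being emptied by the jump.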

> Do they get flushed and then reloaded? That destroys the pipelining efficiency of the processor, doesn't it?

The instructions that were in the buffer are from the wrong path, unless the jump was effectively a no-op such as a jmp +0 to the very next instruction. So they're not re-loaded; they were useless, and the correct path has to be fetched instead.

That's not great and is an extra cost for jumps. That's why later in-order CPUs like Pentium have branch prediction, so they can be fetching from the correct path before a jump even decodes. (Branch prediction needs to predict the existence of branches, e.g. given a fetch-block address, predict which block to fetch next, as well as predicting which way conditional branches will go.)

8086 was hardly an efficient pipeline the way that a 5-stage RISC is.

Instruction fetch was typically the main bottleneck on 8086 anyway, so the buffer probably wasn't full on most jumps. You're losing at most 6 bytes (3 word fetches) of wasted prefetch work, and probably less. (That's why optimizing for speed on 8086 is pretty much optimizing for code size, except for avoiding a few slow instructions like multiply. That's also why x86's compact variable-length instruction encoding was a good design for 8086.)
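To put a rough number on "at most 6 bytes (3 word fetches)", the bound can be worked out in a few lines of Python. This is back-of-the-envelope arithmetic, not a timing model: `wasted_fetch` is a hypothetical helper, and the figure of 4 clocks per bus cycle is an assumption (the minimum 8086 bus cycle with no wait states). It also counts only bus work that was thrown away, not added latency, since prefetch overlaps with execution.

```python
QUEUE_BYTES = 6            # 8086 prefetch queue size
BUS_WIDTH_BYTES = 2        # one 16-bit word per bus cycle
CLOCKS_PER_BUS_CYCLE = 4   # assumption: minimum bus cycle, no wait states


def wasted_fetch(bytes_in_queue):
    """Return (bus_cycles, clocks) spent fetching bytes that a jump discards."""
    bus_cycles = -(-bytes_in_queue // BUS_WIDTH_BYTES)  # ceiling division
    return bus_cycles, bus_cycles * CLOCKS_PER_BUS_CYCLE


worst = wasted_fetch(QUEUE_BYTES)   # full queue: (3, 12)
nearly_empty = wasted_fetch(2)      # one word queued: (1, 4)
print(worst, nearly_empty)
```

Under these assumptions, even the worst case is three bus cycles of discarded work, and a nearly empty queue (the likely case right after a 2 or 3 byte jump) costs about one.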

I don't know how long it took a jump to decode/execute, but a jump is 2 or 3 bytes long (in x86-16), or even 4+ for indirect jumps with opcode+modrm+disp16 + optional prefixes. The jump instruction that just executed probably left the prefetch buffer close to empty on 8086.

Formally on paper, the x86 ISA at least used to require a serializing instruction like iret or cpuid to avoid any risk of a stale instruction. But to avoid breaking existing code, no real x86 CPU has ever required more than a jump¹.

Modern OoO x86 CPUs with branch prediction + speculative execution, like P6-family, don't require anything; they aggressively snoop the pipeline to detect stores that overlap in-flight instructions. See the related question "Observing stale instruction fetching on x86 with self-modifying code".

Footnote 1: Going beyond the paper specs to maintain compat with widely-used software is a common occurrence in x86 over the years; true backwards compat and binary compatibility was basically the main selling point of x86 vs. cleaner RISC ISAs until x86 became so dominant that other ISAs gave up on targeting the high-power / high-performance market. (Thanks to growing transistor budgets and clever design ideas making it possible to pay "the x86 tax" and still run fast.)

There would be huge market resistance to a new faster CPU that couldn't run existing versions of DOS, Windows, Lotus Notes, or whatever.

Just curious, is there a reference for a serializing instruction being required? The modern documentation says either a jump or a `cpuid` suffices. — Nate Eldredge, Dec 03 '22 at 05:35
@NateEldredge: I haven't checked recently. I think I remember reading that at some point a serializing instruction was required on paper, even though real CPUs had always been fine with a jump. But I don't remember who said that. — Peter Cordes, Dec 03 '22 at 07:31
