4

After reading this:

When an interrupt occurs, what happens to instructions in the pipeline?

There is not much information on what happens to software interrupts but we do learn the following:

Conversely, exceptions, things like page faults, mark the instruction affected. When that instruction is about to commit, at that point all later instructions after the exception are flushed, and instruction fetch is redirected.

I was wondering what happens to software interrupts (INT 0xX) in the pipeline. Firstly, when are they detected? Are they detected at the predecode stage perhaps? In the instruction queue? At the decode stage? Or do they reach the backend and immediately complete (without entering the reservation station), retire in turn, and only then does the retirement stage pick up that it is an INT instruction (which seems wasteful)?

Let's say it is picked up at predecode: there must be a method of signalling the IFU to stop fetching instructions, or indeed of clock/power gating it; or, if it's picked up in the instruction queue, a way of flushing instructions before it in the queue. There must then be a way of signalling some sort of logic (a 'control unit', for instance) to generate the uops for the software interrupt (indexing into the IDT, checking DPL >= CPL >= segment RPL, etc.). That's a naive suggestion, but if anyone knows this process any better, great.

I also wonder how the CPU handles it when this process is disturbed, i.e. a hardware interrupt occurs (bearing in mind traps don't clear IF in EFLAGS) and it now has to begin a whole new process of interrupt handling and uop generation. How would it get back to its state of handling the software interrupt afterwards?

Lewis Kelsey
  • Do the statements from the quoted answer _"most machines throw away all instructions in the pipeline, in pipestages before the pipestage where the interrupt logic lives"_ and _"the interrupt logic typically lives in the last stage of the pipeline, WB, corresponding roughly to the commit pipestage of advanced machines."_ not answer this? – Patrick Roberts Jan 29 '19 at 19:03
  • @PatrickRoberts The first one I was pretty much already aware of, the second one, thanks for finding that, I must have skimmed over it. I'm still not sure about the quandary in the final paragraph though, it doesn't address that of course. I'd like to know the specifics of the control unit that deals with this – Lewis Kelsey Jan 29 '19 at 19:18
  • @PatrickRoberts but more importantly, what *are* those instructions after it that it throws away? Does that mean the BPU is capable of resteering the pipeline to the address in the IDT? Otherwise, what instructions would they be? What's the point in fetching them if they're just going to be thrown away; wouldn't it be more efficient to catch the software interrupt earlier in the pipeline? – Lewis Kelsey Jan 30 '19 at 01:00

2 Answers

5

I agree with everything Peter said in his answer. While there can be many ways to implement the INTn instructions, the implementation would most probably be tuned for CPU design simplicity rather than performance. The earliest point at which it can be non-speculatively determined that such an instruction exists is at the end of the decode stage of the pipeline. It might be possible to predict whether the fetched bytes contain an instruction that may or does raise an exception, but I couldn't find a single research paper that studies this idea, so it doesn't seem to be worth it.

Execution of INTn involves fetching the specified entry from the IDT, performing many checks, calculating the address of the exception handler, and then telling the fetch unit to start prefetching from there. This process depends on the operating mode of the processor (real mode, 64-bit mode, etc.). The mode is described by multiple flags from the CR0, CR4, and EFLAGS registers. Therefore, it would take many uops to actually invoke an exception handler. In Skylake, there are 4 simple decoders and 1 complex decoder. The simple decoders can only emit a single fused uop. The complex decoder can emit up to 4 fused uops. None of them can handle INTn, so the MSROM needs to be engaged in order to execute the software interrupt. Note that the INTn instruction itself might cause an exception. At this point, it's unknown whether INTn itself will change control to the specified exception handler (whatever its address is) or to some other exception handler. All that is known for sure is that the instruction stream will definitely end at INTn and begin somewhere else.
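
To get a feel for how much work that is, here is a rough C sketch of the checks and transfers the INTn microcode conceptually has to perform in 32-bit protected mode, loosely following the Operation section of the SDM's INT n entry. All structure layouts and helper functions (read_idt_entry, raise_fault, etc.) are hypothetical and simplified; inter-privilege stack switches, task gates, IST, and error-code details are omitted.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a 32-bit IDT gate descriptor. */
struct idt_gate {
    uint32_t offset;     /* handler entry point                    */
    uint16_t selector;   /* target code-segment selector           */
    uint8_t  type;       /* interrupt gate vs. trap gate           */
    uint8_t  dpl;        /* descriptor privilege level of the gate */
    uint8_t  present;
};

/* Hypothetical helpers, standing in for microcode operations. */
struct idt_gate read_idt_entry(uint8_t vector);   /* reads idtr.base + vector*8 */
void raise_fault(int fault, uint8_t vector);      /* #GP, #NP, ...              */
int  current_cpl(void);
void push32(uint32_t v);
uint32_t read_eflags(void), read_cs(void), return_eip(void);
int  is_interrupt_gate(uint8_t type);
void clear_IF(void);
void load_cs(uint16_t sel);
void jump_to(uint32_t eip);

/* Same-privilege, 32-bit protected-mode path only. */
void intn_dispatch(uint8_t vector, int is_software_int)
{
    struct idt_gate gate = read_idt_entry(vector);

    if (!gate.present)
        raise_fault(/*#NP*/ 11, vector);

    /* Software interrupts (INT n, INT3, INTO) take #GP if the gate's DPL
     * is below the current privilege level; hardware interrupts skip
     * this check.                                                       */
    if (is_software_int && gate.dpl < current_cpl())
        raise_fault(/*#GP*/ 13, vector);

    push32(read_eflags());
    push32(read_cs());
    push32(return_eip());        /* address of the instruction after INT n */

    if (is_interrupt_gate(gate.type))
        clear_IF();              /* trap gates leave IF set */

    load_cs(gate.selector);      /* triggers further checks on the target segment */
    jump_to(gate.offset);        /* redirect instruction fetch to the handler */
}
```

Every one of those steps has to be expressed as uops coming out of the MSROM, which is why the simple and complex decoders can't handle INTn on their own.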

There are two possible ways in which the microcode sequencer is activated. The first one is when decoding a macroinstruction that requires more than 4 uops, similar to rdtsc. The second is when retiring an instruction where at least one of its uops has a valid event code in its ROB entry. According to this patent, there is a dedicated event code for software interrupts. So I think INTn is decoded into a single uop (or up to 4 uops) that carries the interrupt vector with it. The ROB already needs to have a field to hold information that describes whether the corresponding instruction has raised an exception and what kind of exception. The same field could be used to hold the interrupt vector (a purely illustrative sketch of such an entry follows the list below). The uop simply passes through the allocation stage and may not need to be scheduled into one of the execution units because no computation needs to be done. When the uop is about to retire, the ROB determines that it is INTn and that it should raise an event (see Figure 10 from the patent). At this point, there are two possible ways to proceed:

  • The ROB invokes a generic microcode assist that first checks the current operating mode of the processor and then selects a specialized assist that corresponds to the current mode.
  • The ROB unit itself includes logic to check the current operating mode and selects the corresponding assist. It passes the assist address to the logic responsible for raising events, which in turn directs the MSROM to emit the assist routine stored at that address. This routine contains uops that fetch the IDT entry and perform the rest of the exception handler invocation process.
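
Purely to illustrate the idea that one per-entry field can describe both exceptions discovered at execution time and the software-interrupt event attached at decode, here is a hypothetical ROB entry and retirement check in C. None of these field names or values come from Intel documentation; only the existence of an event code for software interrupts is taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented event codes, for illustration only. */
enum event_code {
    EVT_NONE = 0,
    EVT_PAGE_FAULT,             /* marked by the memory pipeline at execution */
    EVT_INVALID_OPCODE,         /* marked by the decoders                     */
    EVT_SOFTWARE_INTERRUPT,     /* marked at decode for INT n                 */
};

/* Heavily simplified, hypothetical ROB entry. */
struct rob_entry {
    uint64_t next_ip;           /* fall-through address, to be pushed by the assist */
    bool     completed;         /* uop has executed (or needed no execution)        */
    uint8_t  event;             /* enum event_code                                  */
    uint8_t  event_info;        /* e.g. the interrupt vector for INT n              */
};

void raise_event(uint8_t event, uint8_t info);   /* hands control to an MSROM assist */

/* Nothing architectural happens until the uop reaches the head of the ROB. */
bool try_retire(struct rob_entry *head)
{
    if (!head->completed)
        return false;                        /* still executing: wait */

    if (head->event != EVT_NONE) {
        /* Flush younger uops, stop the front end, and let the event logic
         * pick the assist routine based on (event, event_info, mode).     */
        raise_event(head->event, head->event_info);
        return false;
    }
    return true;                             /* retire normally */
}
```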

During the execution of the assist, an exception may occur. This will be handled like any other instruction that causes an exception: the ROB unit extracts the exception description from the ROB and invokes an assist to handle it.

Invalid opcodes can be handled in a similar fashion. At the predecode stage, the only thing that matters is correctly determining the lengths of the instructions that precede the invalid opcode; after these valid instructions, the boundaries are irrelevant. When a simple decoder receives an invalid opcode, it emits a special uop whose sole purpose is to raise an invalid-opcode exception. The other decoders, responsible for whatever follows the last valid instruction, can all emit the same special uop. Since instructions are retired in order, it's guaranteed that the first special uop will raise an exception, unless of course a previous uop raised an exception or a branch misprediction or memory-ordering clear event occurred first.

When any of the decoders emits that special uop, the fetch and decode stages could stall until the address of the macro-instruction exception handler is determined. This could be either the exception specified by the uop or some other exception. Every stage that processes that special uop can just stall (power down / clock gate) itself. This saves power and I think it would be easy to implement.

Or, if the other logical core is active, treat it like any other reason for this logical thread to give up its front-end cycles to the other hyperthread. Allocation cycles normally alternate between hyperthreads, but when one is stalled (e.g. ROB full or front-end empty) the other thread can allocate in consecutive cycles. This might also happen in the decoders, and maybe that could be tested with a large enough block of code to stop it running from the uop cache (or one too dense to go into the uop cache).

Hadi Brais
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/187692/discussion-on-answer-by-hadi-brais-what-happens-to-software-interrupts-in-the-pi). – Samuel Liew Feb 01 '19 at 00:16
  • @HadiBrais if you think about it, the decoders have to issue uops for push and pop. If it sees push eip and it is in real mode, for instance, it would have to issue `mov [sp-2], ip` and `sp -= 2`; the `sp -= 2` is then removed by the stack engine and the whole thing becomes `mov [addr], ip`. Perhaps the decoders don't need an assist to check the mode and just keep track of the mode themselves (by detecting IA32_EFER.LME or PE-bit instructions), setting a private bit when they are detected and issuing instructions accordingly, letting the OoO core catch up; the mode doesn't officially change until retire. – Lewis Kelsey Mar 05 '19 at 23:29
  • So it stalls, issues 1 uop based on the mode it detects (1 for each mode) and this causes an exception, and the decoders issue the routine corresponding to the uop. It might choose to stall/not stall if it keeps track of the number of conditional branch instructions before it – Lewis Kelsey Mar 05 '19 at 23:42
  • @LewisKelsey Right. For a simple instruction like `PUSH`, the number of uops is the same in all operating modes. It's possible that the uops themselves are the same as well, because this may reduce the total number of uops to encode, thereby reducing the number of bits required to encode all the uops. The mode only needs to be checked when the size of the operands is required, which is at the time of reading/writing from/to registers/memory. But more complex instructions such as `FXSAVE` may require a different number of uops in different modes, and this can be implemented in different ways as discussed in my answer. – Hadi Brais Mar 05 '19 at 23:59
  • @LewisKelsey We know that changing the operating mode is a fully serializing operation, so there can be no two instructions in flight in different operating modes. – Hadi Brais Mar 06 '19 at 00:04
  • @HadiBrais I forgot about the manual saying writing to control registers is serialising. It makes sense, it would be silly to optimise for the case of changing the cpu mode I suppose, seeing as it's only going to happen a trivial amount of times. So it's probably ok to stall when changing the mode rather than adding extra tracking logic like I surmised. – Lewis Kelsey Mar 06 '19 at 00:57
3

That quote from Andy @Krazy Glew is about synchronous exceptions discovered during execution of a "normal" instruction, like mov eax, [rdi] raising #PF if it turns out that RDI is pointing to an unmapped page.1 You expect that not to fault, so you defer doing anything until retirement, in case it was in the shadow of a branch mispredict or an earlier exception.


But yes, his answer doesn't go into detail about how the pipeline optimizes for synchronous int trap instructions that we know upon decode will always cause an exception. Trap instructions are also pretty rare in the overall instruction mix, so optimizing for them doesn't save you a lot of power; it's only worth doing the things that are easy.

As Andy says, current CPUs don't rename the privilege level and thus can't speculate into an interrupt/exception handler, so stalling fetch/decode after seeing an int or syscall is definitely a sensible thing to do. I'm just going to write int or "trap instruction", but the same goes for syscall/sysenter/sysret/iret and other privilege-changing "branch" instructions, and for the 1-byte versions of int like int3 (0xcc) and int1 (0xf1). The conditional trap-on-overflow into is interesting; for non-horrible performance in the no-trap case it's probably assumed not to trap. (And of course there are vmcall and stuff for VMX extensions, and probably SGX EENTER, and probably other stuff. But as far as stalling the pipeline is concerned, I'd guess all trap instructions are equal except for the conditional into.)


I'd assume that like lfence, the CPU doesn't speculate past a trap instruction. You're right, there'd be no point in having those uops in the pipeline, because anything after an int is definitely getting flushed.

IDK if anything would fetch from the IVT (real-mode interrupt vector table) or IDT (interrupt descriptor table) to get the address of an int handler before the int instruction becomes non-speculative in the back-end. Possibly. (Some trap instructions, like syscall, use an MSR to set the handler address, so starting code fetch from there would possibly be useful, especially if it triggers an L1i miss early. This has to be weighed against the possibility of seeing int and other trap instructions on the wrong path, after a branch miss.)
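
For concreteness, the 64-bit SYSCALL target comes from the IA32_LSTAR MSR (0xC0000082), which the kernel programs once at boot; the sketch below shows the idea (kernel-mode C, long mode only; entry_syscall is a hypothetical entry stub, and the companion IA32_STAR/IA32_FMASK setup is omitted).

```c
#include <stdint.h>

#define MSR_IA32_LSTAR 0xC0000082u      /* RIP loaded by SYSCALL in 64-bit mode */

/* wrmsr writes EDX:EAX into the MSR selected by ECX; requires CPL 0. */
static inline void wrmsr(uint32_t msr, uint64_t value)
{
    __asm__ volatile ("wrmsr"
                      :: "c"(msr), "a"((uint32_t)value), "d"((uint32_t)(value >> 32))
                      : "memory");
}

extern void entry_syscall(void);        /* hypothetical kernel syscall entry stub */

void setup_syscall_entry(void)
{
    /* After this, every SYSCALL starts fetching at entry_syscall, so the
     * handler address comes from an MSR rather than a memory-resident
     * table like the IDT.                                               */
    wrmsr(MSR_IA32_LSTAR, (uint64_t)entry_syscall);
}
```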

Mis-speculation hitting a trap instruction is probably rare enough that it would be worth it to start loading from the IDT or prefetching the syscall entry point as soon as the front-end sees a trap instruction, if the front-end is smart enough to handle all this. But it probably isn't. Leaving the fancy stuff to microcode makes sense to limit complexity of the front end. Traps are rare-ish, even in syscall-heavy workloads. Batching work to hand off in bigger chunks across the user/kernel barrier is a good thing, because cheap syscall is very very hard post Spectre...


So at the latest, a trap would be detected in issue/rename (which already knows how to stall for (partially) serializing instructions), and no further uops would be allocated into the out-of-order back end until the int had retired and the exception was being taken.

But detecting it in decode seems likely, and not decoding further past an instruction that definitely takes an exception. (And where we don't know where to fetch next.) The decoder stage does know how to stall, e.g. for illegal-instruction traps.

Let's say it is picked up at predecode

That's probably not practical: you don't know it's an int until full decode. Pre-decode is just instruction-length finding on Intel CPUs. I'd assume that the opcodes for int and syscall are just two of many that have the same length.
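
To illustrate, these are the standard machine-code encodings of the trap instructions mentioned above (the byte arrays are just for reference); their lengths look like any other short instruction, so length-finding alone learns nothing special about them.

```c
/* Standard x86 encodings; nothing about the lengths marks these as traps. */
static const unsigned char enc_int3[]     = { 0xCC };        /* INT3                          */
static const unsigned char enc_int1[]     = { 0xF1 };        /* INT1 / ICEBP                  */
static const unsigned char enc_into[]     = { 0xCE };        /* INTO (invalid in 64-bit mode) */
static const unsigned char enc_int_n[]    = { 0xCD, 0x80 };  /* INT imm8, here INT 0x80       */
static const unsigned char enc_iret[]     = { 0xCF };        /* IRET / IRETD                  */
static const unsigned char enc_syscall[]  = { 0x0F, 0x05 };  /* SYSCALL                       */
static const unsigned char enc_sysenter[] = { 0x0F, 0x34 };  /* SYSENTER                      */
```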

Building in HW to look deeper, searching for trap instructions, would cost more power than it's worth in pre-decode. (Remember, traps are very rare, and detecting them early mostly only saves power, so you can't spend more power looking for them than you save by stopping pre-decode after passing along a trap to the decoders.)

You need to decode the int so its microcode can execute and get the CPU started again running the interrupt handler, but yes in theory you could have pre-decode stall in the cycle after passing it through.

It's the regular decoders where jump instructions that branch-prediction missed are identified, for example, so it makes much more sense for the main decode stage to handle traps by not going any further.


Hyperthreading

You don't just power-gate the front-end when you discover a stall. You let the other logical thread have all the cycles.

Hyperthreading makes it less valuable for the front-end to start fetching from memory pointed to by the IDT without the back-end's help. If the other thread isn't stalled, and can benefit from the extra front-end bandwidth while this thread sorts out its trap, the CPU is doing useful work.

I certainly wouldn't rule out code-fetch from the SYSCALL entry-point, because that address is in an MSR, and it's one of the few traps that is performance-relevant in some workloads.

Another thing I'm curious about is how much impact, if any, one logical core switching privilege levels has on the performance of the other core. To test this, you'd construct some workload that bottlenecks on your choice of front-end issue bandwidth, a back-end port, back-end dep-chain latency, or the back-end's ability to find ILP over a medium to long distance (RS size or ROB size), or a combination, or something else. Then compare cycles/iteration for that test workload running on a core to itself, sharing a core with a tight dec/jnz thread, a 4x pause / dec/jnz workload, and a syscall workload that makes ENOSYS system calls under Linux. Maybe also an int 0x80 workload to compare different traps.
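
A rough sketch of such an experiment (Linux, x86-64, link with -lpthread). The CPU numbers 0 and 4 are assumptions; replace them with an actual HT sibling pair from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, and run it once with the aggressor commented out to get a baseline.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define ITERS 100000000ULL

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Victim: a latency-bound store/reload dependency chain. */
static void *victim(void *arg)
{
    (void)arg;
    pin_to_cpu(0);                           /* assumed: one HT sibling */
    volatile uint64_t x = 0;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        x += i;                              /* store/reload chain each iteration */
    uint64_t cycles = __rdtsc() - start;
    printf("victim: %.2f reference cycles per iteration\n",
           (double)cycles / (double)ITERS);
    return NULL;
}

/* Aggressor: hammer an invalid syscall number; each call is a full
 * user -> kernel -> user round trip that just returns -ENOSYS.      */
static void *aggressor(void *arg)
{
    (void)arg;
    pin_to_cpu(4);                           /* assumed: the other HT sibling */
    for (;;)
        syscall(100000);                     /* no such syscall => ENOSYS */
    return NULL;
}

int main(void)
{
    pthread_t v, a;
    pthread_create(&a, NULL, aggressor, NULL);   /* comment out for the baseline run */
    pthread_create(&v, NULL, victim, NULL);
    pthread_join(v, NULL);                       /* process exit kills the aggressor */
    return 0;
}
```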


Footnote 1: Exception handling, like #PF on a normal load.

(Off topic, re: innocent looking instructions that fault, not trap instructions that can be detected in the decoders as raising exceptions).

You wait until commit (retirement) because you don't want to start an expensive pipeline flush right away, only to discover that this instruction was in the shadow of a branch miss (or an earlier faulting instruction) and shouldn't have run (with that bad address) in the first place. Let the fast branch-recovery mechanism catch it.

This wait-until-retirement strategy (and a dangerous L1d cache that doesn't squash the load value to 0 for L1d hits where the TLB says the page is valid but has no read permission) is the key to why the Meltdown and L1TF exploits work on some Intel CPUs. (http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/). Understanding Meltdown is pretty helpful for understanding synchronous exception-handling strategies in high-performance CPUs: marking the instruction and only doing anything when it reaches retirement is a good cheap strategy because exceptions are very rare.

It's apparently not worth the complexity to have execution units signal back to the front-end to stop fetch / decode / issue if any uop in the back end detects a pending #PF or other exception. (Presumably because that would more tightly couple parts of the CPU that are otherwise pretty far apart.)

And because instructions from the wrong path might still be in flight during fast recovery from a branch miss, and making sure you only stop the front-end for expected faults on what we think is the current correct path of execution would require more tracking. Any uop in the back-end was at one point thought to be on the correct path, but it might not be anymore by the time it gets to the end of an execution unit.

If you weren't doing fast recovery, then maybe it would be worth having the back-end send a "something is wrong" signal to stall the front-end until the back-end either actually takes an exception, or discovers the correct path.

With SMT (hyperthreading), this could leave more front-end bandwidth for other threads when a thread detected that it was currently speculating down a (possibly correct) path that leads to a fault.

So there is maybe some merit to this idea; I wonder if any CPUs do it?

Peter Cordes
  • So faults are handled at retire, that clears that up, also, the instructions would be in protected pages so perhaps it cannot speculate if coming from user mode. 'issue/rename (which already knows how to stall for (partially) serializing instructions' are you talking about the FENCE instructions? Does it stall here? I thought maybe it would stall in the store/load buffers, and you know what, I'm starting to think int decodes from the MSROM into the appropriate instructions, roughly 20 or so I think, it dawned on me when you said it's impractical to pick up at predecode which... – Lewis Kelsey Jan 30 '19 at 08:40
  • ...made me think about it making more sense to pick up at decode where it has to check what the instructions are anyway (rather than wasting resources checking every instruction at predecode for a property) so I thought perhaps int comes from the Microsequencer – Lewis Kelsey Jan 30 '19 at 08:44
  • 1
    @LewisKelsey: oh, yes, `IA32_LSTAR` (syscall entry point) will normally be pointing to an address in a page with the supervisor bit set. But (from Meltdown) we already know L1d can speculatively read actual data from lines that hit in cache from such pages... Still, intentionally doing that would be surprising, even if we're just talking about pinging the cache line to start a demand load in case it's cold in L1i, not actually using code-fetch results. It could also probe the iTLB to start a page walk in case that iTLB entry was cold, though. – Peter Cordes Jan 30 '19 at 08:46
  • 1
    @LewisKelsey: No, I don't mean `mfence`/`sfence` memory barriers, I mean `lfence`, the *instruction*-serializing barrier. And `cpuid`, and other true Serializing Instructions, to use Intel's technical term. https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-273.html excerpts the relevant section of an Intel manual that defines and lists them. See [Are loads and stores the only instructions that gets reordered?](//stackoverflow.com/q/50494658) for the effect of LFENCE. – Peter Cordes Jan 30 '19 at 08:49
  • 1
    @LewisKelsey: yes, `int` is microcoded as many uops. And yes, I mentioned (somewhere in this big answer) that searching for it in pre-decode isn't worth the power that takes, because stopping pre-decode slightly sooner doesn't save enough power vs. just waiting for decode to find it. And yes, decode has to decode it, stopping maybe the cycle *after* doing so. – Peter Cordes Jan 30 '19 at 08:53
  • 'It's the regular decoders where jump instructions that branch-prediction missed are identified': you said this on another post, and I believe it was about unconditional jumps and then resteering on mispredicts, but how would it mispredict an unconditional jmp? Doesn't the BPU use static prediction for that? – Lewis Kelsey Jan 30 '19 at 08:53
  • 1
    Decoding a microcoded instruction just means making the "uop" a pointer to MSROM. It's like a single uop but it takes a whole uop-cache line to itself. MSROM is only accessed when that uop reaches the *end* of the IDQ, where normal uops are just issued/renamed into the RS and ROB. But microcoded uops redirect the IDQ to fetch from the MSROM instead of the IDQ, so things like `rep movsb` can feed an arbitrary number of uops into the back-end. (I forget if we know this detail for sure, of it's just what BeeOnRope and I came up with as a best-guess theory.) – Peter Cordes Jan 30 '19 at 08:55
  • You're right actually about meltdown I wasn't thinking about it again. Perhaps speculation does ignore page privileges in the TLB until it reaches retire – Lewis Kelsey Jan 30 '19 at 08:57
  • @LewisKelsey: even direct unconditional jumps need prediction. Fetch runs far ahead of final decode, especially on x86 where pre-decode is needed. After fetching 32 bytes, you need to know which 32-byte block to fetch next, long before the instructions in that block are decoded. Thus a next-fetch-block prediction is needed, based on current block addr. My understanding is there are multiple levels of branch prediction. See for example [Slow jmp-instruction](//stackoverflow.com/q/38811901) : enough back-to-back `jmp +0` instructions will overflow the branch predictors and be much slower. – Peter Cordes Jan 30 '19 at 08:58
  • Hmm. Yeah maybe the BPU uses the global history table for these predictions and the BTB for specific instruction boundaries at predecode. Perhaps it has another BTB for these 16 byte boundary addresses. – Lewis Kelsey Jan 30 '19 at 09:53
  • interestingly, the BOB is one of the components of the Intel microarchitecture that I haven't got my head around yet. I'm aware it takes snapshots of the RAT whenever rename encounters an unconditional branch, and perhaps pairs it with an instruction address. I'm guessing the advantage is that the snapshot can immediately be used and the ROB purges all instructions before it; the frontend can immediately fetch, rather than the original waiting until retirement and using the architectural state from the Retirement RAT (PRF architecture, SNB etc.) as the new image – Lewis Kelsey Jan 30 '19 at 20:45
  • @LewisKelsey: I don't think it snapshots for *un*conditional direct branches; we can count on the front-end decoders having the correct target for `call/jmp rel32`. Only conditional and indirect branches might detect a mispredict after they make it to the back-end. The point of the BOB is fast-recovery: flush uops only *after* the discovered mispredict, allowing execution of uops *before* to continue. (e.g. an unrolled loop, the loop logic can often run far ahead of the loop body, so a mispredict of the last iter fall through can be sorted without stalling any "real work".) – Peter Cordes Jan 30 '19 at 21:06
  • I meant conditional, wasn't thinking at all, and of course, and yes, my understanding is having the snapshot allows the ROB to flush the instructions *before* and leave the instructions that are *after* to continue executing, whereas without the BOB you have to wait until retire and use the architectural state. Anyway, fruitful discussion today it has brought my attention to many things and nudged me on the right path of thinking. – Lewis Kelsey Jan 30 '19 at 21:21
  • @LewisKelsey: you're still saying it backwards: instructions that (in program order) come *after* a mispredicted branch are on the wrong path and must be flushed. But yeah, overall interesting discussion. Kinda fun writing it all down and making sure the puzzle pieces in my head actually fit together coherently. Brings back memories of my own process of piecing together my current understanding from reading Agner Fog's guides, David Kanter's write-ups, discussion around Meltdown / Spectre, and other bits and pieces of CPU-architecture lore. :) – Peter Cordes Jan 30 '19 at 22:03
  • At that moment I was just talking about before and after visually in terms of sequential order in the ROB, I guess that can be a bit confusing at times. – Lewis Kelsey Jan 30 '19 at 23:09
  • @LewisKelsey: I wondered if you were thinking in terms of backwards terminology like that. Nobody does that :P Yes, the ROB works like a queue where allocate adds uops to one end, and retire removes uops at the other. (Physically implemented as a circular buffer.) But I think everyone talks about before/after in program order, just so it's always clear which direction we're talking about. Or "younger" or "older" is also used by some people / documents, which makes it very clear. – Peter Cordes Jan 30 '19 at 23:18