I was reading the Intel instruction manual and noticed there is a 'NOP' instruction that does nothing on the main CPU, and a 'FNOP' instruction that does nothing on the FPU. Why are there two separate instructions to do nothing?

The only thing different I saw was they throw different exceptions, so you might watch for an exception from FNOP to detect whether there's an FPU available. But aren't there other mechanisms like CPUID to detect this? What practical reason is there to have two separate NOP instructions?

Peter Cordes
Michael Burge
  • Originally, the 8086 and 8087 were separate chips. The 8086 did integer arithmetic, and `nop` told the 8086 to do nothing. The 8087 did floating point arithmetic, and `fnop` told the 8087 to do nothing. – Raymond Chen Jul 29 '14 at 05:34
  • I do think it is the #MF exception. It only gets raised when explicitly writing a waiting FPU instruction, usually FWAIT. When you start nopping out code with 0x90 then that exception gets raised at a wildly different place in the code. – Hans Passant Jul 29 '14 at 08:10

1 Answer

Expanding on Raymond Chen's and Hans Passant's comments, there are historical reasons both for there being two separate instructions and for why they don't quite have the same effect.

Neither of the two instructions, NOP and FNOP, was originally designed as an explicit no-operation instruction. The NOP instruction is actually just an alias for the instruction XCHG AX,AX (or, in 32-bit mode, XCHG EAX,EAX). On early Intel processors it didn't actually do nothing. While it had no externally visible effect, internally it was executed just like any other XCHG instruction and took as many cycles to execute. The '486 was the first Intel CPU to treat it specially: it could execute a NOP in 1 cycle, while any other register-to-register XCHG instruction took 3 cycles.
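To make the aliasing concrete, here is a minimal sketch of the one-byte "exchange register with accumulator" encodings, where the opcode byte is 0x90 plus the register number. The helper name `xchg_ax_opcode` is hypothetical and only covers the 16-bit register names.

```python
# Register numbers as used in x86 opcode encodings (AX is register 0).
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def xchg_ax_opcode(reg: str) -> int:
    """Return the single opcode byte for XCHG AX,<reg> (hypothetical helper)."""
    return 0x90 + REG16.index(reg)


# Print the whole one-byte XCHG family; the AX entry is the NOP byte.
for reg in REG16:
    print(f"XCHG AX,{reg} -> {xchg_ax_opcode(reg):02X}")
```

The AX row comes out as 90, which is why the byte 0x90 and the mnemonic NOP refer to the same instruction outside 64-bit mode.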

Treating the XCHG AX,AX instruction specially becomes very important in modern Intel processors. If it still actually exchanged the register with itself, it could introduce pipeline stalls whenever a nearby instruction also used the AX register. Because it's treated specially, the CPU doesn't end up thinking that the NOP needs to wait for a previous instruction that sets AX, or that a following instruction needs to wait for the NOP.

This brings up the fact that there are lots of different instructions that do nothing, though XCHG AX,AX is the only one that's a single byte (as a special case of the exchange-register-with-accumulator single-byte XCHG encodings). Often these instructions are used as a single-instruction substitute for several consecutive NOP instructions, for example when aligning the start of a loop for performance reasons. If you wanted a 6-byte NOP you could use LEA EAX,[EAX + 00000000]. Intel eventually added an explicit multiple-byte NOP instruction. (Well, not so much added as officially documented an instruction that had been there since the Pentium Pro.) However, only the single-byte form is treated specially; the multiple-byte NOPs will generate stalls if nearby instructions use the same registers.
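As a rough illustration of how padding with the documented multiple-byte NOP works, the sketch below builds an alignment gap from the 1- to 9-byte sequences that Intel's manual lists as recommended forms of `NOP` (opcode `0F 1F /0`, optionally with a `66` prefix). The table name `LONG_NOPS` and the function `nop_padding` are hypothetical; a real assembler picks its own sequences.

```python
# Recommended NOP byte sequences by length, per the NOP entry in Intel's manual.
LONG_NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("66 90"),
    3: bytes.fromhex("0f 1f 00"),
    4: bytes.fromhex("0f 1f 40 00"),
    5: bytes.fromhex("0f 1f 44 00 00"),
    6: bytes.fromhex("66 0f 1f 44 00 00"),
    7: bytes.fromhex("0f 1f 80 00 00 00 00"),
    8: bytes.fromhex("0f 1f 84 00 00 00 00 00"),
    9: bytes.fromhex("66 0f 1f 84 00 00 00 00 00"),
}


def nop_padding(n: int) -> bytes:
    """Fill n bytes using as few NOP instructions as possible (hypothetical helper)."""
    out = bytearray()
    while n > 0:
        chunk = min(n, 9)       # longest available sequence first
        out += LONG_NOPS[chunk]
        n -= chunk
    return bytes(out)


print(nop_padding(6).hex(" "))
```

Using the longest sequences first keeps the number of instructions the CPU has to decode small, which is the point of preferring one long NOP over a run of 0x90 bytes.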

When AMD added 64-bit support to their CPUs they went even further. NOP is no longer the equivalent of XCHG EAX,EAX in 64-bit mode. One of the problems with the Intel instruction set is that there are a lot of instructions that modify only part of a register. For example, MOV BX,AX only modifies the lower 16 bits of EBX, leaving the upper 16 bits unmodified. These partial modifications make it hard for the CPU to avoid stalls, so AMD decided to prevent that when using 32-bit instructions in 64-bit mode. Whenever the result of a 32-bit operation is stored in a (64-bit) register, the value is zero-extended to 64 bits so that the entire register is modified. This means XCHG EAX,EAX is no longer a NOP, as it clears the upper 32 bits of RAX (and thus if you explicitly write XCHG EAX,EAX, it can't assemble to 0x90 and has to use the 87 C0 encoding). In 64-bit mode NOP is now an explicit NOP with no other interpretation.
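The effect of the zero-extension rule can be sketched with a small model of register writes. This is only an illustration under the rules described above; the function names `write16`, `write32`, and `xchg_eax_eax` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def write16(reg64: int, value: int) -> int:
    """16-bit write: only bits 0-15 change, the upper 48 bits are preserved."""
    return (reg64 & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write32(reg64: int, value: int) -> int:
    """32-bit write in 64-bit mode: the result is zero-extended to 64 bits."""
    return value & 0xFFFFFFFF


def xchg_eax_eax(rax: int) -> int:
    """Model XCHG EAX,EAX (87 C0) in 64-bit mode: read EAX, write it back to EAX."""
    return write32(rax, rax & 0xFFFFFFFF)


rax = 0x1122334455667788
print(hex(xchg_eax_eax(rax)))      # 0x55667788: upper half of RAX is cleared
print(hex(write16(rax, 0xABCD)))   # 0x112233445566abcd: upper bits survive
```

Because the "exchange" changes RAX whenever its upper half is nonzero, it can't share an encoding with an instruction that is guaranteed to change nothing.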


As for the FNOP instruction, it's not entirely clear how the FPU treated it on the original 8087, but I'm pretty sure it wasn't handled as an explicit no-operation either. At least one old Intel manual, the ASM86 Language Reference Manual, does document it as doing something with no effect ("stores the stack top to the stack top"). From its position in the opcode map it looks like it might be an alias for either FST ST or FLD ST, both of which would copy the top of the stack to the top of the stack. However, it did get some special treatment: it executed in an average of 13 cycles instead of the average 18 or 20 cycles for a stack-to-stack FST or FLD instruction respectively. If it were being treated as a no-operation instruction I'd expect it to be even faster, as there are a number of 8087 instructions that can execute in half the time.

More importantly, the FNOP instruction behaves differently than NOP because of how FPU instructions used to be implemented on Intel processors. The CPU itself didn't support floating-point arithmetic. Instead, these duties were offloaded onto an optional floating-point coprocessor, originally the 8087. One of the nice things about the coprocessor was that it executed instructions in parallel with the CPU. However, this means that the CPU sometimes needs to wait for the FPU to finish an operation. The CPU automatically waits for the coprocessor to finish executing the previous instruction before giving it another one, but a program would need to explicitly wait (using a WAIT instruction) before it could read a result that the coprocessor wrote to memory.

Because the coprocessor worked in parallel this also meant that if an FPU instruction generated a floating-point exception, by the time it detected this the CPU would already have moved on to execute the next instruction. Normally when an instruction generates an exception on the CPU, it's handled while that instruction is still being executed, but when an FPU instruction generates an exception the CPU has already completed executing that instruction by handing it off to the FPU. Instead of interrupting the CPU and delivering the floating-point exception asynchronously, the CPU is only notified when it waits for the coprocessor, either explicitly or implicitly.

In modern processors the FPU is no longer a coprocessor; it's an integral part of the CPU. This means programs no longer have to wait for the FPU to write values to memory. However, how FPU exceptions are handled hasn't changed. (It turns out that delivering exceptions immediately is difficult to implement on modern CPUs, so they took advantage of the one case where they didn't have to.) So if a previous FPU instruction generated an undelivered floating-point exception, a NOP will leave the exception undelivered, while FNOP, because it's an FPU instruction, will do an implicit "wait" that results in the floating-point exception being delivered.

This example demonstrates the difference:

FLD1       ; push 1.0 onto the FPU stack
FLDZ       ; push 0.0
FDIV       ; divide 1.0 by 0.0
NOP        ; does nothing
NOP        ; does nothing
FNOP       ; signals a FP zero-divide exception and then does nothing
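The same sequence can be traced with a toy model of deferred delivery. This is a simplified sketch, not a description of real hardware: it assumes the zero-divide exception is unmasked, and the class `X87Model` and function `run_trace` are hypothetical names.

```python
class X87Model:
    """Toy model: an x87 fault is latched, then raised by the next FPU instruction."""

    def __init__(self):
        self.pending = None   # latched but not yet delivered exception
        self.stack = []

    def _check_pending(self, insn):
        # Every FPU instruction in this model first delivers any latched fault.
        if self.pending is not None:
            fault, self.pending = self.pending, None
            raise FloatingPointError(f"{fault} delivered at {insn}")

    def fld(self, value):
        self._check_pending("FLD")
        self.stack.append(value)

    def fdiv(self):
        self._check_pending("FDIV")
        divisor, dividend = self.stack[-1], self.stack[-2]
        if divisor == 0.0:
            self.pending = "zero-divide"   # latched only; nothing is raised here
            return
        self.stack[-2:] = [dividend / divisor]

    def nop(self):
        pass  # CPU-only instruction: never consults the FPU

    def fnop(self):
        self._check_pending("FNOP")


def run_trace():
    fpu = X87Model()
    steps = [
        ("FLD1", lambda: fpu.fld(1.0)),
        ("FLDZ", lambda: fpu.fld(0.0)),
        ("FDIV", fpu.fdiv),
        ("NOP", fpu.nop),
        ("NOP", fpu.nop),
        ("FNOP", fpu.fnop),
    ]
    log = []
    for name, step in steps:
        try:
            step()
            log.append(f"{name}: ok")
        except FloatingPointError as exc:
            log.append(f"{name}: {exc}")
    return log


for line in run_trace():
    print(line)
```

In the model, both NOPs pass silently and the fault only surfaces at FNOP, mirroring the comments in the assembly listing.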
Ross Ridge
  • And I thought I knew everything about NOP :) Nicely written. – Sedat Kapanoglu Jul 31 '14 at 09:13
  • Yah, I only stumbled across the 64-bit NOP part after searching the web in order to verify some other details. – Ross Ridge Jul 31 '14 at 15:00
  • Thanks for your response - this is a great answer to my question. – Michael Burge Aug 03 '14 at 05:37
  • So [FNOP](http://www.felixcloutier.com/x86/FNOP.html) and [FWAIT](http://www.felixcloutier.com/x86/WAIT:FWAIT.html) both wait for x87 exceptions? The manual doesn't mention that effect of FNOP, and implies that's *all* FWAIT does. Is that really correct, that there's no difference between FNOP and FWAIT? If not, then we're back to the question of why FNOP exists. – Peter Cordes Oct 10 '16 at 09:05
  • @PeterCordes Well, the question was what was the difference between NOP and FNOP. There isn't much difference between FNOP and FWAIT, aside from the fact that FNOP tells the FPU to do nothing while FWAIT doesn't tell the FPU to do anything. Complicating matters was the fact that with the original 8087 the FWAITs were never actually implicit; they were encoded as the first byte of the instruction by the assembler. So FNOP is documented as being encoded `98 D9 D0` in the old manuals instead of `D9 D0` as it is now. (Hmm... it also appears that FNOP is actually `FST ST`.) – Ross Ridge Oct 10 '16 at 16:19
  • Ah, right. Despite the F name, FWAIT is purely a CPU instruction. IDK why Intel wanted to spend an opcode on sending a NOP to the 8087, but maybe they needed it for testing / verification purposes? (And didn't have the transistor budget for a separate testing mode?) – Peter Cordes Oct 10 '16 at 16:24
  • @PeterCordes After looking at the old manuals I found out I was wrong and FNOP is actually just the FST instruction where the source operand is ST, so it sets ST to ST. So it doesn't waste any encoding space, though it does seem to be specially treated as it executes in a few fewer cycles (13 vs. 18). – Ross Ridge Oct 10 '16 at 16:31
  • According to Intel's current manuals, `fst st0, st0` is `DD D0`. The opcode map in *A.5.2.2 Escape Opcodes with D9 as First Byte* in vol2 shows FNOP (`D9 D0`) as the only entry on that row, and `D9 D1..7` (and `D8..F`) unused. Did old CPUs use that encoding for `FST ST(i), ST(0)`? (It would make sense, it's right next to `FLD ST(0), ST(i)` at `D9 C0..7` and FXCH `D9 C8..F`.) Anyway, if so that's neat, since FNOP is another special case of a register-in-opcode encoding like XCHG AX,AX. And there are other x87 NOPs, like `FXCH ST(0), ST(0)`. – Peter Cordes Oct 10 '16 at 17:11
  • @PeterCordes Oops... it looks like I didn't look closely enough. I just looked at the `D0 + i` part of the encoding and didn't notice the first byte (after the FWAIT) was different between FST and FNOP. Hmm... I'm guessing that it's an alias for either the FST or FLD instruction, as the manual (ASM86 Language Reference Manual) says that "This operation stores the stack top to the stack top and thus effectively performs no operation". Hmm... now I'm not sure whether FNOP was actually an explicit NOP or not. – Ross Ridge Oct 10 '16 at 17:28