AMD64 -- nopw assembly instruction?

Question

In this compiler output, I'm trying to understand how machine-code encoding of the nopw instruction works:

00000000004004d0 <main>:
  4004d0:       eb fe                   jmp    4004d0 <main>
  4004d2:       66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
  4004d9:       1f 84 00 00 00 00 00

There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that 66 .. codes are multi-byte expansions. I feel I could probably get a better answer here than I would by trying to grok the opcode list for a few hours.

That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:

long i = 0;

main() {
    recurse();
}

recurse() {
    i++;
    recurse();
}

When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in the main() without calling the recurse() function.

editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.

Maybe! I don't really know! That's the beauty of it all! WHEEE. Really, though, I get from the linked that the processor WOULD be loading a block as one instruction for speed optimization, though thanks to the `jmp`, it doesn't. I just get the meaning of it. I know what 0x90 is, but I don't know what's going on with `66 .. ..`, or why it is 72 bits long. — Jeff Ferland, Jan 25 '11 at 20:41
It's not the reason here, but you may find [My, what strange NOPs you have! - The Old New Thing](http://blogs.msdn.com/b/oldnewthing/archive/2011/01/12/10114521.aspx) an interesting read. — ephemient, Jan 25 '11 at 21:49
nopl: https://stackoverflow.com/questions/12559475/what-does-nopl-do-in-x86-system/12564044 — Ciro Santilli OurBigBook.com, Dec 02 '18 at 20:06

score 28 · Accepted Answer · edited Jul 23 '16 at 01:56

The 0x66 bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.

The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).

0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.

0x84 is the ModRM byte, which in this case encodes an addressing mode that uses 5 more bytes (a SIB byte plus a 32-bit displacement).

Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
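As a rough illustration of the byte-by-byte breakdown above, here is a minimal C sketch that splits the 14 bytes from the question into prefixes, opcode, ModRM, SIB and displacement. The struct `nop_fields` and the function `decode_long_nop` are hypothetical names made up for this example. It only recognizes the 0x66 and 0x2e prefixes and the `0f 1f` opcode, so treat it as a sketch under those assumptions, not a general x86 decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for this example: the pieces of one long NOP. */
struct nop_fields {
    size_t   prefixes;      /* count of leading 0x66 / 0x2e bytes        */
    unsigned mod, reg, rm;  /* the three bit fields of the ModRM byte    */
    int      has_sib;       /* 1 if a SIB byte follows the ModRM byte    */
    size_t   disp_bytes;    /* size of the displacement: 0, 1 or 4       */
    size_t   length;        /* total length, or 0 if not a 0f 1f NOP     */
};

/* Assumption: only 0x66 and 0x2e are treated as prefixes (no REX etc.). */
struct nop_fields decode_long_nop(const uint8_t *p, size_t n)
{
    struct nop_fields f = {0};
    size_t i = 0;

    assert(p != NULL);

    /* Skip operand-size (0x66) and CS-override (0x2e) prefixes. */
    while (i < n && (p[i] == 0x66 || p[i] == 0x2e))
        i++;
    f.prefixes = i;

    /* Expect the two-byte opcode 0f 1f followed by a ModRM byte. */
    if (n - i < 3 || p[i] != 0x0f || p[i + 1] != 0x1f)
        return f;               /* length stays 0 */
    i += 2;

    uint8_t modrm = p[i++];
    f.mod = modrm >> 6;
    f.reg = (modrm >> 3) & 7;
    f.rm  = modrm & 7;

    unsigned base = f.rm;
    if (f.mod != 3 && f.rm == 4) {      /* rm=100 selects a SIB byte */
        if (i >= n)
            return f;
        f.has_sib = 1;
        base = p[i++] & 7;
    }

    if (f.mod == 1)
        f.disp_bytes = 1;               /* disp8  */
    else if (f.mod == 2 || (f.mod == 0 && base == 5))
        f.disp_bytes = 4;               /* disp32 */

    if (n - i < f.disp_bytes)
        return f;                       /* truncated input */
    f.length = i + f.disp_bytes;
    return f;
}
```

Fed the bytes at 4004d2 from the question, this should report 6 prefix bytes, mod=2, rm=4, a SIB byte, a 4-byte displacement, and a total length of 14 bytes.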

AMD K8 decoders in Agner Fog's microarch pdf:

Each of the instruction decoders can handle three prefixes per clock cycle. This means that three instructions with three prefixes each can be decoded in the same clock cycle. An instruction with 4 - 6 prefixes takes an extra clock cycle to decode.

Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is -falign-functions=16. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
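The alignment arithmetic can be checked with a few lines of C. This is only a sketch: `padding_for` is a hypothetical helper, not something gcc or the assembler exposes, and it assumes the alignment is a power of two given as its log2 (as in `.p2align 4`).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed to move `addr` up to the next 2^log2_align
 * boundary (0 if it is already aligned). Hypothetical helper. */
uint64_t padding_for(uint64_t addr, unsigned log2_align)
{
    assert(log2_align < 64);
    uint64_t mask = ((uint64_t)1 << log2_align) - 1;
    return (0 - addr) & mask;   /* same as (align - addr % align) % align */
}
```

For the listing in the question, the `jmp` ends at 0x4004d2, and `padding_for(0x4004d2, 4)` gives 14, which matches the length of the single long NOP that runs up to 0x4004e0.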

The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.

You can find the details on the instruction encoding in AMD's tech ref documents:

http://developer.amd.com/documentation/guides/pages/default.aspx#manuals

Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).

edited Jul 23 '16 at 01:56

Peter Cordes

answered Jan 25 '11 at 21:43

Michael Burr

Re the ModRM byte meaning... http://ref.x86asm.net/coder64.html#x0F1F lists the ModRM byte as being used for Hintable NOPs, with references to this: 1. See U.S. Patent 5,701,442 2. sandpile.org -- IA-32 architecture -- opcode groups. I have not checked those, but in case you care. – Bahbar Jan 26 '11 at 08:21
It's a NOP, so the mod/rm byte doesn't *do* anything. It's part of the instruction as a way to allow a large range of instructions lengths in a way that the decoders can decode quickly. Decoding many prefixes is slow on some CPUs, so just repeating the `66` operand-size prefix 5 more times is a lot worse than a mod/rm that codes for an addressing mode that uses a SIB + disp32. – Peter Cordes Jul 23 '16 at 01:42
So I can understand why you'd want to push the boundary of the next function to 16 bytes, but why do you have to fill it with valid code? If a function ends at address 0x000d (that's the last byte of retq for example), you need 2 bytes of padding to get you to 0x0010. Why not just put zeroes? Nothing is going to execute after the retq. The only thing I can guess is so disassemblers can parse it, but then why not just use 0x90 for one byte nop? Why does the padding have to be an efficient use of memory if it's never going to get executed? – stu Aug 06 '18 at 21:32
@stu: You don't have to fill it with valid code. GCC does because it emits a `.p2align` directive, and doesn't override the fill pattern when used between instead of inside functions. So GAS expands it to long-NOPs because that's what it always does. MSVC for example pads between functions with `0xCC` bytes (`int3` instructions), so if execution ever happens to land there, it takes a debug exception. That's probably a good thing, and better than a "nop sled" of 0x90 bytes which could possibly help a ROP attack get execution into `system()` with RDI pointing at a command string... – Peter Cordes Feb 22 '22 at 08:50
@stu: But yes, filling with something that disassembles cleanly is good. Something that the CPU won't stall when doing parallel fetch/decode of the bytes that include the `ret` or `jmp` at the bottom of the function is also a good thing, although I'm not sure if that's a real concern or not. Decode might know not to look past an unconditional jump, even in the early pre-decode stage when it's finding instruction boundaries. But using the same logic as for aligning tops of loops means fewer special cases in compilers. – Peter Cordes Feb 22 '22 at 08:54
Seems to me a sane safe and quick thing to do would be to fill with unconditional branches to the ret/rts/whatever that is being filled after. – stu Feb 23 '22 at 18:50

score 3 · Answer 2 · answered Jan 25 '11 at 23:23

The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
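To make the "longest NOP that fits" idea concrete, here is a minimal C sketch of a greedy padder. It is an assumption-laden illustration, not GAS's actual algorithm: `fill_with_nops` and `nop_table` are hypothetical names, and the table holds the 1 to 9 byte multi-byte NOP encodings as I recall them from Intel's instruction reference for NOP. The real assembler's choices depend on its tuning options, and in the question's output it produced a single 14-byte NOP by adding more prefixes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multi-byte NOP encodings, indexed by (length - 1). */
static const uint8_t nop_table[9][9] = {
    {0x90},
    {0x66, 0x90},
    {0x0f, 0x1f, 0x00},
    {0x0f, 0x1f, 0x40, 0x00},
    {0x0f, 0x1f, 0x44, 0x00, 0x00},
    {0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00},
    {0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fill buf[0..n) with NOPs, always taking the longest table entry that
 * still fits. Returns the number of NOP instructions written. */
size_t fill_with_nops(uint8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    size_t count = 0;
    while (n > 0) {
        size_t len = n < 9 ? n : 9;
        memcpy(buf, nop_table[len - 1], len);
        buf += len;
        n -= len;
        count++;
    }
    return count;
}
```

With this table, 14 bytes of padding would come out as a 9-byte NOP followed by a 5-byte NOP, that is, two instructions instead of the one heavily prefixed instruction seen in the question.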

answered Jan 25 '11 at 23:23

R.. GitHub STOP HELPING ICE

score -1 · Answer 3 · answered Jan 25 '11 at 20:20

I would guess this is just the branch-delay instruction.

answered Jan 25 '11 at 20:20

Oliver Charlesworth

It is not just a branch-delay instruction. It's also used as padding. Try disassembling a C program with objdump -d. You will find a lot of this kind of instruction following ret instructions – Alex8752 May 19 '20 at 13:51

score -3 · Answer 4 · answered Jan 25 '11 at 21:28

I believe that the nopw is junk - i is never read in your program, and there is thus no need to increment it.

answered Jan 25 '11 at 21:28

Arne Bergene Fossaa

`i` gave me a convenient way to check the stack size when it failed. Gdb, as far as my limited knowledge goes, doesn't have a "print size of stack" key. It is further interesting to watch the compiler remove the incrementing of it once the optimization level is ratcheted up. The program is intentionally "insane." – Jeff Ferland Jan 25 '11 at 21:44
My point was that the compiler optimized it away - since you never read i. – Arne Bergene Fossaa Jan 26 '11 at 00:05
The question isn't about that, though. The point of the question is why the `nop` (`nopw` here) comes out that way. The standard `nop` is 0x90 and just repeated. Putting `i` in there as an unused variable was purposeful and externally useful even if it is not touched in the code. – Jeff Ferland Jan 26 '11 at 14:18
