What does "nop dword ptr [rax+rax]" x64 assembly instruction do?

Question

I'm trying to understand the x64 assembly optimization that is done by the compiler.

I compiled a small C++ project as Release build with Visual Studio 2008 SP1 IDE on Windows 8.1.

And one of the lines contained the following assembly code:

B8 31 00 00 00   mov         eax,31h
0F 1F 44 00 00   nop         dword ptr [rax+rax]

And here's a screenshot:

As far as I know, nop by itself does nothing, but I've never seen it with an operand like that.

Can someone explain what it does?

It is a multi-byte NOP. The instruction set reference explains this: http://www.felixcloutier.com/x86/NOP.html — Michael Petch, May 16 '17 at 01:45
@MichaelPetch: Thanks. Just curious, what is the purpose of adding that instruction in a `Release` build? — c00000fd, May 16 '17 at 01:51
Usually for alignment. Often you'll see it before loops to align them on a 16- or 32-byte boundary (16 is often the default). This can improve performance of the loop. — Michael Petch, May 16 '17 at 01:52
If you look, the address 7ff673c0146b is the start of the NOP instruction. It is 5 bytes long, so the instruction after the NOP will start at 7ff673c01470, which is 16-byte aligned. Pretty good chance the next instruction is the start of the loop. — Michael Petch, May 16 '17 at 01:56
Possible duplicate of [What is faster: JMP or string of NOPs?](http://stackoverflow.com/questions/6776385/what-is-faster-jmp-or-string-of-nops) — Raymond Chen, May 16 '17 at 02:47
Possible duplicate of [gdb - nop with extra data, why?](http://stackoverflow.com/questions/22486415/gdb-nop-with-extra-data-why) — phuclv, May 16 '17 at 02:47
BTW, what's up with this trend of good answers appearing only in comments? I'll just write the answer as an answer, but credit goes to @MichaelPetch for the actual content. — BeeOnRope, May 17 '17 at 23:44
why wouldn't you use 90 90 90 90 90 if you needed to waste 5 bytes? Because 0F... is faster? Then why do you mandate that it needs to be 0F 1F 44 00 00, why can't it be 0F plus any four bytes, which would allow e.g. a marker in the bytecode? In order to make room for further x86 enhancements, right? Then why is the mnemonic so strange, why is the mnemonic not just NOP or NOP5byte? — Thorsten Staerk, May 01 '20 at 14:35

Glenn Slayden · Accepted Answer · 2019-03-03T06:43:11.490

In a comment elsewhere on this page, Michael Petch points to a web page which describes the Intel x86 multi-byte NOP opcodes. The page has a table of useful information, but unfortunately the HTML is messed up so you can't read it. Here is some information from that page, plus that table presented in a readable form:

Multi-Byte NOP (http://www.felixcloutier.com/x86/NOP.html)
The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction.

The multi-byte NOP instruction performs no operation on supported processors and generates undefined opcode exception on processors that do not support the multi-byte NOP instruction.

The memory operand form of the instruction allows software to create a byte sequence of “no operation” as one instruction.

For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode ~~and 64-bit mode~~) are: [my edit: in 64-bit mode, write rax instead of eax.]
Length    Assembly                                     Byte Sequence
-------   ------------------------------------------   --------------------------
1 byte    nop                                          90
2 bytes   66 nop                                       66 90
3 bytes   nop dword ptr [eax]                          0F 1F 00
4 bytes   nop dword ptr [eax + 00h]                    0F 1F 40 00
5 bytes   nop dword ptr [eax + eax*1 + 00h]            0F 1F 44 00 00
6 bytes   66 nop word ptr [eax + eax*1 + 00h]          66 0F 1F 44 00 00
7 bytes   nop dword ptr [eax + 00000000h]              0F 1F 80 00 00 00 00
8 bytes   nop dword ptr [eax + eax*1 + 00000000h]      0F 1F 84 00 00 00 00 00
9 bytes   66 nop word ptr [eax + eax*1 + 00000000h]    66 0F 1F 84 00 00 00 00 00

Note that the technique for selecting the right byte sequence--and thus the desired total size--may differ according to which assembler you are using.

For example, the following two lines of assembly taken from the table are ostensibly similar:

nop dword ptr [eax + 00h]
nop dword ptr [eax + 00000000h]

These differ only in the number of leading zeros, and some assemblers may make it hard to disable their "helpful" feature of always encoding the shortest possible byte sequence, which could make the second expression inaccessible.

For the multi-byte NOP situation, you don't want this "help" because you need to make sure that you actually get the desired number of bytes. So the issue is how to specify an exact combination of mod and r/m bits that ends up with the desired disp size--but via instruction mnemonics alone. This topic is complex, and certainly beyond the scope of my knowledge, but Scaled Indexing, MOD+R/M and SIB might be a starting place.
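
As a rough illustration of how the mod and r/m bits determine the total size, here is a small Python sketch that decodes the ModRM byte of the encodings in the table. The helper name decode_long_nop is made up for this example, and it deliberately handles only the forms listed above (it ignores special cases such as mod=00 with r/m=101), so treat it as a sketch, not a real disassembler.

```python
# Sketch only: decode the ModRM byte of a long NOP (0F 1F /0).
# decode_long_nop is a hypothetical helper, not part of any library.
# It covers just the table's forms; special cases such as
# mod=00 with r/m=101 (disp32 / RIP-relative) are not handled.

def decode_long_nop(encoding: bytes) -> dict:
    # An optional 66h operand-size prefix adds one byte.
    prefix_len = 1 if encoding[0] == 0x66 else 0
    body = encoding[prefix_len:]
    if body[:2] != b"\x0f\x1f":
        raise ValueError("not a 0F 1F long NOP")

    modrm = body[2]
    mod = modrm >> 6          # 00 = no disp, 01 = disp8, 10 = disp32
    rm = modrm & 0b111        # 100 = a SIB byte follows
    if mod == 0b11:
        raise ValueError("register form not handled in this sketch")

    has_sib = rm == 0b100
    disp_size = {0b00: 0, 0b01: 1, 0b10: 4}[mod]
    return {
        "mod": mod,
        "rm": rm,
        "has_sib": has_sib,
        "disp_size": disp_size,
        # prefix + two opcode bytes + ModRM + optional SIB + displacement
        "length": prefix_len + 2 + 1 + int(has_sib) + disp_size,
    }


# The instruction from the question: 0F 1F 44 00 00
info = decode_long_nop(bytes.fromhex("0F1F440000"))
print(info)  # mod=1 (disp8), rm=4 (SIB follows), length=5
```

For the question's bytes, ModRM is 44h (mod=01, r/m=100), so a SIB byte and a one-byte displacement follow, which is where the [rax+rax*1+00h] form and the 5-byte length come from.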

Now as I know you were just thinking, if you find it difficult or impossible to coerce your assembler's cooperation via instruction mnemonics you can always just resort to db ("define bytes") as a simple no-fuss alternative which is, um, guaranteed to work.
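
To make the db route concrete, here is a minimal Python sketch that turns the byte sequences from the table above into db lines. The names RECOMMENDED_NOPS and db_line are invented for this example, and the 0x.. literal style is an assumption (NASM-style); adjust it to whatever your assembler expects.

```python
# Sketch only: emit the table's recommended NOP encodings as db directives.
# RECOMMENDED_NOPS and db_line are hypothetical names for this example.
# The 0x.. literal syntax is an assumption (NASM-style).

RECOMMENDED_NOPS = {
    1: "90",
    2: "66 90",
    3: "0F 1F 00",
    4: "0F 1F 40 00",
    5: "0F 1F 44 00 00",
    6: "66 0F 1F 44 00 00",
    7: "0F 1F 80 00 00 00 00",
    8: "0F 1F 84 00 00 00 00 00",
    9: "66 0F 1F 84 00 00 00 00 00",
}


def db_line(length: int) -> str:
    """Return a db directive holding the recommended NOP of `length` bytes."""
    hex_bytes = RECOMMENDED_NOPS[length].split()
    return "db " + ", ".join("0x" + b for b in hex_bytes)


for n in sorted(RECOMMENDED_NOPS):
    print(f"{db_line(n):<60} ; {n}-byte NOP")
```

Since the bytes are spelled out, the assembler has no chance to pick a shorter encoding behind your back.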

I should clarify my previous comment, which is the last comment in the [archived chat](https://chat.stackoverflow.com/rooms/172226/discussion-on-answer-by-glenn-slayden-what-does-nop-dword-ptr-raxrax-x64-as); I'm not sure why alignment fill sequences should need to be interpretable by the CPU in the first place when there are no code paths that leads to them. Because otherwise, isn't it perfectly fine--and in fact, better, against spurious `jmp` landings--to just fill with `db 90 90 90...`? — Glenn Slayden, Mar 03 '19 at 06:59
Depending on the assembler, it doesn't know whether a padding directive is going to be executed or not. But between functions, MSVC does I think fill with repeated `0xcc` (`int3`). If you know execution shouldn't be somewhere, it makes more sense to trap than to silently fall through into the next function. — Peter Cordes, Mar 03 '19 at 07:36
@PeterCordes Doh! my mistake; I meant to say `CC` instead of `90`. It's embarrassing because I previously discussed that exact point [here](https://stackoverflow.com/a/48255562/147511)... — Glenn Slayden, Mar 03 '19 at 11:51
@PeterCordes Nitpicking “next function,” I’m sure you mean “next substantial instruction,” since (i.e., loop) alignment sequences can occur fully within a single function... — Glenn Slayden, Jun 20 '20 at 20:37
Within a function, the padding usually is actually executed, not jumped over, so that has to be NOP not INT3 (usually one or two long nops, not `90 90 90 ...`). So the only place MSVC can prevent execution from falling into is the top of the next function. That's why I phrased it that way. — Peter Cordes, Jun 20 '20 at 20:47
Of course it's usually irrelevant; only a bad indirect jump target could get you there. It might possibly affect speculative fetch/decode, stopping it from wasting power if it doesn't decode past a CC but would past a 90. Usually a `jmp` or `ret` will stop decode anyway, but a func could end with an indirect tailcall `jmp reg`. If no better prediction is available, some CPUs will predict the target as the next instruction. — Peter Cordes, Jun 20 '20 at 20:48
@JoeHuang [**NOP**](https://en.wikipedia.org/wiki/NOP_(code)) means "no operation." For a microprocessor, it is an instruction to do nothing. — Glenn Slayden, Jan 09 '23 at 22:06

score 13 · Answer 2 · edited May 23 '17 at 11:55

As pointed out in the comments, it is a multi-byte NOP usually used to align the subsequent instruction to a 16-byte boundary, when that instruction is the first instruction in a loop.

Such alignment can help with instruction fetch bandwidth, because instruction fetch often happens in units of 16 bytes, so aligning the top of a loop gives the greatest chance that the decoding occurs without bottlenecks.

Such alignment is arguably less important than it once was, now that the loop buffer and the uop cache, both less sensitive to alignment, have been introduced. In some cases this optimization may even be a pessimization, especially when the loop executes very few times.
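
The padding arithmetic is easy to check. Here is a small Python sketch using the address quoted in the comments under the question (7ff673c0146b as the start of the NOP); the function name padding_to_align is made up for this example, and it assumes the alignment is a power of two.

```python
# Sketch only: how many padding bytes reach the next aligned address.
# padding_to_align is a hypothetical helper; `alignment` is assumed
# to be a power of two.

def padding_to_align(address: int, alignment: int = 16) -> int:
    """Bytes needed to advance `address` to the next multiple of `alignment`."""
    return -address & (alignment - 1)


nop_start = 0x7FF673C0146B          # address of the NOP, from the comments
pad = padding_to_align(nop_start)   # 5, i.e. one 5-byte NOP
next_insn = nop_start + pad

print(pad, hex(next_insn))          # 5 0x7ff673c01470
```

So a single 5-byte NOP is exactly what it takes to put the following instruction on a 16-byte boundary.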

score 1 · Answer 3 · answered Mar 01 '20 at 08:29

This kind of code alignment is applied when jump instructions branch backward, from a higher address to a lower one: jmp short (0EBh XX) and jmp near (0E9h XX XX XX XX), where the displacement XX is a negative signed number in both cases. The compiler therefore aligns the chunk of code that is the target of the jump to a 10h-byte (16-byte) boundary, which can speed up execution.
