Long nop instructions in nasm

Question

Does nasm have any built-in way to emit long-nop (aka multi-byte nop) instructions of a given length?

See [smartalign in the nasm manual](http://www.nasm.us/xdoc/2.11.08/html/nasmdoc5.html#section-5.2) — Jester, Jan 13 '18 at 23:57
@Jester - yup, I'm aware of this and use it - but it doesn't give direct access to the nop instructions: they are only inserted indirectly via `align` directives. I'd like to use them directly, e.g., "insert a 2-byte nop here". — BeeOnRope, Jan 14 '18 at 00:01
You can use the macros directly, e.g. `db __ALIGN_32BIT_2B__` should insert a 2 byte NOP into 32 bit code. — Jester, Jan 14 '18 at 00:15
@Jester - thanks - where do I see those? I don't find that string in the manual. — BeeOnRope, Jan 14 '18 at 00:17
It's in the macro package itself, as such it is undocumented and could change. — Jester, Jan 14 '18 at 00:19
A quick peek at [felixcloutier](http://www.felixcloutier.com/x86/NOP.html)'s reference suggests the official notation is `nop [rm16/rm32]`. If nasm accepts this, you could create a set of macros yourself, so you don't need to rely on "undocumented" external macros. (Or - duh, belated thought - copy their definitions...) — Jongware, Jan 14 '18 at 11:55

BeeOnRope · Answer 1 · 2018-01-18T06:17:01.960

The answer seems to be no: out of the box, there is no official way to emit these long-nops in nasm¹.

So I just wrote my own macros for 1 to 9 bytes based on the recommended sequences from the Intel manuals²:

;; long-nop instructions: nopX inserts a nop of X bytes
;; see "Table 4-12. Recommended Multi-Byte Sequence of NOP Instruction" in
;; "Intel® 64 and IA-32 Architectures Software Developer’s Manual" (325383-061US)
%define nop1 nop                                                     ; just a nop, included for completeness
%define nop2 db 0x66, 0x90                                           ; 66 NOP
%define nop3 db 0x0F, 0x1F, 0x00                                     ;    NOP DWORD ptr [EAX]
%define nop4 db 0x0F, 0x1F, 0x40, 0x00                               ;    NOP DWORD ptr [EAX + 00H]
%define nop5 db 0x0F, 0x1F, 0x44, 0x00, 0x00                         ;    NOP DWORD ptr [EAX + EAX*1 + 00H]
%define nop6 db 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00                   ; 66 NOP DWORD ptr [EAX + EAX*1 + 00H]
%define nop7 db 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00             ;    NOP DWORD ptr [EAX + 00000000H]
%define nop8 db 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00       ;    NOP DWORD ptr [EAX + EAX*1 + 00000000H]
%define nop9 db 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 ; 66 NOP DWORD ptr [EAX + EAX*1 + 00000000H]

I've also added these to the nasm-utils project, so that's one way to get them if you have the same need.
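A minimal usage sketch, assuming the `nopX` definitions above are in scope (two are repeated here so the fragment assembles by itself). The label `pad_demo` and the surrounding instructions are hypothetical, chosen only for illustration:

```nasm
bits 64

%define nop2 db 0x66, 0x90                   ; copied from the macros above
%define nop5 db 0x0F, 0x1F, 0x44, 0x00, 0x00 ; copied from the macros above

section .text
pad_demo:               ; hypothetical label, for illustration only
    xor     eax, eax
    nop5                ; expands to the 5-byte db sequence
    nop2                ; expands to the 2-byte db sequence
    ret
```

Assembling with a listing file (for example `nasm -f elf64 -l demo.lst demo.asm`, file names assumed) shows the emitted bytes next to each source line, which is an easy way to confirm the padding length.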

¹Although as Jester points out, you can dig into the internals to find some macros used to implement the "smart align" feature.

²For the record, I believe these first appeared in the AMD manuals and that eventually Intel adopted the same recommended sequences.
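For comparison, the documented route through the "smart align" package looks roughly like this. It is only a sketch: the label and loop body are hypothetical, and the padding bytes are picked by the package rather than requested by length, which is exactly the limitation discussed in the question.

```nasm
%use smartalign         ; standard macro package described in the NASM manual
alignmode p6            ; ask for the 0F 1F long-nop family as padding

bits 64
section .text
    xor     ecx, ecx
    align   16          ; padding bytes are chosen by the package, not by us
loop_top:               ; hypothetical label
    inc     ecx
    cmp     ecx, 10
    jb      loop_top
    ret
```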

Given the operand size prefix 66h on the *nop6* and *nop9* macros, I think the comments on these lines should read `WORD PTR` instead of `DWORD PTR`. — Sep Roland, Jan 14 '18 at 22:51
@SepRoland - well in that case, I shouldn't indicate the 66 at all, since it is implied by `WORD`. Basically `66H, NOP DWORD ptr [EAX + EAX*1 + 00H]` and `NOP WORD ptr [EAX + EAX*1 + 00H]` are two ways of writing the same thing, and if you explicitly encode a `db 0x66` followed by the `DWORD` version you get the identical bytes. The form I show above was copied verbatim from the Intel manual. Note that the comments aren't even correct for 64-bit mode: the nops with addresses would all be 1 byte longer if you actually assembled the commented version since they'd have ... — BeeOnRope, Jan 14 '18 at 23:35
... an extra 0x67 prefix since 32-bit addresses (like `[eax ...]`) are not the default. The byte-wise encoding is fine in either mode though. — BeeOnRope, Jan 14 '18 at 23:36
Does x86-64 allow a 10-byte NOP that decodes efficiently on all CPUs, using `66 REX nopl ...`? Even on CPUs where the `0F` escape byte counts as a prefix (Silvermont), it would still only be 3 total prefixes. — Peter Cordes, Jan 15 '18 at 05:44
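On the mnemonic-versus-bytes point from the comments above: as far as I know, nasm does accept the `nop r/m` form directly, and its documented `byte`/`dword` keywords inside the brackets force a zero displacement of that size. A sketch for 64-bit mode, using `rax` rather than `eax` so that no 0x67 address-size prefix is needed. The comments give the encoding each line is intended to reproduce, so check them against a listing rather than trusting them:

```nasm
bits 64
section .text
    nop     dword [rax]        ; intended: 0F 1F 00 (same bytes as nop3)
    nop     dword [byte rax]   ; intended: 0F 1F 40 00 (nop4), forced 8-bit zero displacement
    nop     dword [dword rax]  ; intended: 0F 1F 80 00 00 00 00 (nop7), forced 32-bit zero displacement
    nop     word  [rax]        ; intended: 66 0F 1F 00, operand-size prefix implied by WORD
```

The `db`-based macros remain the safer choice when an exact length matters, since the assembler is free to pick a shorter encoding for the mnemonic form unless the displacement size is forced.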

score 2 · Answer 2 · answered Jan 24 '18 at 13:35

Just quoting https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf, page 124 (3-28), from December 2017:

3.5.1.10 Using NOPs

Code generators generate a no-operation (NOP) to align instructions. Examples of NOPs of different lengths in 32-bit mode are shown below:

1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)

These are all true NOPs, having no effect on the state of the machine except to advance the EIP.

Because NOPs require hardware resources to decode and execute, use the fewest number to achieve the desired padding.

The one byte NOP:[XCHG EAX,EAX] has special hardware support. Although it still consumes a µop and its accompanying resources, the dependence upon the old value of EAX is removed.

This µop can be executed at the earliest possible opportunity, reducing the number of outstanding instructions and is the lowest cost NOP.

The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.

Try to observe the following NOP generation priority:

• Select the smallest number of NOPs and pseudo-NOPs to provide the desired padding.
• Select NOPs that are least likely to execute on slower execution unit clusters.
• Select the register arguments of NOPs to reduce dependencies.

It looks like these sequences avoid the long-nop `0F 1F modrm ...` introduced with P6, instead using `lea`. LEA used that way is a "true nop" in the architectural sense, but not the microarchitectural sense, while `0F 1F` runs the same as `90` short-NOP, using no execution port and not lengthening the dep chain involving any register. In x86-64 code, you should always use `0F 1F` NOPs instead of LEA, or in 32-bit code that also uses CMOV or other P6 features. — Peter Cordes, Jan 24 '18 at 18:03

score -2 · Answer 3 · answered Jan 14 '18 at 11:41

Note that code wise there is only one NOP instruction in the Intel processors. This has code 0x90 and it's just one byte.

The longer "nops" are instructions that do nothing, such as XCHG of a register with itself. For example, for a 2-byte NOP, you write:

XCHG AL, AL

Which is encoded as:

86 C0

So you could write macros to get any size you'd like. It's a bit of work to find all of those "do nothing" instructions. Plus, at times (most often) the compiler tries to optimize expressions on you. That's where entering the codes may be a requirement.

The longest encoding that I knew about would use the LEA instruction. This is where the size of the address offsets could be optimized out, since they're going to be zeroes, many zeroes, and they should be optimized.

And as Jester mentioned, you could use the existing macros. There is a copy of the file on the Internet.

https://github.com/letolabs/nasm/blob/master/macros/smartalign.mac

It can be fun to decode all of those instructions and see what they are.

For example, they use a MOV %si, %si to create a 2-byte NOP.

Alexis Wilke


[Multi-byte NOP opcode made official](https://software.intel.com/en-us/forums/watercooler-catchall/topic/307174) is from 2006 ... See also [felixcloutier](http://www.felixcloutier.com/x86/NOP.html) – they all are called `nop`. – Jongware Jan 14 '18 at 11:49
@usr2564301 Ah! I guess I missed these few lines... Most still are decoded as their corresponding instructions, though. I guess disassembler programmers should be made aware of that feature, and the assembler should probably introduce a `nop +size` or something of the sort. – Alexis Wilke Jan 14 '18 at 12:15
I suppose it's the same problem as `mov al,[eax+1]` vs. an artificially lengthened `mov al,[eax+0x0000001]`. Disassemblers used to show a literal byte-by-byte representation of the underlying byte code but nowadays that doesn't work anymore. – Jongware Jan 14 '18 at 12:18
FYI, 0x90 is `xchg ax, ax`. It's not an "official" NOP, it's just the do-nothing instruction that Intel marked as a NOP. – David Hoelzer Jan 14 '18 at 13:04
@DavidHoelzer: In x86-64 `0x90` is a true NOP. If it ran as `xchg eax,eax`, it would zero-extend EAX into RAX. If you write `xchg eax,eax`, it can't be encoded as `90`; it has to use the general `xchg r/m32, r32` encoding because the NOP instruction takes over the `0x90` xchg-with-eax opcode. (Other registers can still use the short encoding, like `0x91` to xchg eax with ecx). I'm not sure how assemblers choose to encode `xchg ax,ax` in 32 or 64 bit mode. `66 90` would be legal, although it's a true NOP, not a 3-uop `xchg`. (The LEA encodings used as NOPs aren't true NOPs, though.) – Peter Cordes Jan 14 '18 at 13:10
