
I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs (up to 15 of them). How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs are worth the cost. It is not me who adds these NOPs but the assembler, automatically, whenever I write an align directive.

Update: I have managed to get the assembler to insert multi-byte NOPs automatically.

Maxim Masiutin
  • Have a look at [Agner Fog's tables](http://www.agner.org/optimize/instruction_tables.pdf). It should give you the numbers you need. – fuz Jul 11 '17 at 17:48
  • @fuz - it says 0.25, i.e. 4 `NOP`s per cycle? This is quite slow! – Maxim Masiutin Jul 11 '17 at 17:52
  • Sounds about right! Consider using multi-byte nops (opcode `0f 1f /0`) to get more nops per cycle. – fuz Jul 11 '17 at 17:56
  • @fuz - I can't -- it's not me who puts in the NOPs but the assembler, whenever I write '.align 16'. I'm not inclined to put NOPs in manually since it would be tedious to realign when I change the code. I should probably use '.align 4', not '.align 16', in places where the NOPs are executed, i.e. where they follow a conditional jump like `jz`, not an unconditional one like `jmp`. – Maxim Masiutin Jul 11 '17 at 18:16
  • The GNU assembler has an option to generate long nops automatically. – fuz Jul 11 '17 at 18:45

3 Answers


It is not me who adds these NOPs but an assembler. It (BASM) is pretty dumb and does not support options for alignment; there is just one option: the boundary size.

I don't know what "BASM" is, and I can't find any reference to it online (except this, which obviously isn't x86), but if it doesn't support multi-byte NOPs, you really need a different assembler. This is just really basic stuff that's been in the Intel and AMD architecture manuals for years. The Gnu assembler can do this for ALIGN directives, and so can Microsoft's MASM. The open-source NASM and YASM assemblers support this as well, and either of these can be integrated into any existing build system easily.
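For reference, the alignment directives in those assemblers look roughly like this (an illustrative sketch only; check each assembler's manual for the exact spelling and fill behaviour):

  align 16        ; NASM / YASM (with NASM, "%use smartalign" makes the padding use long NOPs)
  .p2align 4      # GNU assembler: power-of-two alignment; code sections are filled with multi-byte NOPs
  ALIGN 16        ; MASM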

By multi-byte NOPs, I mean the following, which you can find in AMD and Intel processor manuals:

Length   |  Mnemonic                                 |  Opcode Bytes
---------|-------------------------------------------|-------------------------------------
1 byte   |  NOP                                      |  90
2 bytes  |  66 NOP                                   |  66 90
3 bytes  |  NOP DWORD [EAX]                          |  0F 1F 00
4 bytes  |  NOP DWORD [EAX + 00H]                    |  0F 1F 40 00
5 bytes  |  NOP DWORD [EAX + EAX*1 + 00H]            |  0F 1F 44 00 00
6 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00H]         |  66 0F 1F 44 00 00
7 bytes  |  NOP DWORD [EAX + 00000000H]              |  0F 1F 80 00 00 00 00
8 bytes  |  NOP DWORD [EAX + EAX*1 + 00000000H]      |  0F 1F 84 00 00 00 00 00
9 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00000000H]   |  66 0F 1F 84 00 00 00 00 00

The sequence recommendations offered by the two manufacturers diverge slightly after 9 bytes, but NOPs that long are…not terribly common. And probably don't matter very much, since the extremely long NOP instructions with the excessive number of prefixes are going to degrade performance anyway. These work all the way back to the Pentium Pro, so they are universally supported today.

Agner Fog has this to say about multi-byte NOPs:

The multi-byte NOP instruction has the opcode 0F 1F + a dummy memory operand. The length of the multi-byte NOP instruction can be adjusted by optionally adding 1 or 4 bytes of displacement and a SIB byte to the dummy memory operand and by adding one or more 66H prefixes. An excessive number of prefixes can cause delay on older microprocessors, but at least two prefixes is acceptable on most processors. NOPs of any length up to 10 bytes can be constructed in this way with no more than two prefixes. If the processor can handle multiple prefixes without penalty then the length can be up to 15 bytes.

All of the redundant/superfluous prefixes are simply ignored. The advantage, of course, is that a single multi-byte NOP covers many padding bytes while consuming only one decode slot and one µop, so the padding costs far less front-end bandwidth. They will be faster than a series of 1-byte NOP (0x90) instructions.
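As a concrete illustration built from the table above, an 11-byte gap can be hand-emitted (NASM-style db directives) as one 8-byte NOP plus one 3-byte NOP instead of eleven 0x90 bytes:

  db 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00   ; 8-byte NOP DWORD [EAX + EAX*1 + 00000000H]
  db 0x0F, 0x1F, 0x00                                  ; 3-byte NOP DWORD [EAX]

Two instructions instead of eleven, so the padding costs far fewer decode slots.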

Perhaps even better than multi-byte NOPs for alignment is using longer forms of the instructions that you're already using in your code. These lengthier encodings don't take any longer to execute (they affect only decode bandwidth), so they're faster/cheaper than NOPs. Examples of this are:

  • Using the mod-reg-r/m byte forms of instructions like INC, DEC, PUSH, POP, etc., instead of the short versions
  • Using an equivalent instruction that is longer, like ADD instead of INC or LEA instead of MOV.
  • Encoding longer forms of immediate operands (e.g., 32-bit immediates instead of sign-extended 8-bit immediates)
  • Adding SIB bytes and/or unnecessary prefixes (e.g., operand-size, segment, and REX in long mode)

Agner Fog's manuals speak at length about and give examples of these techniques, as well.

I don't know of any assemblers that will do these conversions/optimizations for you automatically (assemblers pick the shortest version, for obvious reasons), but they usually have a strict mode where you can force a particular encoding to be used, or you can just manually emit the instruction bytes. You only do this in highly performance-sensitive code anyway, where the work will actually pay off, so that limits the scope of the effort required substantially.
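As a rough sketch of what that looks like in practice (NASM-flavored and untested; exactly which encodings you get depends on the assembler, its optimization settings, and the mode):

  add  eax, strict dword 1    ; force the 32-bit-immediate form instead of the sign-extended imm8 form
  lea  ebx, [dword ebx + 0]   ; force a 32-bit displacement on an LEA that is already in the code
  db   0x3E                   ; harmless segment-override prefix in front of...
  mov  ecx, edx               ; ...a register-to-register MOV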

I want to know whether the extra code size and extra execution time of adding these NOPs are worth the cost.

In general, no. While data alignment is extremely important and essentially free (size of the binary notwithstanding), code alignment is a lot less important. There are cases in tight loops where it can make a significant difference, but this only matters in hot spots in your code, which your profiler will already be identifying, and then you can perform the manipulations to manually align the code if necessary. Otherwise, I wouldn't worry about it.

It makes sense to align functions, as the padding bytes between them are never executed (rather than using NOPs here, you'll often see INT 3 or an invalid instruction, like UD2), but I wouldn't go around aligning all of your branch targets within functions simply as a matter of course. Only do it in known critical inner loops.
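For example, with NASM's plain built-in align macro (without smartalign), you can make that never-executed filler explicit; my_func here is just a hypothetical entry point:

  align 16, int3    ; pad up to the 16-byte boundary with INT3 instead of NOPs
  my_func:          ; function entry, now 16-byte aligned
      ret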

As ever, Agner Fog talks about this, and says it better than I could:

Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context.

A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries.

In most cases, the effect of code alignment is minimal. So my recommendation is to align code only in the most critical cases like critical subroutines and critical innermost loops.

Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired. The assembler does this with the ALIGN directive. The NOP's that are inserted will not slow down the performance because they are never executed.

It is more problematic to align a loop entry because the preceding code is also executed. It may require up to 15 NOP's to align a loop entry by 16. These NOP's will be executed before the loop is entered and this will cost processor time. It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's. The best modern assemblers will do just that and use instructions like MOV EAX,EAX and LEA EBX,[EBX+00000000H] to fill the space before an ALIGN nn statement. The LEA instruction is particularly flexible. It is possible to give an instruction like LEA EBX,[EBX] any length from 2 to 8 by variously adding a SIB byte, a segment prefix and an offset of one or four bytes of zero. Don't use a two-byte offset in 32-bit mode as this will slow down decoding. And don't use more than one prefix because this will slow down decoding on older Intel processors.

Using pseudo-NOPs such as MOV RAX,RAX and LEA RBX,[RBX+0] as fillers has the disadvantage that it has a false dependence on the register, and it uses execution resources. It is better to use the multi-byte NOP instruction which can be adjusted to the desired length. The multi-byte NOP instruction is available in all processors that support conditional move instructions, i.e. Intel PPro, P2, AMD Athlon, K7 and later.

An alternative way of aligning a loop entry is to code the preceding instructions in ways that are longer than necessary. In most cases, this will not add to the execution time, but possibly to the instruction fetch time.

He also goes on to show an example of another way to align an inner loop by moving the preceding subroutine entry. This is kind of awkward, and requires some manual adjustment even in the best of assemblers, but it may be the most optimal mechanism. Again, this only matters in critical inner loops on the hot path, where you're probably already digging in and micro-optimizing anyway.

Anecdotally, I've benchmarked code that I was in the middle of optimizing several times, and didn't find very much if any benefit to aligning a loop branch target. For example, I was writing an optimized strlen function (Gnu libraries have one, but Microsoft's don't), and tried aligning the target of the main inner loop on 8-byte, 16-byte, and 32-byte boundaries. None of these made much of a difference, especially not when compared to the other drastic performance headway that I was making in rewriting the code.

And beware that if you're not optimizing for a specific processor, you can make yourself crazy trying to find the best "generic" code. When it comes to the effect of alignment on speed, things can vary wildly. A poor alignment strategy is often worse than no alignment strategy at all.

A power-of-two boundary is always a good idea, but this is pretty readily achieved without any extra effort. Again, don't dismiss alignment out of hand, because it can matter, but by the same token, don't obsess about trying to align every branch target.

Alignment used to be a bit bigger deal on the Core 2 (Merom/Penryn) and Nehalem microarchitectures, where substantial decode bottlenecks meant that, despite a 4-wide issue width, you had a hard time keeping the execution units busy. With the introduction of the µop cache in Sandy Bridge (one of the few nice features of the Pentium 4 that was ultimately reintroduced into the P6 extended family), front-end throughput was increased pretty significantly, and this became a lot less of a problem.

Frankly, compilers aren't very good at making these types of optimizations, either. The -O2 switch for GCC implies the -falign-functions, -falign-jumps, -falign-loops, and -falign-labels switches, with a default preference to align on 8-byte boundaries. This is a pretty blunt approach, and mileage varies. As I linked above, reports vary about whether disabling this alignment and going for compact code might actually increase performance. Moreover, about the best you're going to see a compiler doing is inserting multi-byte NOPs. I haven't seen one that uses longer forms of instructions or drastically rearranges code for alignment purposes. So we still have a long way to go, and it's a very difficult problem to solve. Some people are working on it, but that just goes to show how intractable the problem really is: "Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness." (Note that, while interesting, that paper hails from the early Core 2 days, which suffered more than most from misalignment penalties, as I mentioned earlier. I'm not sure if you'd see the same drastic improvements on today's microarchitectures, but I can't say for sure either way, because I haven't run the test. Maybe Google will hire me and I can publish another paper?)

How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs?

Questions like this can be answered by looking at Agner Fog's instruction tables and searching for NOP. I won't bother extracting all of his data into this answer.

In general, though, just know that NOPs are not free. Although they don't require an execution unit/port, they still have to run through the pipeline like any other instruction, and so they are ultimately bottlenecked by the issue (and/or retirement) width of the processor. This generally means you can execute somewhere between 3 and 5 NOPs per clock.
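If you want to verify this on your own machine, a rough microbenchmark skeleton looks like this (NASM syntax; the label and iteration count are made up, and the actual timing, e.g. with rdtsc or a profiler, is left to the surrounding harness):

      mov     ecx, 100000000    ; enough iterations to swamp measurement noise
  nop_loop:
      times 15 nop              ; the padding sequence under test
      dec     ecx
      jnz     nop_loop          ; time the whole loop, then divide by the iteration count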

NOPs also still take up space in the µop cache, which means reduced code density and cache efficiency.

In many ways, you can think of a NOP as being equivalent to an XOR reg, reg or a MOV that gets elided in the front-end due to register renaming: it consumes issue and retirement bandwidth, but no execution port.

Cody Gray - on strike
  • Thank you for the excellent reply! I have managed to get the assembler to insert multi-byte NOPs automatically. I'm specifying alignment from 2 to 16 bytes, depending on context and importance, but, in general, I'm trying to make sure that at least two instructions fit after the alignment boundary. So, if it's just two `pop`s, I'm aligning by 2, but if there is an important AVX loop to copy memory, I'm aligning by 16. I agree with your reasoning that the lost space and the time to process these NOPs, even multi-byte NOPs, may not be worth the price, especially when the code gets larger and short `jz`s go long. – Maxim Masiutin Jul 12 '17 at 18:25
  • @MaximMasiutin: If you want that kind of flexibility with alignment, the GNU assembler might be a good choice. [`.p2align 4,,10`](https://sourceware.org/binutils/docs/as/P2align.html) will align to 16 (1<<4), but only if that skips 10 bytes or fewer. gcc often emits `.p2align 4,,10` ; `.p2align 3` one after the other, so you always get 8-byte alignment, but maybe also 16 unless that would waste most of 16B. But since no assemblers will pad instructions for you and avoid NOPs entirely, you may have to do that yourself. – Peter Cordes Jul 13 '17 at 03:26
  • My assembler uses slightly different opcodes for multi-byte `NOP`s - these are various LEA RAX/EAX forms, with or without an FS segment prefix byte (64h). – Maxim Masiutin Jul 13 '17 at 12:48

See also Cody's answer for lots of good stuff I'm leaving out because he covered it already.


Never use multiple 1-byte NOPs. All assemblers have ways to get long NOPs; see below.

15 NOPs take 3.75c to issue at the usual 4 per clock, but might not slow down your code at all if it was bottlenecked on a long dependency chain at that point. They do take up space in the ROB all the way until retirement. The only thing they don't do is use an execution port. The point is, CPU performance isn't additive. You can't just say "this takes 5 cycles and this takes 3, so together they will take 8". The point of out-of-order execution is to overlap with surrounding code.

The worst effect of many 1-byte short NOPs on the SnB family is that they tend to overflow the uop-cache limit of 3 lines per aligned 32B chunk of x86 code. That means the whole 32B block always has to run from the decoders, not the uop cache or loop buffer. (The loop buffer only works for loops that have all their uops in the uop cache.)

You should only ever have at most 2 NOPs in a row that actually execute, and then only if you need to pad by more than 10B or 15B or something. (Some CPUs do very badly when decoding instructions with very many prefixes, so for NOPs that actually execute it's probably best not to repeat prefixes out to 15B, the max x86 instruction length.)


YASM defaults to making long NOPs. For NASM, use the smartalign standard macro package, which isn't enabled by default. It forces you to pick a NOP strategy.

%use smartalign
ALIGNMODE p6, 32     ;  p6 NOP strategy, and jump over the NOPs only if they're 32B or larger.

IDK if 32 is optimal. Also, beware that the longest NOPs might use a lot of prefixes and decode slowly on Silvermont, or on AMD. Check the NASM manual for other modes.

The GNU assembler's .p2align directive gives you some conditional behaviour: .p2align 4,,10 will align to 16 (1<<4), but only if that skips 10 bytes or fewer. (The empty 2nd arg means the filler is NOPs, and the power-of-2 align name is because plain .align is power-of-2 on some platforms but byte-count on others). gcc often emits this before the top of loops:

  .p2align 4,,10 
  .p2align 3
.L7:

So you always get 8-byte alignment (unconditional .p2align 3), but maybe also 16 unless that would waste more than 10B. Putting the larger alignment first is important to avoid getting e.g. a 1-byte NOP and then an 8-byte NOP instead of a single 9-byte NOP.

It's probably possible to implement this functionality with a NASM macro.
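Something along these lines might be a starting point (an untested sketch, and maybe_align is a made-up name; the padding expression is the same one NASM's own built-in align macro uses, and the ">> 63" pulls out the sign bit of the difference so the NOP count collapses to zero when the padding would be too large):

  %macro maybe_align 2
    ; pad with NOPs to a %1-byte boundary, but only if that needs at most %2 bytes
    %define %%pad (((%1) - (($ - $$) % (%1))) % (%1))
    times ((%%pad) * (1 - (((%2) - (%%pad)) >> 63))) nop
  %endmacro

  maybe_align 16, 10    ; align to 16, but only if it costs 10 bytes or fewer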


Missing features no assembler has (AFAIK):

  • A directive to pad preceding instructions by using longer encodings (e.g. imm32 instead of imm8 or unneeded REX prefixes) to achieve the desired alignment without NOPs.
  • Smart conditional stuff based on the length of following instructions, like not padding if 4 instructions can be decoded before hitting the next 16B or 32B boundary.

It's a good thing alignment for decode bottlenecks isn't usually very important anymore, because tweaking it usually involves manual assemble/disassemble/edit cycles, and has to get looked at again if the preceding code changes.


Especially if you have the luxury of tuning for a limited set of CPUs, test and don't pad if you don't find a perf benefit. In a lot of cases, especially for CPUs with a uop cache and/or loop buffer, it's ok not to align branch targets within functions, even loops.


Some of the performance-variation due to varying alignment is that it makes different branches alias each other in the branch-prediction caches. This secondary subtle effect is still present even when the uop cache works perfectly and there are no front-end bottlenecks from fetching mostly-empty lines from the uop cache.

See also Performance optimisations of x86-64 assembly - Alignment and branch prediction

Peter Cordes
  • *"Especially if you have the luxury of tuning for a limited set of CPUs…"* I'd draw the same conclusion you did here, but for the opposite case! You can't possibly test on every single CPU, so there are always going to be some on which your code runs non-optimally. Better to just make good, common-sense choices for the general case, and that usually means not going overboard with inserting NOPs for alignment purposes. Also, I think the next bolded statement, about perf differences being due to different branches aliasing each other in the BP, is analysis that's missing from that paper I cited. – Cody Gray - on strike Jul 13 '17 at 09:49
  • Anyway, great answer. Thanks for filling in some of the details that I glossed over or forgot, like how to use smartalign in NASM and how `.p2align` works in Gas. I think it would be really interesting to see an assembler work on a directive to choose longer encodings of instructions for padding/alignment reasons. I wonder if this would be something that the NASM or YASM folks would be interested in looking into? Seems the common candidate instruction mappings could be table-driven, and that would be enough to make a difference in a lot of cases. Prefixes would be even easier to auto-insert. – Cody Gray - on strike Jul 13 '17 at 09:51
  • @CodyGray: the risk with prefixes (other than REX) is that a future CPU might give them a different meaning. e.g. `rep bsf` is `tzcnt` on newer CPUs. I think REX.W=0 should always be safe, though, except for instructions using AH/.../DH. (Also have to check that you don't end up with more than 3 total prefixes, or else Silvermont/KNL will stall on decode.) – Peter Cordes Jul 13 '17 at 12:47
  • Thank you very much! I'm also using NASM, so whenever I use NASM with 64-bit code, I will use `%use smartalign` and `ALIGNMODE p6, 32`. `p6` uses NOP instructions that first appeared on the Pentium Pro, so it definitely exists on 64-bit processors. In 32-bit code, I will use `ALIGNMODE generic` to work even on very old CPUs prior to the Pentium Pro. I didn't know about these NASM options, or that they are off by default! Great answer! – Maxim Masiutin Jul 13 '17 at 12:56
  • @MaximMasiutin: Are you sure you need to care about P5 and older for 32-bit code? `gcc -m32` and most other compilers default to assuming PPro, using instructions like `cmov`. Recent gcc even defaults to assuming SSE2 for 32-bit code. I would assume PPro unless you specifically are building binaries that do need to run on very old hardware with very little RAM. If you're ever making Windows binaries, keep in mind that most recent Windows versions require many more features than PPro. So if you depend on anything at all recent from Windows, you can make CPU assumptions. – Peter Cordes Jul 13 '17 at 13:23
  • Also, the `32` threshold for using a jump over the NOPs is just something I made up because the default looked too low. Whether 3 or 4 long NOPs are better than a `jmp` or not might depend on surrounding code. Also, keep in mind that AMD CPUs might slow down decoding very long NOPs. – Peter Cordes Jul 13 '17 at 13:26
  • @PeterCordes - our software is general-use software, running even under Windows 2000, which may be installed on a CPU prior to the Pentium Pro. – Maxim Masiutin Jul 14 '17 at 01:09
  • @MaximMasiutin: ok, then yeah if you do explicitly support ancient CPUs and compile the compiled parts of your 32-bit code with 486/586-friendly options like `-march=i486`, then you can't use the usual long NOPs. Hopefully `alignmode generic` can use other instructions like `lea` in ways that have no effect, in your 32-bit code. (A lower jump threshold might be appropriate there, though. P6 can use *very* long NOPs.) Or if it always uses short NOPs, then use a low jump threshold to jump over them instead of running them. – Peter Cordes Jul 14 '17 at 01:16
  • @PeterCordes I can't use long NOPs, but I can use various plain LEA EAX combinations, as I have explained - the assembler already includes them, e.g. LEA EAX, [EAX+EAX+00], LEA EAX, [EAX+EAX+00000000], LEA EAX, FS:[EAX+EAX+00], `db 66; LEA EAX, FS:[EAX+EAX+00000000]`, etc - that kind of thing - pretty compatible. – Maxim Masiutin Jul 14 '17 at 03:20
  • @MaximMasiutin: Ok, good. I guess it's a matter of terminology whether you call `LEA EAX, FS:[EAX+EAX+00]` a "long NOP" or not. It's less efficient than one using the P6 2-byte dedicated NOP opcode, but it still doesn't affect the architectural state. (But it does affect the *micro*architectural state differently, introducing an extra cycle of latency for EAX. Actually, what you wrote isn't a NOP because it does `eax = eax+eax`. I assume you meant an addressing mode that eventually resolves to just EAX, like `[0 + EAX*1]` to get a SIB byte + disp32.) Anyway, ok, problem solved. :) – Peter Cordes Jul 14 '17 at 03:29
  • For what it's worth, I've been looking at loop alignment lately on Skylake, and empirically it seems that aligning by 16 or more is almost never worth it, largely because the various front-end parts that are most helped by alignment have all been getting better and are less commonly the bottleneck. In fact, for any given loop I often find that align-by-16 is slower than several other random alignments (usually there are 2 or 3 performance levels, repeating periodically). – BeeOnRope Jul 15 '17 at 19:23
  • The biggest culprits seem to be branch prediction behavior, especially for nested loops, and scheduler port-binding behavior, especially for high-IPC code with port contention. For example, you might have code that should hit 4 IPC if scheduled correctly, but it only actually gets there for 4 alignments out of every 20, or whatever, and not necessarily "even" ones. The behavior is very hard to control since it seems to depend on many address bits which are likely to change when unrelated code changes. – BeeOnRope Jul 15 '17 at 19:24
  • About what to optimize for - it depends on the delivery model for your software, but for long-lived software that you aren't going to update frequently (or that you will update, but where end users often won't use the update) it may make sense to target approximately the newest or second-newest architecture (something like Broadwell or Skylake today), with the idea that you want to perform well on average, but over time the average will end up being centered more on the currently modern chips than the average today, and the best prediction for what will be good in the future is the current chips. – BeeOnRope Jul 15 '17 at 19:28
  • ... the reverse argument is that people with the newest chips may not have any issues with speed, so perhaps you should target the "low end" (generally oldest) of your supported range, since those are the users who will care. Sort of like pointing your solar panels in the direction that gives the most juice in the winter (when you care most) even though it doesn't come close to maximizing total energy. – BeeOnRope Jul 15 '17 at 19:30

Skylake can generally execute four single-byte nops in one cycle. This has been true at least back to the Sandy Bridge (hereafter SnB) micro-architecture.

Skylake, and others back to SnB, will also generally be able to execute four longer-than-one-byte nops in one cycle as well, unless they are so long as to run into front-end limitations.


The existing answers are much more complete and explain why you might not want to use such single-byte nop instructions so I won't add more, but it's nice to have one answer that just answers the headline question clearly, I think.

BeeOnRope