
I have the following NASM assembly program, which runs in around 9.5 seconds:

section .text
global _start

_start:
  mov eax, 0
  mov ebx, 8
  loop:
    inc dword [esp + ebx]
    inc eax
    cmp eax, 0xFFFFFFFF
    jne loop

  mov eax, 1
  mov ebx, 0
  int 0x80

But if I replace [esp + ebx] with either [esp + 8] (same memory location since ebx = 8) or even just [esp], it runs in 10.1 seconds...

How is this possible? Isn't [esp] easier for the CPU to calculate than [esp + ebx]?
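
For clarity, the slower variant differs only in the addressing mode of the memory increment, for example:

  loop:
    inc dword [esp + 8]   ; or: inc dword [esp]
    inc eax
    cmp eax, 0xFFFFFFFF
    jne loop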

    It could be because the `[esp + 8]` and `[esp]` encodings are one byte longer to encode the displacement (8 and 0 respectively). – Ross Ridge Dec 31 '16 at 19:42
    Have you tried to align the loop entry point to, say, 16 bytes? – fuz Dec 31 '16 at 20:40
  • I don't know how you measured, but are you sure this speed difference is measurable by your means? A 600 ms difference seems pretty huge to me, and I'd rather blame other things for that difference than this instruction. – cadaniluk Dec 31 '16 at 23:56
    @Downvoter I was able to confirm the speed difference. It goes away when aligning the loop entrance to 16 bytes. – fuz Jan 01 '17 at 00:29
    @fuz and Runemoro: What hardware did you test on? I tested all three ways, and with alignment, on Intel Skylake. I used `perf stat` on Linux to count clock cycles, to make sure CPU frequency scaling stuff had no effect (the system was idle so they all ran at ~4.35GHz). **Every version I tested ran the same, at 0.73 instructions per cycle, bottlenecked on the latency of a memory-destination read-modify-write ALU operation (`inc`)**. I assembled the code as 32-bit, with `nasm -felf32`, and statically linked it with `ld`. – Peter Cordes Jan 01 '17 at 06:12
  • Strangely, this works out to 5.5 cycles per iteration. Repeating `inc dword [esp+ebx]` four times in the loop body doesn't change that: still one `inc` per 5.5 cycles. Agner Fog's insn tables list `inc m` as 5-6 cycle latency, and 3 uops (unlike `add m, r/imm` which is only 2 uops and 5c latency). This is weird. Hmm, I tested, and `add dword [esp+ebx], 1` was the same speed as `inc`, still ~5.5 cycle latency, so things aren't totally insane. – Peter Cordes Jan 01 '17 at 06:23
  • Try copying `esp` to a different register first. `[esp]` is a special case: it always needs a SIB byte in the machine encoding, even when there's no index register. Perhaps your CPU is slower at address math in this case? But the addresses should all be calculated way ahead of time, since nothing in the loop modifies registers that are used in any of the addressing modes. – Peter Cordes Jan 01 '17 at 06:26
  • @PeterCordes I tested on my `Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz`. – fuz Jan 01 '17 at 14:57
  • @fuz: I'm surprised alignment makes any diff here. How can a front-end effect matter for a loop that bottlenecks at less than 1 IPC, on store-forwarding latency? I'm not surprised that Skylake can't reproduce some slowdowns from previous architectures (making it hard to tune for them with testing only on SKL :/). What IPC are you seeing, exactly? – Peter Cordes Jan 01 '17 at 15:13
  • Oh, just thought of something I should have thought of earlier: [`[esp+ebx]` can't micro-fuse in the OOO core on SnB-family CPUs before Skylake](http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695). But `[esp]` and `[esp+8]` can. Perhaps not micro-fusing actually reduces latency? That wasn't what my testing showed on SnB, but it wasn't extensive. – Peter Cordes Jan 01 '17 at 15:16
  • @PeterCordes I'm not sure how to compute IPC. I get the following timings: without alignment 6.76s for `[esp+ebx]`, 8.31s for `[esp+8]` and with alignment of the loop entrance to 16 bytes 6.67s for both. Without the alignment, the loop entrance is at `804855e`, with alignment it is shifted forwards further. – fuz Jan 01 '17 at 15:24
  • Wait. That's weird: I accidentally placed the align directive after the label, making the nop-slide part of the loop. If I put the align directive before the loop entrance, I get 7.20s for `[esp+ebx]` and 8.39s for `[esp+8]`. This is very weird. – fuz Jan 01 '17 at 15:27
  • @fuz: If you're on Linux, time it with `perf stat ./a.out`. It will print the core clock cycle count as well as the instruction count, and even calculate the IPC for you. That takes care of any SpeedStep/Turbo issues, since there's no memory bottleneck here, just store-forwarding inside the core. – Peter Cordes Jan 01 '17 at 15:28
  • @PeterCordes I'm on FreeBSD, so no `perf stat`, sadly. – fuz Jan 01 '17 at 15:29
  • @fuz: FreeBSD must have some kind of way to access performance counters, though? Anyway, sounds like there's a mystery that perf counters will hopefully help solve. I should be able to do some testing on a Haswell laptop, but I have a bunch of other SO questions that I've been meaning to answer or tidy up my answer on... – Peter Cordes Jan 01 '17 at 15:34
  • Dumb question: is [esp + 8] even mapped? The stack grows down and we haven't pushed yet. – ruthafjord Jan 21 '17 at 09:56
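
  (For reference, Peter Cordes's suggestion above to copy esp to a different register would look something like the sketch below; the choice of edx, and leaving the rest of the loop unchanged, are assumptions, not part of the original experiments.)

  mov edx, esp            ;a plain base register: [edx] encodes without a SIB byte, unlike [esp]
  mov eax, 0
  copyloop:
    inc dword [edx]
    inc eax
    cmp eax, 0xFFFFFFFF
    jne copyloop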

1 Answer


You did not align your loop.
If your jump instruction is not entirely in the same cache line as the rest of the loop, you incur an extra cycle to fetch the next cache line.

The various alternatives you've listed assemble to the following encodings.

0:  ff 04 1c                inc    DWORD PTR [esp+ebx*1]
3:  ff 04 24                inc    DWORD PTR [esp]
6:  ff 44 24 08             inc    DWORD PTR [esp+0x8] 

[esp] and [esp+reg] both encode in 3 bytes, [esp+8] takes 4 bytes. Because the loop starts at some random location, the extra byte pushes (part of) the jne loop instruction to the next cache line.

A cache line is typically 16 bytes.

You can solve this by rewriting the code as follows:

  mov eax, 0
  mov ebx, 8
  align 16              ;align on a cache line.
  loop:
    inc dword [esp + ebx]                     ;7 cycles
    inc eax                                   ;0 latency drowned out by inc [mem]
    cmp eax, 0xFFFFFFFF                       ;0   "          "
    jne loop                                  ;0   "          "

  mov eax, 1
  mov ebx, 0
  int 0x80

This loop should take 7 cycles per iteration.

Disregarding the fact that the loop does not do any useful work, it can be further optimized like so:

  mov eax, 1          ;start counting at 1
  mov ebx, [esp+8]    ;keep the running value in a register
  align 16
  loop:              ;latency   ;comment
    lea ebx,[ebx+1]  ; 0   ;runs in parallel with `add`, does not touch the flags
    add eax,1        ; 1   ;count until eax overflows
    mov [esp+8],ebx  ; 0   ;replace a read-modify-write instruction with a write-only store
    jnc loop         ; 1   ;runs in parallel with `mov [mem],reg`

  mov eax, 1
  xor ebx, ebx
  int 0x80

This loop should take 2 cycles per iteration.

By replacing the inc eax with an add, and replacing the inc [esp+ebx] with instructions that do not alter the flags, you allow the CPU to run the lea + mov pair and the add + jnc pair in parallel.
The add can be faster on newer CPUs because add alters all the flags, whereas inc only alters a subset of the flags.
That can cause a partial flag stall on the jxx instruction, because it has to wait for the partial write to the flags register to be resolved. The mov [esp+8] is also faster, because you're not doing a read-modify-write cycle; you're only writing to memory inside the loop.
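
As a minimal standalone illustration of that flag behaviour (a fragment for a test program, not part of the loop above):

  stc                     ;set the carry flag explicitly
  inc eax                 ;inc updates ZF/SF/OF/AF/PF but leaves CF untouched
  jc cf_from_stc          ;taken: CF still comes from the earlier stc
  cf_from_stc:
  add eax, 1              ;add rewrites every arithmetic flag, including CF
  jnc cf_from_add         ;this branch depends only on the add
  cf_from_add:            ;continue here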

Further gains can be made by unrolling the loop, but the gains will be small, because the memory access here dominates the runtime and this is a silly loop to begin with.
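
For illustration, a 4x unroll of the optimized loop might look like the sketch below; it assumes the total count is rounded to a multiple of 4, so it is not an exact drop-in replacement:

  mov eax, 0
  mov ebx, [esp+8]        ;running value kept in a register
  align 16
  unrolled:
    lea ebx,[ebx+1]       ;four copies of the loop body...
    mov [esp+8],ebx
    lea ebx,[ebx+1]
    mov [esp+8],ebx
    lea ebx,[ebx+1]
    mov [esp+8],ebx
    lea ebx,[ebx+1]
    mov [esp+8],ebx
    add eax,4             ;...but the counter and the branch are paid only once
    jnc unrolled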

To summarize:

  • Avoid read-modify-write instructions in a loop; try to replace them with separate instructions for reading, modifying and writing, or move the reading/writing outside of the loop.
  • Avoid inc to manipulate loop counters; use add instead.
  • Try to use lea for adding when you're not interested in the flags.
  • Always align small loops on cache lines (align 16).
  • Do not use cmp inside a loop; the inc/add instruction already alters the flags.
Johan
  • I wouldn't know how to tune for Netburst if my life depended on it. Some cheaper processors (Atom/Bobcat) have some of the issues described above with `inc/add`. I'm sure in another few years, with the increase in transistor budget, this will all go away. The timings are for K10, but on Haswell the story does not change: the timings go from 7 to 6, so the 'optimized' section is still faster. – Johan Feb 06 '17 at 15:15
  • Of course it's going to be faster than a through-memory dependency on basically anything, but basically everything else in the story changes on Haswell (including the time, which is approximately 1 cycle per iteration for your loop on my machine) – harold Feb 06 '17 at 15:38
  • @harold, `mov [mem],reg` has a latency of 3 cycles on Haswell, this is the same as on K10 and as on almost all modern x64 processors. Unless Haswell throws away successive memory writes (which I doubt) the total latency for the loop will be 4 cycles. As far as I understand it the memory write to an address has to be resolved before I can write again to that same address, never mind reciprocal throughput. **if I'm mistaken please enlighten me, 'cause that would be great news performance-wise** – Johan Feb 06 '17 at 15:47
  • Well 1c/iteration is what the benchmark says, not just theory.. I don't know how exactly it resolves the overwrites but it's doing it (actually I'd like to ask a question about that, but it would probably just be closed so..) – harold Feb 06 '17 at 15:53
  • @Johan - I'm not sure what you are talking about that `mov [mem],reg` has a latency of 3 cycles. This is a pure store, how can it have a well-defined "latency" at all? It only makes sense to talk about a well-defined latency for instructions whose input type (reg, mem, whatever) is the same as their output type, since then the instructions can depend on each other. A store has a _memory_ destination operand and a _register_ source operand, so it doesn't meet that criterion. You _could_ talk about the total latency of a _pair_ of a load and a store to the same address, but not about each one. – BeeOnRope Feb 07 '17 at 02:38
    Certainly on most modern architectures, a stream of stores like this can just execute as fast as the address-generation hardware, store EUs and store buffers are able to take stores. On _any_ modern Intel x86, if the stores are hitting L1, that's going to be exactly 1 store/cycle, as Harold noted. Not just Haswell, even Sandy Bridge and probably earlier can sustain this. – BeeOnRope Feb 07 '17 at 02:40
  • @harold, I think that would make rather a good question actually. – Johan Feb 07 '17 at 10:13
  • Error here: *A cache line is typically 16 bytes*. I think you mean decode or fetch block, since cache lines are 64B on everything after Pentium3 (including I-cache lines). Of course, Intel since Nehalem (and AMD since Zen) reuse already-decoded uops for short loops. Even Core2 has a loop buffer, but it's before the decoders. (Core2 can recycle up to 64B of machine code without re-fetching it or finding instruction boundaries, which is actually the slow part, so it can decode up to 32B per clock in small loops). See [Agner Fog's microarch pdf](http://agner.org/optimize/) – Peter Cordes Apr 08 '17 at 03:29
  • Also, to optimize for Intel SnB-family, put the `add` next to the `jnc` so they can macro-fuse into an add-and-branch uop. Bulldozer can fuse any `cmp/jcc`, so you should only increment one counter in the loop (i.e. use `cmp ebx,end_value / jne` as the loop-exit condition). That might run at one-per-clock on AMD Bulldozer-family, since it would only be 1 store, 1 integer ALU, and 1 CMP+JCC. – Peter Cordes Apr 08 '17 at 03:35
  • Also, `lea` usually can't run on as many ports as `add`, and it's typically longer, so only use it when you can do something you couldn't do with `add`. Out-of-order x86 CPUs use register renaming, so they don't care about write-after-write hazards on EFLAGS, only partial-flag writes (and then usually only reading flags that weren't written by the last insn to write any flags is a problem.) – Peter Cordes Apr 08 '17 at 03:37
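
  (For reference, a sketch of the single-counter, macro-fusion-friendly variant Peter Cordes describes in the comment above; the register choice and starting value are assumptions. Keeping cmp directly in front of jne is what allows the pair to fuse into a single compare-and-branch uop.)

  mov ebx, 0              ;one counter that is also the stored value
  align 16
  fused:
    add ebx, 1
    mov [esp+8], ebx      ;write-only store, no read-modify-write
    cmp ebx, 0xFFFFFFFF   ;loop-exit condition on the same counter
    jne fused             ;cmp/jne macro-fuse on CPUs that support it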