You did not align your loop.
If the jump instruction does not lie entirely in the same cache line as the rest of the loop, you incur an extra cycle to fetch the next cache line.
The various alternatives you've listed assemble to the following encodings.
0: ff 04 1c inc DWORD PTR [esp+ebx*1]
3: ff 04 24 inc DWORD PTR [esp]
6: ff 44 24 08 inc DWORD PTR [esp+0x8]
[esp] and [esp+reg] both encode in 3 bytes; [esp+8] takes 4 bytes.
Because the loop starts at some random location, the extra byte pushes (part of) the jne loop instruction into the next cache line.
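Strictly speaking, the unit that matters here is the 16-byte block in which many x86 CPUs fetch instructions; a full cache line is usually 64 bytes.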
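The boundary arithmetic can be checked with a small sketch. It is only an illustration: the start offset of 7 is a hypothetical value, the function name is made up, and the lengths assumed for inc eax (1 byte), cmp eax, 0xFFFFFFFF (3 bytes, sign-extended 8-bit immediate) and jne loop (2 bytes, short jump) are the usual 32-bit encodings rather than something taken from the listing above.

```python
def straddling_instructions(start, lengths, block=16):
    """Return the indices of instructions whose bytes cross a block boundary.

    start   -- offset of the first instruction (hypothetical value)
    lengths -- encoded length in bytes of each instruction, in order
    block   -- size of the aligned block being considered
    """
    crossing = []
    offset = start
    for index, length in enumerate(lengths):
        first_byte = offset
        last_byte = offset + length - 1
        # An instruction straddles a boundary when its first and last
        # byte fall into different aligned blocks.
        if first_byte // block != last_byte // block:
            crossing.append(index)
        offset += length
    return crossing


# Loop body: inc [mem], inc eax, cmp eax, imm8, jne rel8
# (the last three lengths are assumed typical 32-bit encodings)
SHORT_INC = [3, 1, 3, 2]  # inc dword ptr [esp] or [esp+ebx]
LONG_INC = [4, 1, 3, 2]   # inc dword ptr [esp+8]

START = 7  # hypothetical unaligned loop start

print(straddling_instructions(START, SHORT_INC))  # []
print(straddling_instructions(START, LONG_INC))   # [3] -> the jne crosses
print(straddling_instructions(0, LONG_INC))       # [] once the loop is aligned
```

With this particular start offset, the one extra byte is enough to push the last byte of the jne over the boundary, and aligning the loop start removes the problem.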
You can solve this by rewriting the code as follows:
mov eax, 0
mov ebx, 8
.align 16 ;align on a cache line.
loop:
inc dword ptr [esp + ebx] ;7 cycles
inc eax ;0 latency drowned out by inc [mem]
cmp eax, 0xFFFFFFFF ;0 " "
jne loop ;0 " "
mov eax, 1
mov ebx, 0
int 0x80
This loop should take 7 cycles per iteration.
Disregarding the fact that the loop does not do any useful work, it can be further optimized like so:
mov eax, 1 ;start counting at 1
mov ebx, [esp+8]   ;load the value once, outside the loop
.align 16
loop: ;latency ;comment
lea ebx,[ebx+1] ; 0 ;Runs in parallel with `add`
add eax,1 ; 1 ;count until eax overflows
mov [esp+8],ebx ; 0 ;replace a R/W instruction with a W-only instruction
jnc loop ; 1 ;runs in parallel with `mov [mem],reg`
mov eax, 1
xor ebx, ebx
int 0x80
This loop should take 2 cycles per iteration.
By replacing the inc eax with an add, and replacing the inc [esp] with instructions that do not alter the flags, you allow the CPU to run the lea + mov and the add + jnc instructions in parallel.
The add can be faster on newer CPUs because add alters all the flags, whereas inc only alters a subset of the flags.
With inc, this can cause a partial-flags stall on the jxx instruction, because it has to wait for the partial write to the flags register to be resolved.
The mov [esp+8],ebx store is also faster, because it is not a read-modify-write cycle: inside the loop you only write to memory.
Further gains can be made by unrolling the loop, but the gains will be small, because the memory access here dominates the runtime and this is a silly loop to begin with.
To summarize:
- Avoid read-modify-write instructions in a loop; try to replace them with separate instructions for reading, modifying and writing, or move the reading / writing outside of the loop.
- Avoid inc to manipulate loop counters; use add instead.
- Try to use lea for adding when you're not interested in the flags.
- Always align small loops on cache lines (.align 16).
- Do not use cmp inside a loop; the inc/add instruction already alters the flags.
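To see why the cmp can be dropped without changing the trip count, here is a rough simulation of the two loop-exit conditions. It is a sketch under assumptions: the register is modelled at a reduced width (8 bits by default) so that it finishes quickly, and the function names are made up for illustration.

```python
def iterations_cmp_jne(bits=8):
    """Count iterations of: inc eax / cmp eax, all-ones / jne loop, eax starting at 0."""
    mask = (1 << bits) - 1
    eax = 0
    count = 0
    while True:
        count += 1
        eax = (eax + 1) & mask   # inc eax
        if eax == mask:          # cmp eax, all-ones; jne falls through when equal
            return count


def iterations_add_jnc(bits=8):
    """Count iterations of: add eax, 1 / jnc loop, eax starting at 1."""
    mask = (1 << bits) - 1
    eax = 1
    count = 0
    while True:
        count += 1
        total = eax + 1          # add eax, 1
        carry = total > mask     # carry flag is set when the register overflows
        eax = total & mask
        if carry:                # jnc falls through once carry is set
            return count


print(iterations_cmp_jne(8), iterations_add_jnc(8))    # 255 255
print(iterations_cmp_jne(16), iterations_add_jnc(16))  # 65535 65535
```

Starting the counter at 1 and testing the carry flag gives the same number of iterations as counting from 0 and comparing against the all-ones value, which is what lets the add both count and decide the branch.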