See http://agner.org/optimize/, and other links in the x86 wiki, for docs on how to optimize asm.
This way:
mov byte [edi],0
mov byte [edi+1],1
mov byte [edi+2],2
mov byte [edi+3],3
...
will be faster. There's no extra cost for using a displacement on any current microarchitecture AFAIK, except the extra one or four bytes of instruction size. Two-register addressing modes can be slower on Intel SnB-family CPUs, but fixed displacements are fine.
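For example, hand-assembling the byte-store encodings (hypothetical displacements, just to show the size cost of disp8 vs. disp32):
mov byte [edi], 3      ; C6 07 03              (3 bytes: no displacement)
mov byte [edi+3], 3    ; C6 47 03 03           (4 bytes: +1 for a disp8)
mov byte [edi+300], 3  ; C6 87 2C 01 00 00 03  (7 bytes: +4 for a disp32)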
Real compilers like gcc and clang always use the first method (displacements in the effective address) when unrolling loops.
And BTW, a 4-byte store of 0x03020100 would be almost exactly 4x faster than four separate one-byte stores. Most modern CPUs have 128b data paths, so any single store up to 128b takes the same execution resources as an 8b store. AVX 256b stores are still less expensive than two 128b stores on Intel SnB / IvB (if aligned), while Intel Haswell and later can do a 256b store in a single operation. However, mov-immediate to memory is only available for 8, 16, and 32-bit operands. mov r64, imm64 (to register only) is available in 64-bit mode, but there are no 128b or 256b mov-immediate instructions.
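For example, the four byte stores at the top collapse into a single dword store (x86 is little-endian, so the low byte 0x00 goes to the lowest address):
mov dword [edi], 0x03020100  ; stores 0,1,2,3 to [edi]..[edi+3] with a single instruction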
In 32-bit mode, where one-byte encodings of inc reg are available, the inc edi / mov byte [edi],1 version would have equal code size, but still decodes to twice as many uops on recent Intel and AMD microarchitectures. This might not be a problem if the code was still bottlenecked on store throughput or something, but there's no way in which it's better. CPUs are very complex, and simple analysis by counting uops doesn't always match the results you get in practice, but I think it's highly unlikely that an inc between every store will run faster. The best you can say is that it might not run measurably slower. It will probably use more power / heat, and be less friendly to hyperthreading.
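To make the 32-bit code-size claim concrete (a sketch of the comparison, byte counts hand-assembled from the encodings):
inc edi              ; 47          (1 byte)
mov byte [edi], 1    ; C6 07 01    (3 bytes)  -> 4 bytes, 2 fused-domain uops
; vs.
mov byte [edi+1], 1  ; C6 47 01 01 (4 bytes)  -> 4 bytes, 1 fused-domain uop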
In 64-bit mode, inc rdx takes 3 bytes to encode: 1 REX prefix byte to specify 64-bit operand size (rather than the default 32-bit), 1 opcode byte to specify inc r/m, and 1 mod/rm byte to specify rdx as the operand. So in 64-bit mode, there is a code-size downside. In both cases, the inc solution will use twice as many entries in the highly valuable uop cache (on Intel SnB-family CPUs), which holds fused-domain uops.
More uops also means more space in the ROB, so out-of-order exec can't see as far ahead.
Also, a chain of inc instructions forms a serial dependency that delays the store-address uops: instead of all the store addresses being computable in parallel from the original pointer (and written into the store buffer early), each address has to wait for the previous inc. Intel Ice Lake has two ports that can run store-address uops (down from 3 in Haswell). It's better for later loads if store addresses are ready earlier, so the CPU can be sure they're independent, or detect that they do overlap. It also gets them out of the scheduler (RS) earlier, freeing up space in that out-of-order exec structure.
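To illustrate the dependency difference (a sketch; the comments are the point, the values are made up):
; displacement form: every store-address uop depends only on the original edi,
; so all the addresses can be computed as soon as edi is ready:
mov byte [edi+1], 1
mov byte [edi+2], 2
; inc form: each store address waits for the previous inc, a serial chain:
inc edi
mov byte [edi], 1
inc edi
mov byte [edi], 2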
The 2nd part:
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
vs.
add edi,5 ; 3 bytes to encode.
mov byte [edi],0xFF ; saving one byte in each instruction
mov byte [edi],0xFF
mov byte [edi],0xFF
mov byte [edi],0xFF
Unless code-size is critically important (unlikely), or there were many more stores, use the first form. The 1st form is one byte longer, but it's one fewer fused-domain uop, so it will use less space in the uop cache on CPUs that have one. On older CPUs (without a uop cache), instruction decoding was more of a bottleneck, so there might be some cases where having instructions line up better into decode groups of 4 mattered. That won't be the case if you're bottlenecked on the store port, though.
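For reference, the byte and uop counts behind that comparison (encodings hand-assembled; fused-domain counts assume each mov-to-memory micro-fuses on SnB-family):
mov byte [edi+5],0xFF  ; C6 47 05 FF (4 bytes)  x4 = 16 bytes, 4 uops
; vs.
add edi,5              ; 83 C7 05    (3 bytes)
mov byte [edi],0xFF    ; C6 07 FF    (3 bytes)  x4 -> total 15 bytes, 5 uops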