
Say we want to store a string at the address in EDI. Would it be faster to store it this way:

mov byte [edi],0
mov byte [edi+1],1
mov byte [edi+2],2
mov byte [edi+3],3
...

or this way?

mov byte [edi],0
inc edi
mov byte [edi],1
inc edi
mov byte [edi],2
inc edi
mov byte [edi],3
inc edi
...

Some might suggest the following in little-endian:

mov dword [edi],0x3210

Or the following in big-endian:

mov dword [edi],0x0123

But that's not the point of my question. My question is: is it faster to increment the pointer and then do the mov (requiring extra instructions), or is it faster to specify in each mov instruction an offset to add to the address in EDI? And if the latter, after how many movs with the same offset does it become worth just adding that amount to the pointer once? In other words, is this

mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF

faster than this?

add edi,5
mov byte [edi],0xFF
mov byte [edi],0xFF
mov byte [edi],0xFF
mov byte [edi],0xFF

– Isaac D. Cohen
  • This is a meaningless question without knowing the processor brand, architecture, etc. It is generally true that one move that's the width of the datapath will be faster than bytewise moves of the same data. But even that's not a dead cinch because modern architectures optimize such cases. – Gene Jan 14 '16 at 03:56
  • Just a note: it would be 0x00010203 and not 0x0123. – Sami Kuhmonen Jan 14 '16 at 03:59
  • The instructions with offsets are highly optimized and won't stall the pipeline, so I'm tempted to say they'll be faster. But you'll probably be limited by memory speed anyway, even if you're writing to cache. – Mark Ransom Jan 14 '16 at 04:27
  • @MarkRansom: If the sequence of stores is of limited length, then the amount of other execution resources it takes determines how well it can mix with the surrounding instructions. – Peter Cordes Jan 14 '16 at 04:39

1 Answer


See http://agner.org/optimize/, and other links in the wiki, for docs on how to optimize asm.


This way:

mov byte [edi],0
mov byte [edi+1],1
mov byte [edi+2],2
mov byte [edi+3],3
...

will be faster. There's no extra cost for using a displacement on any current microarchitecture AFAIK, except the extra one or four bytes (disp8 or disp32) of instruction size. Two-register (indexed) addressing modes can be slower on Intel SnB-family CPUs, but fixed displacements are fine.
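
To illustrate the distinction (a sketch; the indexed store is my illustration, not from the question):

mov byte [edi+2],2      ; base + fixed displacement: no extra cost
mov byte [edi+ebx],2    ; base + index register: can cost more on SnB-family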

Real compilers like gcc and clang always use the first method (displacements in the effective address) when unrolling loops.
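
For instance, an unrolled byte-fill loop typically looks something like this (a minimal sketch; the fill_loop label and the count in ecx are illustrative assumptions, not from the original):

fill_loop:
    mov byte [edi],0xFF
    mov byte [edi+1],0xFF
    mov byte [edi+2],0xFF
    mov byte [edi+3],0xFF
    add edi,4            ; one pointer update per four stores
    sub ecx,4            ; assumes ecx holds a byte count (multiple of 4)
    jnz fill_loop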


And BTW, a 4-byte store of 0x03020100 would be almost exactly 4x faster than four separate one-byte stores. Most modern CPUs have 128b data paths, so any single store up to 128b takes the same execution resources as an 8b store. AVX 256b stores are still cheaper than two 128b stores on Intel SnB / IvB (if aligned), while Intel Haswell and later can do a 256b store in a single operation. However, mov-immediate to memory only supports immediates up to 32 bits (a qword store sign-extends a 32-bit immediate in 64-bit mode). mov r64, imm64 (to register only) is available in 64-bit mode, but there are no 128b or 256b mov-immediate instructions.
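
Concretely, the four byte stores from the question collapse into one dword store on little-endian x86 (0x00 lands at [edi], 0x03 at [edi+3]):

mov dword [edi],0x03020100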


In 32-bit mode, where one-byte encodings of inc reg are available, the inc edi / mov byte [edi],1 sequence would have equal code size, but still decodes to twice as many uops on recent Intel and AMD microarchitectures. This might not matter if the code is bottlenecked on store throughput or something else, but there's no way in which it's better. CPUs are very complex, and simple analysis by counting uops doesn't always match the results you get in practice, but it's highly unlikely that an inc between every store will run faster. The best you can say is that it might not run measurably slower. It will probably use more power / heat, and be less friendly to hyperthreading.

In 64-bit mode, inc rdi takes 3 bytes to encode: a REX prefix to specify 64-bit operand size (rather than the default 32-bit), an opcode byte for inc r/m, and a mod/rm byte to specify rdi as the operand.
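
For reference, the encodings in question look like this (NASM syntax, with the machine-code bytes as comments):

inc edi               ; 47           = 1 byte  (32-bit mode only)
inc rdi               ; 48 FF C7     = 3 bytes (64-bit mode)
mov byte [edi],1      ; C6 07 01     = 3 bytes (no displacement)
mov byte [edi+1],1    ; C6 47 01 01  = 4 bytes (disp8)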

So in 64-bit mode, there is a code-size downside. In both cases, the inc solution will use twice as many entries in the highly-valuable uop-cache (on Intel SnB-family CPUs), which holds fused-domain uops.

More uops also take up more space in the ROB, so out-of-order exec can't see as far ahead.

Also, a chain of inc instructions will delay the store-address uops from calculating multiple store addresses early (and writing them into the store buffer). Intel Ice Lake has two ports that can run store-address uops (down from three in Haswell). It's better for later loads if store addresses are ready early, so the CPU can be sure they're independent, or detect that they overlap (and forward the stored data). Executing a store-address uop also gets it out of the scheduler (RS) earlier, freeing up space in that out-of-order exec structure.
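
The dependency difference is visible in the code itself (the question's own instructions, annotated):

mov byte [edi],0      ; all addresses derive from the same unmodified edi,
mov byte [edi+1],1    ; so their store-address uops can execute in parallel

inc edi               ; vs.: the next store's address is unknown
mov byte [edi],1      ; until this inc executes, a serial chain through edi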


The 2nd part:

mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF
mov byte [edi+5],0xFF

vs.

add edi,5            ; 3 bytes to encode.
mov byte [edi],0xFF  ; saving one byte in each instruction
mov byte [edi],0xFF
mov byte [edi],0xFF
mov byte [edi],0xFF

Unless code size is critically important (unlikely), or there are many more stores, use the first form. The first form is one byte longer in total (16 vs. 15 bytes here), but one fewer fused-domain uop, so it uses less space in the uop-cache on CPUs that have one. On older CPUs (without a uop cache), instruction decoding was more of a bottleneck, so there might be cases where one form or the other happened to line up better into decode groups of 4. That won't matter if you're bottlenecked on the store port, though.
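
One more code-size wrinkle if there are very many stores: a disp8 only reaches offsets -128..+127, beyond which each mov needs a 4-byte disp32. At that point, re-basing the pointer occasionally wins the short encodings back (a sketch, with encodings as comments):

mov byte [edi+127],0xFF   ; C6 47 7F FF          = 4 bytes (disp8)
mov byte [edi+128],0xFF   ; C6 87 80 00 00 00 FF = 7 bytes (disp32)
add edi,128               ; re-base once...
mov byte [edi],0xFF       ; C6 07 FF             = 3 bytes again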

– Peter Cordes
  • Note also that effective address calculations can take place in parallel if you use the no-`inc` version. With the `inc` version, each effective address calculation stalls on the preceding `inc`. (Or at least it did on Pentium, which was the last time I studied this sort of thing.) – Raymond Chen Jan 16 '16 at 07:16
  • @RaymondChen: yup, that's correct. I left that out of the answer because they're stores, so it doesn't matter as much if they take a long time to eventually retire. Although now that I think about it, unresolved store addresses mean that all following loads have to wait in case there's a read-after-write dependency. Also, being able to execute the store-address uops gets them out of the scheduler, which is much smaller than the re-order buffer. (32 vs. 192 or something). – Peter Cordes Jan 16 '16 at 08:25
  • If you really want to dig deep, the `inc` instruction is troublesome because it modifies *some* flags but leaves other unchanged. This means that the state of flags after `inc` is dependent on the order of execution of multiple instructions, which further impedes parallelism. – Raymond Chen Jan 16 '16 at 16:47
  • @Raymond: It's only a problem for software on P4. The P6 and SnB microarch families, and AMD, all rename different parts of EFLAGS separately, so there's only a penalty if you read a flag that was left unmodified by the last instruction to set flags. (e.g. an `adc` / `dec` / `jnz` loop.) The penalty is worse on older CPUs (stall vs. extra uop to merge) http://stackoverflow.com/questions/32084204/problems-with-adc-sbb-and-inc-dec-in-tight-loops-on-some-cpus. Modern hardware spends transistors and power to avoid false dependencies on flags whenever possible, so might as well save insn bytes. – Peter Cordes Jan 16 '16 at 17:10