
Instead of writing `mov rax, 1` (7-byte encoding: 48 C7 C0 01 00 00 00), I can write `mov eax, 1` (5-byte encoding: B8 01 00 00 00), relying on the automatic zeroing of the high dword.

For copying RAX to R8, I can choose between `mov r8, rax` (3-byte encoding: 49 89 C0) or `mov r8d, eax` (3-byte encoding: 41 89 C0), again relying on the automatic zeroing of the high dword.
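
Spelling out the REX bits behind those byte values (a sketch; the encodings are the ones quoted above, the bit breakdowns in the comments are my annotation):

```nasm
mov rax, 1     ; 48 C7 C0 01 00 00 00   48 = REX.W (0100.1000)
mov eax, 1     ; B8 01 00 00 00         no REX prefix needed
mov r8, rax    ; 49 89 C0               49 = REX.W + REX.B
mov r8d, eax   ; 41 89 C0               41 = REX.B only
```

Both register-to-register copies are three bytes either way: toggling REX.W only flips a bit inside a prefix byte that REX.B forces to be present anyway, so it never changes the length.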

Is there any reason at all to prefer one method of copying over the other?
The REX prefix cannot be avoided since R8 is one of the 'new' registers, and so REX.B is needed. Under this circumstance, is it desirable to try to avoid having the REX.W bit set?

Peter Cordes
Sep Roland

1 Answer


If you need a REX prefix anyway, it doesn't really matter what bits are set in it. Almost all 64-bit instructions are as fast as their 32-bit counterparts; exceptions include the usual suspects (multiplications and divisions).

As for which of these two is faster: despite having a longer dependency chain, the second variant

mov eax, 1
mov r8d, eax

is likely to be faster, as the second instruction is likely to be handled at register renaming, producing no latency and no execution µops at all. There are somewhat obscure exceptions in which this elimination may not fire; use a microarchitectural analyser to find these. In such cases, it may be better to load two immediates, as these can be executed in parallel.
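
As a sketch in NASM syntax (the `mov r8d, 1` line and its encoding are my addition, illustrating the "two immediates" alternative the answer mentions):

```nasm
; Variant 1: copy through a register; the second mov is a candidate
; for elimination at register rename (zero latency, no execution uop).
mov eax, 1      ; B8 01 00 00 00
mov r8d, eax    ; 41 89 C0

; Variant 2: two independent immediate loads; three bytes larger in
; total, but with no dependency between them, so they issue in parallel.
mov eax, 1      ; B8 01 00 00 00
mov r8d, 1      ; 41 B8 01 00 00 00
```

Variant 1 wins on code size (8 bytes vs. 11); variant 2 avoids any reliance on mov-elimination working.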

fuz
  • Just to answer the second part of the question, `mov r8, rax` should have equal performance to `mov r8d, eax` on all CPUs. Probably in all cases, not just when the source register is known to be 32-bit-zero-extended. You need a REX prefix for R8 anyway, and it's not worse. I don't expect it would *help* [mov-elimination](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all) work better on any CPUs, but I don't expect it would hurt. – Peter Cordes Jul 22 '22 at 18:05
  • @PeterCordes That's what I tried to say in the first paragraph. – fuz Jul 22 '22 at 18:09
  • 1
    I have occasionally found that 32-bit operand size was faster than 64-bit operand-size in non-obvious cases (i.e. not division), such as https://www.agner.org/optimize/blog/read.php?i=415#857 testing uop throughput involving micro-fused loads, where using 64-bit `add rdx, [rsp]` made the loop 10% slower on Skylake. Code-alignment shouldn't have mattered, since that was 2017, before microcode that disabled the LSD, and before the JCC erratum. – Peter Cordes Jul 22 '22 at 18:11
  • Oh, yes, I guess I skimmed past your first paragraph, sorry. So yeah, I concur with that. Effects that make mov-elimination succeed or not would, I think, involve surrounding code, not the choice of 32 vs. 64-bit operand-size for a mov that is eligible. – Peter Cordes Jul 22 '22 at 18:13