imul then mov vs mov then imul - any difference?

Question

If I compile the following C++ program:

int baz(int x) { return x * x; }

in clang 15, I get:

baz(int):
        mov     eax, edi
        imul    eax, edi
        ret

while gcc 12.2 gives me:

baz(int):
        imul    edi, edi
        mov     eax, edi
        ret

(See this on GodBolt)

Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.

score 5 · Answer 1 · answered Feb 10 '23 at 02:23

Prefer mov then imul: it's better on CPUs with mov-elimination, and no worse anywhere else for any other reason.

This is true in general for mov/and, mov/sub, etc., as long as you don't have a use for the original value. If you do, it's sometimes better to mov to make a copy and then modify the original, which hides the mov latency on CPUs without mov-elimination. (mov/add or mov plus a small shift should normally be a single lea.)
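As a sketch of the "original value is still needed" case: the function below (a hypothetical name, not from the question) needs x both for the subtraction and for the multiply. The asm in the comments is hand-written to illustrate the two orderings, not actual compiler output, and the cycle counts assume a 1-cycle sub, a 3-cycle imul, and a 1-cycle mov that is not eliminated.

```cpp
// Hypothetical example: x is used twice, so a copy is unavoidable.
int diff_times_x(int x, int y) {
    // Option A: copy, then modify the copy. mov is on the critical path.
    //     mov   eax, edi      ; eax = x
    //     sub   eax, esi      ; eax = x - y   (waits for the mov)
    //     imul  eax, edi      ; eax = (x - y) * x
    //   x -> result: mov (1) + sub (1) + imul (3) = 5 cycles without mov-elimination
    //
    // Option B: copy, then modify the original. mov runs in parallel with sub.
    //     mov   eax, edi      ; eax = x
    //     sub   edi, esi      ; edi = x - y   (independent of the mov)
    //     imul  eax, edi      ; eax = x * (x - y)
    //   x -> result: sub (1) + imul (3) = 4 cycles; the mov latency is hidden
    return (x - y) * x;
}
```

In both options the imul overwrites the mov destination (EAX), so either is fine for CPUs with mov-elimination; option B only helps when the mov actually needs an execution unit.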

CPU with mov-elimination

mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?

All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)

It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.

To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.

CPU without mov-elimination

mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.

(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
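To make the two-input case concrete, here is a minimal sketch. The function name is made up for illustration, the asm in the comments is hand-written rather than quoted compiler output, and the latencies assume a 3-cycle imul and a 1-cycle mov that is not eliminated.

```cpp
// Hypothetical two-input version of the question's function.
int mul(int x, int y) {
    // mov first: only x pays for the mov.
    //     mov   eax, edi      ; eax = x
    //     imul  eax, esi      ; eax = x * y
    //   x -> result: mov (1) + imul (3) = 4 cycles
    //   y -> result:           imul (3) = 3 cycles
    //
    // imul first: both inputs pay for the mov.
    //     imul  edi, esi      ; edi = x * y
    //     mov   eax, edi      ; eax = x * y
    //   x -> result: imul (3) + mov (1) = 4 cycles
    //   y -> result: imul (3) + mov (1) = 4 cycles
    return x * y;
}
```

With mov first, a late-arriving y gets its result one cycle sooner; with imul first, neither input can.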

But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.

Things that don't matter

On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.

Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) renames low-8 registers like DIL and DL separately from full registers; on Haswell/Skylake and later, only high-8 registers like DH get renamed separately.

Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)

But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

"as long as you don't have a use for the original value" <- That's almost meaningless, in the sense that the function has an ABI which involves the input being placed in edi. The compiler can't know what gets placed there and whether it has any other use. That's a consideration for "inter-procedural" optimization passes. Of course, there's the question of whether the compiler is able to reverse the order of the operations when inlining the code later on. — einpoklum, Feb 10 '23 at 12:22
@einpoklum: yes, meaningless in a function that does `return x*x;` rather than `foo = x*x;` / `bar = x-y;` or something. I was considering the general case of needing mov+operation as part of a larger expression or in general in a larger function, since in practice you'd of course never want to have a function this tiny if you could possibly get it to inline. Compilers do inlining much earlier than asm code-gen, so it's not a matter of "reversing" the order; GCC wouldn't have turned GIMPLE (SSA form) into RTL (register transfer language) until after inlining, same for LLVM using LLVM-IR. — Peter Cordes, Feb 10 '23 at 16:25
