With x86-style 2-operand instructions that destroy their destination, you can always simulate a non-destructive 3-operand instruction: use mov to copy one operand to the destination, then run the destructive instruction on that destination.
; with ecx and edx holding your inputs (which I'm calling C and D).
mov ebx, ecx ; ebx = C
sub ebx, edx ; ebx = C - D
That's the best you can do for this case, where you need to not destroy the values in ECX and EDX.
If you're running low on available registers, saving ECX on the stack and then producing the C - D result in ECX instead of a new register can be a good option.
Often you can keep using the same register for the same variable throughout a function, but this is not required, and sometimes not optimal. Use comments to keep track of things.
Compilers are usually pretty good at register allocation, but their code can be hard to read because they don't even try to be consistent with register use. For non-destructive operations they'll often put the result in a new register for no reason. Still, compiler output is often a good starting point for optimization. (Write a tiny function that does something, and see how it compiles. Or write your whole thing in C with function args instead of constants as inputs, and compile it.)
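As a minimal sketch of that approach, here's the C - D case as a tiny C function. The inputs are function args rather than constants, so the compiler can't fold the result at compile time, and both inputs are still needed after the subtraction, so it has to keep them alive somewhere. The function name is made up for illustration.

```c
/* Hypothetical example function for reading compiler output.
   c and d are both used again after the subtraction, so the compiler
   can't simply overwrite one of them with c - d. */
int diff_then_use_inputs(int c, int d)
{
    int diff = c - d;       /* the non-destructive subtract */
    return diff * c + d;    /* both inputs are still live here */
}
```

Compile with something like gcc -O2 -S -masm=intel and look at which registers the compiler picked. The exact output depends on the compiler, version, and calling convention.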
x86 has some copy-and-operate instructions for other operations (not sub), most notably LEA.
lea ebx, [ecx + ecx*4] ; ebx = C * 5
lea ebx, [ecx + edx - 2] ; ebx = C + D - 2
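x86 addressing modes can add or subtract constants, but registers can only be added, optionally with one of them left-shifted by 1, 2, or 3 (scaled by 2, 4, or 8). There's no way to subtract a register.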
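To see what fits in an addressing mode, here's a sketch of a few C expressions you could feed to a compiler. The function names are hypothetical. The first three have the base + index*scale + constant shape, so a compiler will typically (not necessarily) use a single LEA for each. The last one subtracts a register, which no single addressing mode can express.

```c
/* Shapes that match an x86 addressing mode: base + index*scale + disp. */
int times5(int c)               { return c * 5; }          /* c + c*4 */
int sum_minus2(int c, int d)    { return c + d - 2; }      /* base + index + disp */
int scaled_sum(int c, int d)    { return c + d * 8 + 16; } /* index scaled by 8 */

/* Doesn't match: a register is subtracted, so expect more than one
   instruction (e.g. a mov/sub pair plus an add or LEA for the constant). */
int diff_minus2(int c, int d)   { return c - d - 2; }
```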
The immediate-operand form of imul is also 3-operand, for use with multipliers that you can't do with 1 or 2 LEAs:
imul ebx, ecx, 0x01010101 ; ebx = cl repeated 4 times, if upper bytes were zero
Unlike most immediate-operand instructions, imul doesn't overload the /r field in the ModRM byte as extra opcode bits. So it has room to encode a register destination and a reg/mem source, because 186 dedicated a whole opcode byte to it.
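The byte-broadcast multiply can be written in C like this, as a sketch (the function name is made up). Zero-extending the byte first corresponds to the "if upper bytes were zero" condition: with the upper bytes clear, multiplying by 0x01010101 adds four shifted copies of the byte that never overlap, so there are no carries between them.

```c
#include <stdint.h>

/* Replicate one byte into all four bytes of a 32-bit value.
   b * 0x01010101 == b + (b << 8) + (b << 16) + (b << 24),
   and the four terms occupy separate bytes because b < 256. */
uint32_t broadcast_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}
```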
ISA extensions like BMI1 and BMI2 have added some new 3-operand integer instructions, like ANDN and SHRX.
andn ebx, ecx, edx ; ebx = (~C) & D ; BMI1
shrx ebx, edx, ecx ; ebx = D >> C ; BMI2
But they're not universally available: only on Haswell and later, and on Ryzen. (And the Pentium/Celeron versions of Haswell/Skylake are still sold without them, further delaying the point at which they become baseline. Thanks, Intel.)
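For reference, here's a sketch of what those two instructions compute, written as plain C (the function names are hypothetical). Because it's portable C, it runs on any CPU. A compiler may pick ANDN or SHRX for these when you enable the extensions (e.g. -mbmi -mbmi2 with gcc or clang), and falls back to ordinary 2-operand instructions otherwise.

```c
#include <stdint.h>

/* Same result as ANDN: complement the first input, then AND with the second. */
uint32_t andn_u32(uint32_t c, uint32_t d)
{
    return ~c & d;
}

/* Same result as 32-bit SHRX: the hardware uses only the low 5 bits of the
   count. Masking here also keeps the C shift well-defined for counts >= 32. */
uint32_t shrx_u32(uint32_t d, uint32_t count)
{
    return d >> (count & 31);
}
```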
And of course for vector instructions, AVX provides non-destructive versions of all the SSE instructions.
movaps xmm2, xmm0 ; copy a whole register
subsd xmm2, xmm1 ; scalar double-precision FP subtract: xmm0-xmm1
vsubsd xmm3, xmm0, xmm1 ; AVX: xmm3 = xmm0-xmm1 in one instruction, no copy needed
Or a less obvious use-case:
xorps xmm0, xmm0 ; zero the register and break any false dependencies
cvtsi2sd xmm0, eax ; convert to double-precision FP, with the upper element = 0
xorps xmm1, xmm1
cvtsi2sd xmm1, edx
vs. AVX:
vxorps xmm1, xmm1, xmm1 ; xmm1 = all-zero
vcvtsi2sd xmm0, xmm1, eax
vcvtsi2sd xmm1, xmm1, edx
This reuses the same zeroed register as the merge source for both conversions, which avoids false dependencies and leaves the upper 64 bits of each 128-bit result zero.
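If you want to see how a compiler handles this, a sketch like the following does the same two int-to-double conversions (the function name and the output-array interface are my own choices for illustration). Comparing the asm with and without -mavx should show the SSE vs. AVX encodings, though whether and how the compiler zeroes a register first varies by compiler and tuning options.

```c
/* Two independent int -> double conversions, the same work as the
   cvtsi2sd pairs above. Results go to memory so neither is optimized away. */
void to_doubles(int a, int d, double out[2])
{
    out[0] = (double)a;
    out[1] = (double)d;
}
```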