The interesting part of this exercise is finding ways to use 1 or 2 LEA, SHL, and/or ADD/SUB instructions to implement multiplies by various constants.
Actually dispatching on the fly for a single multiply isn't very interesting, and would mean either actual JIT compiling or that you have every possible sequence already present in a giant table of tiny blocks of code. (Like switch statements.)
Instead I'd suggest writing a C or Python or whatever function that takes 1 integer arg, and as output produces the asm source text that implements x * n, where n is the integer arg. I.e. a function like you might find in a compiler that optimizes a multiply-by-constant.
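A bare-bones skeleton of that interface could look like the following (just a sketch of my own, not any compiler's code; the function name, label naming, and Intel-syntax output are arbitrary choices):

#include <stdio.h>

// Print an asm function that computes x*c for x86-64 System V: x arrives in
// EDI, the result goes in EAX. Everything interesting happens where the
// comment is; imul-immediate is the generic fallback.
void emit_mul(int c) {
    printf("mul%d:\n", c);
    // ... try the cheap patterns from the sections below here, best first,
    //     and return early if one of them emitted a sequence ...
    printf("    imul eax, edi, %d\n", c);
    printf("    ret\n");
}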
You might want to cook up an automated way to test this, e.g. by comparing against a pure C x * n for a couple of different x values.
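For example, something like this driver, linked against the .s file your generator produced (the function name mul_test and the constant 15 are placeholders for whatever you generated, not anything standard):

#include <stdio.h>
#include <stdint.h>

#define CONST 15
extern int mul_test(int x);   // the generated asm function

int main(void) {
    int32_t probes[] = {0, 1, -1, 2, 7, 12345, -98765, INT32_MAX, INT32_MIN};
    int n = sizeof probes / sizeof probes[0];
    for (int i = 0; i < n; i++) {
        uint32_t want = (uint32_t)probes[i] * (uint32_t)CONST;  // wrapping reference
        uint32_t got  = (uint32_t)mul_test(probes[i]);
        if (got != want) {
            printf("mismatch for x=%d\n", probes[i]);
            return 1;
        }
    }
    puts("all probes match");
    return 0;
}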
If you can't get the job done in 2 instructions (or 3 with one of them being mov), it's not worth it. Modern x86 has ridiculously efficient multiply in hardware. imul reg, r/m, imm is 1 uop, 3 cycle latency, fully pipelined. (AMD since Zen, Intel since Core2 or Nehalem or so.) That's your fallback for anything that you can't get done with a critical path length of 1 or 2 cycles (assuming zero-latency mov if you want, like IvyBridge+ and Zen.)
Or you could set a higher threshold before fallback if you want to explore more complicated sequences, e.g. aim for 64-bit multiply on Bulldozer-family (6 cycle latency). https://agner.org/optimize/. Or even P5 Pentium where imul takes 9 cycles (not pairable).
Patterns to look for
Integer multiply boils down to adding up shifted copies of one operand, one copy for each 1 bit in the other operand. (See the algorithm for multiplying by a runtime-variable value with shift and add, checking each bit one at a time.)
The easiest pattern is of course only a single set bit, i.e. a power of 2; then it's just a left shift. This is easy to check for: (n & (n-1)) == 0, when n != 0.
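In the skeleton above, that case could be a small helper along these lines (hypothetical try_* naming of my own; it prints the function body and lets the caller handle the label and ret):

#include <stdio.h>

// Power of 2: a single shift (or nothing beyond the mov for c == 1).
// Returns 1 if it emitted a sequence, 0 if this pattern doesn't apply.
int try_pow2(int c) {
    if (c <= 0 || (c & (c - 1)) != 0) return 0;
    int k = __builtin_ctz(c);                  // bit index of the only set bit
    printf("    mov eax, edi\n");
    if (k) printf("    shl eax, %d\n", k);     // x << k == x * 2^k
    return 1;
}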
Anything with exactly 2 set bits is at most 2 shifts and an add. (GNU C __builtin_popcount(n) counts set bits. In x86 asm, SSE4.2 popcnt.)
GNU C __builtin_ctz finds the bit-index of the lowest set bit. Using it on a number you know is non-zero will give you the shift count for the low bit. In x86 asm, bsf / tzcnt.
To clear that lowest set bit and "expose" the next-lowest, you can do n &= n-1;. In x86 asm, BMI1 blsr or LEA / AND.
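Putting those three pieces together, a 2-set-bits helper might look like this (same hypothetical try_* style as above; it always uses the generic mov/shl/add/shl shape and ignores the one-LEA special cases for brevity):

#include <stdio.h>

// Exactly two set bits: x*c = (x << hi) + (x << lo). Emitted as
// x * (2^(hi-lo) + 1), then one final shift by lo if needed.
int try_two_bits(int c) {
    if (c <= 0 || __builtin_popcount(c) != 2) return 0;
    int lo = __builtin_ctz(c);                 // shift count of the low set bit
    int hi = __builtin_ctz(c & (c - 1));       // clear it, find the other bit
    printf("    mov eax, edi\n");
    printf("    shl eax, %d\n", hi - lo);      // x * 2^(hi-lo)
    printf("    add eax, edi\n");              // x * (2^(hi-lo) + 1)
    if (lo) printf("    shl eax, %d\n", lo);   // * 2^lo
    return 1;
}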
Another interesting pattern to look for is 2^n +- 1. The +1 case is already covered by the 2-set-bits case, but there the shift count for the low bit is 0, so no shift is needed. With shift counts up to 3, you can do it in one LEA.
You can detect 2^n - 1 by checking if n+1 is a power of 2 (has only 1 bit set). Somewhat more complex, (2^n - 1) * 2^m can be done with this trick plus another shift. So you could try right-shifting to bring the lowest set bit to the bottom, then looking for tricks.
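A helper for that shape could look like this (a sketch in the same made-up try_* style; it assumes the simpler patterns above were already tried, so it doesn't bother avoiding them):

#include <stdio.h>

// c == (2^n - 1) << m, i.e. a contiguous run of n set bits with m trailing
// zeros: shift left by n, subtract the original, then shift by m.
int try_ones_run(unsigned c) {
    if (c == 0) return 0;
    int m = __builtin_ctz(c);                  // trailing zeros: the 2^m factor
    unsigned run = c >> m;
    if ((run & (run + 1)) != 0) return 0;      // run+1 must be a power of 2
    int n = __builtin_popcount(run);           // length of the run of 1 bits
    printf("    mov eax, edi\n");
    printf("    shl eax, %d\n", n);            // x * 2^n
    printf("    sub eax, edi\n");              // x * (2^n - 1)
    if (m) printf("    shl eax, %d\n", m);     // * 2^m
    return 1;
}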
GCC does x*15 the 2^n - 1 way:
mul15: # gcc -O3 -mtune=bdver2
    mov eax, edi
    sal eax, 4        # x*16
    sub eax, edi      # x*16 - x = x*15
    ret
clang is more efficient (for Intel CPUs where scaled-index is still only 1 cycle latency):
mul15: # clang -O3 -mtune=bdver2
    lea eax, [rdi + 4*rdi]    # x*5
    lea eax, [rax + 2*rax]    # x*5 * 3 = x*15
    ret
Combining these patterns
Maybe factorize your number into its prime factors and look for ways to use your building blocks to do combinations of those factors.
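A greedy version of that factor-peeling idea might look like the sketch below (hypothetical; a real generator would buffer the text and only commit it if the whole constant factors, and would cap the total instruction count):

#include <stdio.h>

// Peel off factors a single LEA (3, 5, 9) or a single shift (2^k) can handle,
// accumulating in EAX. Assumes c > 0. The leading mov could be folded into the
// first LEA by addressing off RDI instead, as in the examples above.
int try_factorize(int c) {
    printf("    mov eax, edi\n");
    while (c > 1) {
        if      (c % 9 == 0) { printf("    lea eax, [rax + 8*rax]\n"); c /= 9; }
        else if (c % 5 == 0) { printf("    lea eax, [rax + 4*rax]\n"); c /= 5; }
        else if (c % 3 == 0) { printf("    lea eax, [rax + 2*rax]\n"); c /= 3; }
        else if (c % 2 == 0) {
            int k = __builtin_ctz(c);
            printf("    shl eax, %d\n", k);
            c >>= k;
        }
        else return 0;   // leftover prime factor: give up on this approach
    }
    return 1;
}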
But this isn't the only approach. You can do x*11 as x*5*2 + x, like GCC and Clang do (which is a lot like How to multiply a register by 37 using only 2 consecutive leal instructions in x86?):
    lea eax, [rdi + 4*rdi]    # x*5
    lea eax, [rdi + 2*rax]    # x + x*5*2 = x*11
There are 2 approaches for x*17 as well. GCC and Clang do it this way:
mul17:
    mov eax, edi
    sal eax, 4        # x*16
    add eax, edi      # x*16 + x = x*17
    ret
But another way which they fail to use even with -march=sandybridge (no mov-elimination, 1-cycle LEA [reg + reg*scale]) is:
mul17:
    lea eax, [rdi + 8*rdi]    ; x*9
    lea eax, [rax + 8*rdi]    ; x*9 + x*8 = x*17
So instead of multiplying factors, we're adding different multipliers to make the total multiplier.
I don't have any great suggestions how to programmatically search for these sequences beyond the simple ones like 2 set bits, or 2^n +- 1. If you're curious, have a look in GCC or LLVM source code for the functions that do these optimizations; they find a lot of tricky ones.
The work might be split between target-neutral optimization passes for powers of 2 vs. x86-specific target code for using LEA, and for deciding on a threshold of how many instructions is worth it before falling back to imul-immediate.
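One low-tech option that is exhaustive for a 2-instruction budget: brute-force it, tracking only the multiplier values that shifts, LEAs, adds and subs can reach, and ignoring register allocation / liveness entirely. An untested sketch:

#include <stdio.h>

// Multipliers reachable with ONE instruction from the original x (multiplier 1)
// and one intermediate t. Returns how many candidates it wrote to out[].
int one_step(long long x, long long t, long long out[128]) {
    int n = 0;
    for (int k = 1; k <= 31; k++) {
        out[n++] = t << k;                       // shl
        out[n++] = x << k;
    }
    const long long s[] = {2, 4, 8};             // LEA index scale factors
    for (int i = 0; i < 3; i++) {
        out[n++] = t + t * s[i];                 // lea [t + t*s]
        out[n++] = t + x * s[i];                 // lea [t + x*s]
        out[n++] = x + t * s[i];                 // lea [x + t*s]
    }
    out[n++] = t + x;                            // add
    out[n++] = t - x;                            // sub
    out[n++] = x - t;
    return n;
}

int main(void) {
    long long target = 37;                       // constant multiplier to search for
    long long first[128], second[128];
    int n1 = one_step(1, 1, first);              // reachable in one instruction
    for (int i = 0; i < n1; i++) {
        if (first[i] == target) { printf("%lld: 1 instruction\n", target); return 0; }
        int n2 = one_step(1, first[i], second);  // ... then one more
        for (int j = 0; j < n2; j++)
            if (second[j] == target) {
                printf("%lld: 2 instructions via intermediate x*%lld\n", target, first[i]);
                return 0;
            }
    }
    printf("%lld: not reachable in 2 instructions, fall back to imul\n", target);
    return 0;
}

For target = 37 this finds the x*9 intermediate, i.e. the 2-LEA sequence from the multiply-by-37 question linked above.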
Negative numbers
x * -8 could be done with x - x*9. I think that might be safe even if x*9 overflows, but you'd have to double-check on that.
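That identity is easy to sanity-check with wrapping (unsigned) arithmetic, where the overflow is at least well defined; a quick sketch:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    uint32_t probes[] = {0, 1, 3, 0x7fffffffu, 0x80000000u, 0xdeadbeefu};
    for (int i = 0; i < 6; i++) {
        uint32_t x = probes[i];
        uint32_t ref   = x * (uint32_t)-8;    // x * -8, mod 2^32
        uint32_t trick = x - x * 9u;          // the proposed x - x*9
        printf("%08" PRIx32 ": %s\n", x, ref == trick ? "ok" : "MISMATCH");
    }
    return 0;
}

With mod-2^32 wraparound the two agree for every input, which is what matters at the asm level; in C source, signed overflow would still be UB, so the transformation has to be reasoned about in terms of the asm.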
Look at compiler output
#define MULFUN(c) int mul##c(int x) { return x*c; }
MULFUN(9)
MULFUN(10)
MULFUN(11)
MULFUN(12)
...
I put that on the Godbolt compiler explorer for the x86-64 System V ABI (first arg in RDI, like the above examples), with gcc and clang -O3. I used -mtune=bdver2 (Piledriver) because it has somewhat slower multiply than Intel or Zen. This encourages GCC and Clang to avoid imul slightly more aggressively.
I didn't try whether long / uint64_t would change that (6 cycle instead of 4 cycle latency, and half the throughput), or whether an older uarch like -mtune=nocona (Pentium 4) would make a difference. -mtune=bdver2 did make a difference vs. the default tune=generic for GCC at least.
If you use -m32, you can use even older uarches like -mtune=pentium (in-order P5). I'd recommend -mregparm=3 for that so args are still passed in registers, not the stack.