The interesting part of this exercise is finding ways to use 1 or 2 LEA, SHL, and/or ADD/SUB instructions to implement multiplies by various constants.
Actually dispatching on the fly for a single multiply isn't very interesting, and would mean either actual JIT compiling or that you have every possible sequence already present in a giant table of tiny blocks of code. (Like switch statements.)
Instead I'd suggest writing a C or Python or whatever function that takes 1 integer arg, and as output produces the asm source text that implements x * n, where n is the integer arg. I.e. a function like you might find in a compiler that optimizes a multiply-by-constant.
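A bare-bones skeleton of that interface could look like the following (just a sketch of my own, not any compiler's code; the function name, label naming, and Intel-syntax output are arbitrary choices):

#include <stdio.h>

// Print an asm function that computes x*c for x86-64 System V: x arrives in
// EDI, the result goes in EAX. Everything interesting happens where the
// comment is; imul-immediate is the generic fallback.
void emit_mul(int c) {
    printf("mul%d:\n", c);
    // ... try the cheap patterns from the sections below here, best first,
    //     and return early if one of them emitted a sequence ...
    printf("    imul eax, edi, %d\n", c);
    printf("    ret\n");
}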
You might want to cook up an automated way to test this, e.g. by comparing against a pure C x * n for a couple of different x values.
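For example, something like this driver, linked against the .s file your generator produced (the function name mul_test and the constant 15 are placeholders for whatever you generated, not anything standard):

#include <stdio.h>
#include <stdint.h>

#define CONST 15
extern int mul_test(int x);   // the generated asm function

int main(void) {
    int32_t probes[] = {0, 1, -1, 2, 7, 12345, -98765, INT32_MAX, INT32_MIN};
    int n = sizeof probes / sizeof probes[0];
    for (int i = 0; i < n; i++) {
        uint32_t want = (uint32_t)probes[i] * (uint32_t)CONST;  // wrapping reference
        uint32_t got  = (uint32_t)mul_test(probes[i]);
        if (got != want) {
            printf("mismatch for x=%d\n", probes[i]);
            return 1;
        }
    }
    puts("all probes match");
    return 0;
}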
If you can't get the job done in 2 instructions (or 3 with one of them being mov), it's not worth it. Modern x86 has ridiculously efficient multiply in hardware. imul reg, r/m, imm is 1 uop, 3 cycle latency, fully pipelined. (AMD since Zen, Intel since Core2 or Nehalem or so.) That's your fallback for anything that you can't get done with a critical path length of 1 or 2 cycles (assuming zero-latency mov if you want, like IvyBridge+ and Zen.)
Or you could set a higher threshold before fallback if you want to explore more complicated sequences, e.g. aim for 64-bit multiply on Bulldozer-family (6 cycle latency). https://agner.org/optimize/. Or even P5 Pentium where imul takes 9 cycles (not pairable).
Patterns to look for
Integer multiply boils down to adding up shifted copies of one operand, one copy for each 1 bit in the other operand. (See the algorithm for multiplying by a runtime-variable value with shift and add, checking each bit one at a time.)
The easiest pattern is of course only a single set bit, i.e. a power of 2; then it's just a left shift. This is easy to check for: (n & (n-1)) == 0, when n != 0.
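In the skeleton above, that case could be a small helper along these lines (hypothetical try_* naming of my own; it prints the function body and lets the caller handle the label and ret):

#include <stdio.h>

// Power of 2: a single shift (or nothing beyond the mov for c == 1).
// Returns 1 if it emitted a sequence, 0 if this pattern doesn't apply.
int try_pow2(int c) {
    if (c <= 0 || (c & (c - 1)) != 0) return 0;
    int k = __builtin_ctz(c);                  // bit index of the only set bit
    printf("    mov eax, edi\n");
    if (k) printf("    shl eax, %d\n", k);     // x << k == x * 2^k
    return 1;
}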
Anything with exactly 2 set bits is at most 2 shifts and an add. (GNU C __builtin_popcount(n) counts set bits. In x86 asm, SSE4.2 popcnt.)
GNU C __builtin_ctz finds the bit-index of the lowest set bit. Using it on a number you know is non-zero will give you the shift count for the low bit. In x86 asm, bsf / tzcnt.
To clear that lowest set bit and "expose" the next-lowest, you can do n &= n-1;. In x86 asm, BMI1 blsr or LEA / AND.
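Putting those three pieces together, a 2-set-bits helper might look like this (same hypothetical try_* style as above; it always uses the generic mov/shl/add/shl shape and ignores the one-LEA special cases for brevity):

#include <stdio.h>

// Exactly two set bits: x*c = (x << hi) + (x << lo). Emitted as
// x * (2^(hi-lo) + 1), then one final shift by lo if needed.
int try_two_bits(int c) {
    if (c <= 0 || __builtin_popcount(c) != 2) return 0;
    int lo = __builtin_ctz(c);                 // shift count of the low set bit
    int hi = __builtin_ctz(c & (c - 1));       // clear it, find the other bit
    printf("    mov eax, edi\n");
    printf("    shl eax, %d\n", hi - lo);      // x * 2^(hi-lo)
    printf("    add eax, edi\n");              // x * (2^(hi-lo) + 1)
    if (lo) printf("    shl eax, %d\n", lo);   // * 2^lo
    return 1;
}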
Another interesting pattern to look for is 2^n +- 1. The +1 case is already covered by the 2-set-bits case, but there the shift count for the low bit is 0, so no shift is needed. With shift counts up to 3, you can do it in one LEA.
You can detect 2^n - 1 by checking if n+1 is a power of 2 (has only 1 bit set). Somewhat more complex, (2^n - 1) * 2^m can be done with this trick plus another shift. So you could try right-shifting to bring the lowest set bit to the bottom, then looking for tricks.
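A helper for that shape could look like this (a sketch in the same made-up try_* style; it assumes the simpler patterns above were already tried, so it doesn't bother avoiding them):

#include <stdio.h>

// c == (2^n - 1) << m, i.e. a contiguous run of n set bits with m trailing
// zeros: shift left by n, subtract the original, then shift by m.
int try_ones_run(unsigned c) {
    if (c == 0) return 0;
    int m = __builtin_ctz(c);                  // trailing zeros: the 2^m factor
    unsigned run = c >> m;
    if ((run & (run + 1)) != 0) return 0;      // run+1 must be a power of 2
    int n = __builtin_popcount(run);           // length of the run of 1 bits
    printf("    mov eax, edi\n");
    printf("    shl eax, %d\n", n);            // x * 2^n
    printf("    sub eax, edi\n");              // x * (2^n - 1)
    if (m) printf("    shl eax, %d\n", m);     // * 2^m
    return 1;
}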
GCC does x*15 the 2^n - 1 way:
mul15: # gcc -O3 -mtune=bdver2
    mov eax, edi
    sal eax, 4        # x*16
    sub eax, edi      # x*16 - x = x*15
    ret
clang is more efficient (for Intel CPUs where scaled-index is still only 1 cycle latency):
mul15: # clang -O3 -mtune=bdver2
    lea eax, [rdi + 4*rdi]    # x*5
    lea eax, [rax + 2*rax]    # x*5 * 3 = x*15
    ret
Combining these patterns
Maybe factorize your number into its prime factors and look for ways to use your building blocks to do combinations of those factors.
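A greedy version of that factor-peeling idea might look like the sketch below (hypothetical; a real generator would buffer the text and only commit it if the whole constant factors, and would cap the total instruction count):

#include <stdio.h>

// Peel off factors a single LEA (3, 5, 9) or a single shift (2^k) can handle,
// accumulating in EAX. Assumes c > 0. The leading mov could be folded into the
// first LEA by addressing off RDI instead, as in the examples above.
int try_factorize(int c) {
    printf("    mov eax, edi\n");
    while (c > 1) {
        if      (c % 9 == 0) { printf("    lea eax, [rax + 8*rax]\n"); c /= 9; }
        else if (c % 5 == 0) { printf("    lea eax, [rax + 4*rax]\n"); c /= 5; }
        else if (c % 3 == 0) { printf("    lea eax, [rax + 2*rax]\n"); c /= 3; }
        else if (c % 2 == 0) {
            int k = __builtin_ctz(c);
            printf("    shl eax, %d\n", k);
            c >>= k;
        }
        else return 0;   // leftover prime factor: give up on this approach
    }
    return 1;
}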
But this isn't the only approach. You can do x*11 as x*5*2 + x, like GCC and Clang do (which is a lot like How to multiply a register by 37 using only 2 consecutive leal instructions in x86?):
    lea eax, [rdi + 4*rdi]    # x*5
    lea eax, [rdi + 2*rax]    # x + x*5*2 = x*11
There are 2 approaches for x*17 as well. GCC and Clang do it this way:
mul17:
    mov eax, edi
    sal eax, 4        # x*16
    add eax, edi      # x*16 + x = x*17
    ret
But another way which they fail to use even with -march=sandybridge (no mov-elimination, 1-cycle LEA [reg + reg*scale]) is:
mul17:
    lea eax, [rdi + 8*rdi]    ; x*9
    lea eax, [rax + 8*rdi]    ; x*9 + x*8 = x*17
So instead of multiplying factors, we're adding different multipliers to make the total multiplier.
I don't have any great suggestions how to programmatically search for these sequences beyond the simple ones like 2 set bits, or 2^n +- 1. If you're curious, have a look in GCC or LLVM source code for the functions that do these optimizations; they find a lot of tricky ones.
The work might be split between target-neutral optimization passes for powers of 2 vs. x86-specific target code for using LEA, and for deciding on a threshold of how many instructions is worth it before falling back to imul-immediate.
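One low-tech option that is exhaustive for a 2-instruction budget: brute-force it, tracking only the multiplier values that shifts, LEAs, adds and subs can reach, and ignoring register allocation / liveness entirely. An untested sketch:

#include <stdio.h>

// Multipliers reachable with ONE instruction from the original x (multiplier 1)
// and one intermediate t. Returns how many candidates it wrote to out[].
int one_step(long long x, long long t, long long out[128]) {
    int n = 0;
    for (int k = 1; k <= 31; k++) {
        out[n++] = t << k;                       // shl
        out[n++] = x << k;
    }
    const long long s[] = {2, 4, 8};             // LEA index scale factors
    for (int i = 0; i < 3; i++) {
        out[n++] = t + t * s[i];                 // lea [t + t*s]
        out[n++] = t + x * s[i];                 // lea [t + x*s]
        out[n++] = x + t * s[i];                 // lea [x + t*s]
    }
    out[n++] = t + x;                            // add
    out[n++] = t - x;                            // sub
    out[n++] = x - t;
    return n;
}

int main(void) {
    long long target = 37;                       // constant multiplier to search for
    long long first[128], second[128];
    int n1 = one_step(1, 1, first);              // reachable in one instruction
    for (int i = 0; i < n1; i++) {
        if (first[i] == target) { printf("%lld: 1 instruction\n", target); return 0; }
        int n2 = one_step(1, first[i], second);  // ... then one more
        for (int j = 0; j < n2; j++)
            if (second[j] == target) {
                printf("%lld: 2 instructions via intermediate x*%lld\n", target, first[i]);
                return 0;
            }
    }
    printf("%lld: not reachable in 2 instructions, fall back to imul\n", target);
    return 0;
}

For target = 37 this finds the x*9 intermediate, i.e. the 2-LEA sequence from the multiply-by-37 question linked above.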
Negative numbers
x * -8 could be done with x - x*9. I think that might be safe even if x*9 overflows, but you'd have to double-check on that.
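That identity is easy to sanity-check with wrapping (unsigned) arithmetic, where the overflow is at least well defined; a quick sketch:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    uint32_t probes[] = {0, 1, 3, 0x7fffffffu, 0x80000000u, 0xdeadbeefu};
    for (int i = 0; i < 6; i++) {
        uint32_t x = probes[i];
        uint32_t ref   = x * (uint32_t)-8;    // x * -8, mod 2^32
        uint32_t trick = x - x * 9u;          // the proposed x - x*9
        printf("%08" PRIx32 ": %s\n", x, ref == trick ? "ok" : "MISMATCH");
    }
    return 0;
}

With mod-2^32 wraparound the two agree for every input, which is what matters at the asm level; in C source, signed overflow would still be UB, so the transformation has to be reasoned about in terms of the asm.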
Look at compiler output
#define MULFUN(c) int mul##c(int x) { return x*c; }
MULFUN(9)
MULFUN(10)
MULFUN(11)
MULFUN(12)
...
I put that on the Godbolt compiler explorer for the x86-64 System V ABI (first arg in RDI, like the above examples), with gcc and clang -O3. I used -mtune=bdver2 (Piledriver) because it has somewhat slower multiply than Intel or Zen. This encourages GCC and Clang to avoid imul slightly more aggressively.
I didn't try whether long / uint64_t would change that (6 cycle instead of 4 cycle latency, and half the throughput), or whether an older uarch like -mtune=nocona (Pentium 4) would make a difference. -mtune=bdver2 did make a difference vs. the default tune=generic for GCC at least.
If you use -m32, you can use even older uarches like -mtune=pentium (in-order P5). I'd recommend -mregparm=3 for that so args are still passed in registers, not the stack.