Is there another way to calculate (a^3) * b + 5*(b^2) without so many mul instructions?

Question

I'm starting to learn assembly, could someone please review my solution?

Exercise :

You are given eax=a, ebx=b.
Calculate (a^3) * b + 5*(b^2), and store the result inside eax.
(Here * means multiplication, and c^d means c to the power of d).

Here's my solution :

;Init the values
mov eax,3
mov ebx,2

;3,2 =>(3^3)*2 + 5*(2^2) = 27*2 + 5*4 = 54+20 = 74
;5,2 =>(5^3)*2 + 5*(2^2) = 125*2 + 5*4 = 250 + 20 = 270

;(a^3)*b 
mov esi,eax
mul eax
mul esi
mul ebx
;Store the value
mov esi,eax
;5*(b^2)
mov eax,ebx
mul ebx
mov ebx,5
mul ebx
;Calculate (a^3)*b + 5*(b^2)
lea eax,[esi + eax]

Is there a way to solve this with less instructions?

A well known trick for multiplying by 5 (but can be effective for multiplying by numbers that are not power of 2) is to use the `lea` instruction. For instance `lea ebx, [ebx+ebx*4]` would be a single instruction to multiply the contents of `ebx` by 5 . Multiplying something by 4 and adding itself is the same as multiplying by 5 which this example `lea` instruction is doing. — Michael Petch, Aug 14 '15 at 23:30
a^3*b + 5*b^2 = b*(a^3 + 5*b), you essentially save one multiplying instruction as far as I can tell — spyr03, Aug 14 '15 at 23:33
@spyr03 Yep refactoring the equation is also an instruction saver. — Michael Petch, Aug 14 '15 at 23:38

score 2 · Answer 1 · edited May 21 '22 at 19:04

2

Unless you are restricted from using all instructions for some reason, there are 2 and 3 operand forms of imul that will get the job done faster. You can also use the distributive rule to avoid a multiplication:

mov   ecx, eax           ; c = a
imul  ecx, eax           ; c *= a (a*a)
imul  ecx, eax           ; c *= a (a*a*a)
lea   eax, [ebx + ebx*4] ; a = (b + b*4) = 5*b
add   eax, ecx           ; a += c (5*b + a*a*a)
imul  eax, ebx           ; a *= b (b*[5*b + a*a*a])

Confession: To get this I wrote a trivial C function and compiled it to assembly. The compiler even applied the distributive rule to b. The quality of this code shows why writing assembly seldom pays. (I know that these days most people are learning assembly to read it for cyber security purposes, and that's fine.)

edited May 21 '22 at 19:04

Peter Cordes

328,167
45
605
847

answered Aug 15 '15 at 04:13

Gene

46,253
4
58
96

xD, I was still putting the finishing touches on mine when you posted this. I love that we came up with the identical sequence, except I used one more scratch register. /pat the clever compiler for building `a^3` in the temporary, and doing the `lea` into `eax`. – Peter Cordes Aug 15 '15 at 04:22
Yeah. The compiler's brutal objectivity about what instructions do rather than what the mnemonics imply makes it "creative" in the best sense: load effective address, indeed! ;-) – Gene Aug 15 '15 at 05:16
Well using `lea` is obvious. The clever bit is keeping track of what value is where, using fewer registers. As a human, it's easier to keep using the same register for something, instead of keeping track of where your live data has gotten to. But compilers are quite good at that. – Peter Cordes Aug 15 '15 at 05:24

Peter Cordes · Accepted Answer · 2022-05-22T02:27:52.790

Since you don't need a double-width output in EDX:EAX, instead just a 32-bit result from your 32-bit inputs, don't use slower and less convenient 1-operand mul or imul.

The 2-operand imul r, r is faster because it doesn't have to compute the upper half of the product, or store it anywhere. It works for signed and unsigned numbers. It looks like you could have save a lot of mov instructions by using the 2-operand form.

To summarize comments:

Multiplies by small constant factors are often best done with lea. You can replace up to 4 instructions (including a mov) with an lea, since it gives you non-destructive operation. (dest isn't one of the sources).

As spyr03 points out,

a^3*b + 5*b^2 = b*(a^3 + 5*b)

Your final instruction:

lea eax,[esi + eax]

could have been a simple add eax, esi, which can run on more execution ports than lea, so it's less likely to part of a bottleneck. Only use lea if you can't do the equivalent with one other single instruction. (Other than imul / mul. Always replace those with a single lea if possible. For latency over front-end throughput, sometimes worth using up to two LEAs to replace one imul reg, reg, 37 or whatever.)

So I might do:

mov  ecx, eax
imul ecx, eax            ; ecx = a^2
imul eax, ecx            ; eax = a^3
lea  edx, [ebx + 4*ebx]  ; edx = 5*b, ebx still = b
add  eax, edx            ; eax = a^3 + 5*b
imul eax, ebx            ; eax = b * (a^3 + 5*b)

Always comment your asm code. I like the saying that asm code can only have 2 kinds of bugs: code doesn't match comments, or comments don't describe a valid algorithm for doing the task.

Latency:

5*b happens in parallel with a^3, and is much faster (1 cycle). (I put it after the imuls for a to make sure the CPU started working on the longer dep chain ASAP.)
The long dep chain is the one involving eax.

mov(1 or 0) -> imul(3) -> imul(3) -> add(1) -> imul(3) total = 10 cycles

(On IvyBridge and later, reg-reg moves are done at the register-renaming stage, and have zero latency.)

It's not very many instructions, though, and they're all single-uop instructions, so there's plenty of room for other stuff to happen in parallel with it.

I don't see any scope for shortening the dependency chain even at the cost of more instructions. a^3*b and 5*b^2 could be calculated in parallel, and added at the end, but that would still be 3 multiplies in the longer of the two chains.

Is there another way to calculate (a^3) * b + 5*(b^2) without so many mul instructions?

2 Answers2