Your version destroys the caller's ebx value. That register is call-preserved in the i386 System V ABI (which Linux uses). Pick a different tmp reg like EDX. EAX, ECX, and EDX are call-clobbered in all the standard 32-bit calling conventions, so those are the registers you can use for scratch space without saving/restoring.
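If you really did need another register, the way to use a call-preserved one like %ebx is to save/restore the caller's value around your use of it. A minimal sketch (hypothetical function, only to show the push/pop convention):

square_via_ebx:                     # unsigned square_via_ebx(unsigned x)
        pushl   %ebx                # %ebx is call-preserved: save the caller's value
        movl    8(%esp), %ebx       # the arg is at 8(%esp) now, because of the push
        movl    %ebx, %eax
        imull   %ebx, %eax          # x*x, using %ebx as scratch purely for illustration
        popl    %ebx                # restore before returning
        ret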
Then the computation is done using %eax (I suppose this is for performance reasons?) No, it's because imul doesn't have a memory-destination encoding, only r/m source.

If you cared about performance, you wouldn't introduce store/reload latency into the dependency chain for the product you're computing. Even keeping the loop counter in memory would be better (for large counts at least) because dec latency is shorter than imul latency.

In your loop, one input for each iteration comes from the previous iteration. The total latency of store/load/imul / store/load/imul / ... is what bottlenecks your loop on an out-of-order execution CPU (all current x86 CPUs are like that). Store-forwarding makes store/reload only about 5 cycles, but that's still longer than imul latency on most CPUs.
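To put a rough number on that, here's a hedged reconstruction of the store/reload pattern (not your exact code; the offsets and labels are made up for illustration):

find_pow_mem:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp            # the -4(%ebp) slot for the running product
        movl    8(%ebp), %eax       # x
        movl    %eax, -4(%ebp)      # tot = x, kept in memory
        movl    12(%ebp), %ecx      # n
        testl   %ecx, %ecx
        je      done_mem
loop_mem:
        movl    -4(%ebp), %eax      # reload tot:   ~5 cycles (store-forwarding)
        imull   8(%ebp), %eax       # tot *= x:     ~3 cycles
        movl    %eax, -4(%ebp)      # store tot back
        decl    %ecx
        jnz     loop_mem            # ~8 cycles of loop-carried latency per iteration
done_mem:
        movl    -4(%ebp), %eax      # return value
        leave
        ret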
If you're running low on registers and want to keep one of your variables in memory, generally prefer spilling a variable that is only read inside a loop, not written. Re-reading the same memory location multiple times is fine.

In compiler/optimization terminology, "spilling" a variable from a register is when you can't do the preferred thing of keeping it in a register, and instead have to store it to memory.
e.g. imul 8(%ebp), %eax as your loop body (see the sketch after this list):

- you don't need a tmp reg to load+operate on a value from memory; you only need a counter + accumulator.
- the loop-carried dependency chain is only imul -> imul -> imul ... for the critical path.
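A minimal sketch of that, assuming the same argument layout as your version (x at 8(%ebp), n at 12(%ebp)) and the tot = x starting value of the C version further down:

find_pow:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax       # tot = x   (accumulator stays in a register)
        movl    12(%ebp), %ecx      # n         (counter stays in a register too)
        testl   %ecx, %ecx
        je      done                # n == 0: just return x
loop_find_pow:                      # do {
        imull   8(%ebp), %eax       #   tot *= x   (memory-source imul, no tmp reg)
        decl    %ecx
        jnz     loop_find_pow       # } while(--n)
done:
        popl    %ebp
        ret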
You also didn't need to copy to -4(%ebp); you could have just modified 8(%ebp), where you already had this variable in memory. According to the calling convention, you "own" your stack args and can modify that copy of your variables. So even if you did want to modify a value in memory, you still didn't need to copy it.
Your loop structure is also inefficient, although the 3-cycle latency of imul (or worse on some older CPUs) will easily hide the overhead of running all those extra instructions vs. a simple dec %ecx / jnz loop_find_pow at the bottom of your loop.
Why are loops always compiled into "do...while" style (tail jump)?
Look at optimized compiler output for a C version of your function!
How to remove "noise" from GCC/clang assembly output?
(update: this could be simplified further because we're guaranteed that n>0, so we could just directly use a do{}while(--n); loop structure. I missed that in the question originally and will leave that simplification as an exercise for the reader.)
unsigned find_pow(unsigned x, unsigned n) {
    unsigned tot = x;        // or should this be 1?  Possible off-by-1 error
    for(; n ; n--) {         // gcc fails to convert i=0 ; i<n ; i++ into a down-count
        tot *= x;
    }
    return tot;
}
GCC 9.1 -O3 -m32 -fno-tree-vectorize -march=skylake on Godbolt:
find_pow:
        movl    4(%esp), %ecx
        movl    8(%esp), %eax
        movl    %ecx, %edx
        testl   %eax, %eax          # if(n==0) return
        je      .L1
.L3:                                # do{
        imull   %ecx, %edx
        decl    %eax
        jne     .L3                 # }while(--n);
.L1:
        movl    %edx, %eax          # silly compiler, should have used tot=EAX in the first place
        ret
clang unrolls (unless you use -fno-unroll-loops). But it doesn't use multiple accumulators for the product to hide imul latency /facepalm. imul is pipelined with better throughput than latency on modern x86 CPUs, typically 3-cycle latency / 1-cycle throughput. See https://agner.org/optimize/.
But clang just repeats imul %ecx, %eax 8 times in its main loop, which is basically useless. Multiplying into %edx and %eax in parallel, and then multiplying those together at the end, would let two chains of imul run in parallel. Use at least 3 registers for best performance with large n. But practical values of n are relatively small (even 2**32 will overflow a 32-bit integer), so large-n optimizations are only interesting if you want to use this to get the low bits of (x**n) % (2**32).
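For example, something along these lines (a sketch of the two-accumulator idea, not what clang actually emits; label names are mine, and x is re-read from its stack slot since it's only read, per the spilling advice above):

find_pow_2acc:                      # same x**(n+1) semantics as the C version above
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax       # acc0 = x
        movl    $1, %edx            # acc1 = 1
        movl    12(%ebp), %ecx      # n (counter); x stays in memory, it's read-only
        cmpl    $2, %ecx
        jb      odd_tail            # fewer than 2 multiplies left
pair_loop:                          # do {
        imull   8(%ebp), %eax       #   acc0 *= x   (independent chain 1)
        imull   8(%ebp), %edx       #   acc1 *= x   (independent chain 2)
        subl    $2, %ecx            #   n -= 2
        cmpl    $2, %ecx
        jae     pair_loop           # } while (n >= 2)
odd_tail:
        testl   %ecx, %ecx
        je      combine
        imull   8(%ebp), %eax       # one leftover multiply if n was odd
combine:
        imull   %edx, %eax          # acc0 * acc1: the two chains only meet here
        popl    %ebp
        ret

For really large n you'd want a third chain (e.g. push/pop %ebx for another accumulator), as mentioned above.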
I think that's a safe optimization even with wraparound at 2^32: GCC auto-vectorizes with AVX2 vpmulld if available (-march=skylake), which wouldn't be safe if integer multiply wasn't associative.
Clang does unroll with multiple accumulators when optimizing FP loops with -ffast-math (to make it treat FP add/mul as associative), so it's too bad it isn't doing that with integer math. Unrolling by 8 only makes sense if the compiler doesn't realize that n is probably usually small; if clang is going to optimize for large n, it should do it right and hide imul latency.
When auto-vectorizing with AVX2, clang does get this somewhat right and uses 4 vectors of totals in parallel. But vpmulld has 10-cycle latency on Haswell/Skylake and clang unrolls by 32, not just 4. Seems like way too much.