Is the result supposed to be an integer as well? `x^-n` is just `1/x^n`, which truncates to zero for any `x` with `|x| > 1`. e.g. `pow(16, -2)` is `1/256`.
For an integer return value, just check for a negative `n` up front and return `1` or `0`. For an FP return value, you can run an unsigned implementation on the absolute value of `n`, and conditionally take the reciprocal.
For large `n` magnitude, you might want to use an FP exp/log based implementation (see my comments on the question, and How can I write a power function myself?) instead of a loop-through-the-bits implementation.
For pure integer with an unsigned exponent (or signed positive), a nice branchless implementation is possible, using the usual algorithm of right-shifting the exponent and multiplying the result by the running square of the base when the bit shifted out was set. (See https://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms for the math behind the algorithm, and code in Python.)
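In C, that bit-loop looks like this (a sketch of the same idea as the linked Python code, not a verified port of it):

```c
#include <stdint.h>

/* exponentiation by squaring: peel exponent bits off the low end,
 * squaring the base each step and multiplying into the result
 * when the bit that was shifted out is set */
uint32_t ipow_u(uint32_t a, uint32_t n) {
    uint32_t res = 1;
    while (n) {
        if (n & 1)
            res *= a;       /* this bit's power contributes */
        a *= a;             /* a, a^2, a^4, a^8, ... */
        n >>= 1;
    }
    return res;             /* wraps mod 2^32 on overflow, like the asm version */
}
```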
We can use `shr` to right-shift, CMOV on the bit shifted out (CF), and loop-branch on the remaining value (ZF).
This version passes args in the same registers that x86-64 System V would, but it will assemble in 32-bit mode just fine. You can of course adapt it to whatever calling convention you like; it needs 4 registers so you might need to save/restore a call-preserved reg in 32-bit calling conventions.
It's similar to, but better than, what you get from x86-64 C compilers for a straight port of the Python implementation. (https://godbolt.org/z/L9Kb98 gcc / clang structure the loop with a `test sil,1` / `cmov` inside it, separate from the loop branch on the `shr` result.)
;; untested
; integer power
; args in EDI,ESI (like the x86-64 System V calling convention)
; returns in EAX
; clobbers: EDX, EDI, ESI
ipown:                      ; (int a (EDI), unsigned n (ESI))
    mov     eax, 1          ; res = 1
    test    esi, esi        ; early-out on the exponent, not the base
    jz      .zero_exponent
.loop:
    mov     edx, eax        ; tmp = res
    imul    eax, edi        ; res *= a (overwritten with old res if n&1 == 0)
    imul    edi, edi        ; a *= a
    shr     esi, 1          ; n >>= 1 setting ZF according to result, and CF = bit shifted out (old_n & 1)
    cmovnc  eax, edx        ; res = tmp if the bit was zero, so we don't do res *= a this time
    jnz     .loop
.zero_exponent:
    ret
On a Broadwell or later Intel CPU, or AMD Ryzen, where we have 1-cycle CMOV and 3-cycle-latency `imul`, this will hopefully run at 4 cycles per iteration (the `imul` -> `cmov` dependency chain through EAX).
`imul` is fully pipelined on modern x86 (or at least sufficiently pipelined on AMD Bulldozer-family), but with a throughput of only 1 per clock, so there's a potential resource conflict between the two `imul` instructions that could both be waiting for `edi` to be ready. But the 3-cycle dep chain through EDI should get ahead of the 4-cycle imul/cmov chain through EAX, so on any cycle where both an `imul eax,edi` and an `imul edi,edi` are ready to start, oldest-ready-first scheduling should make the right choice and start the `imul eax,edi`.
Notice that the `mov edx,eax` is off the critical path: it can run in parallel with the `imul`s. If I'd instead multiplied into the copy (`mov edx,eax` / `imul edx,edi`), the `mov` would be on the critical path and would hurt latency on CPUs without mov-elimination for integer registers.
Of course, the max loop trip-count is only 32 (if the high bit is set in the exponent), so out-of-order execution can see through this to the end of the loop (and hopefully resolve the loop-exit mispredict before the multiplies get there).
This has few instructions in the loop (compared to its throughput) so it should be able to overlap significantly with independent instructions before/after.
The expected latency is approximately 4 cycles * trip_count = about 4 * log2(n), i.e. 4 times the position of the highest set bit in the exponent.
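To make that estimate concrete (my arithmetic, using the gcc/clang builtin `__builtin_clz`): the trip count is one more than the index of the highest set bit, so e.g. `n = 1000` (highest bit 9) gives 10 iterations and roughly 40 cycles of latency.

```c
/* trip count of the bit-loop for a nonzero 32-bit exponent:
 * index of the highest set bit, plus one */
static int trip_count(unsigned n) {
    return 32 - __builtin_clz(n);   /* gcc/clang builtin; n must be nonzero */
}

/* estimated loop latency in cycles, at ~4 cycles per iteration */
static int est_latency(unsigned n) {
    return 4 * trip_count(n);
}
```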
For an FP version of this, x87 might actually be interesting for `fcmov`. Otherwise you could maybe use shifts and SSE4 `blendvps` to blend based on the high bit of another register. (`0.0` is the additive identity, but not the multiplicative identity, so ANDing with compare results doesn't just work.)