-2

i using IA32 assembly,

I would like to create a function which, given the two input numbers, rests the value of the power, the result must be maximum in 32 bits. The base is unsigned int base always positive while the exponent is int. so both negative and 0. thanks in advance

  • 1
    Hi giggi, can you provide the code you have written so far that is not working, and also a test case for desired two inputs and the exact output you seek? It is customary on here to provide very specific questions based on minimum reproducible examples. Thanks and good luck :) – mysteRious Nov 16 '18 at 16:17
  • What ranges of values are important for performance? Can you assume PPro instructions like `cmov` for branchless handing of the 0/1 check inside a shift loop? That might be worth it for throughput, even though it creates a potentially long data dependency. You'll want to use the usual integer power algorithm. http://wiki.c2.com/?IntegerPowerAlgorithm – Peter Cordes Nov 16 '18 at 16:24
  • Related: [Why isn't int pow(int base, int exponent) in the standard C++ libraries?](https://stackoverflow.com/a/13771544) has some performance experiments for a linear O(n) time vs. O(log2(n)) time shift implementation with a clunky recursive C implementation that compilers hopefully optimize into a loop, vs. the standard library function using floating point which can actually be faster for cases other than very small `n`. See [How can I write a power function myself?](https://stackoverflow.com/a/2882886) – Peter Cordes Nov 16 '18 at 17:27
  • For an FP version, you'd need either x87 `FYL2X` and so on, or an SSE implementation of exp2 and log2. See [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/q/48863719) and [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089) for some tips on making a fast version that might be accurate enough to round to the correct nearest integer when converting back to integer. – Peter Cordes Nov 16 '18 at 17:30

1 Answers1

1

Is the result supposed to be an integer as well? x^-n is just 1/x^n, which rounds to zero for any x other than 1. e.g. pow(16, -2) is 1/256.

For an integer return value, just check for positive n or return 1 or 0. For an FP return value, you can use an unsigned implementation with the absolute value, and conditionally take the reciprocal.

For large n magnitude, you might want to use an FP exp/log based implementation (see my comments on the question, and How can I write a power function myself?) instead of a loop-through-the-bits implementation.


For pure integer with an unsigned exponent (or signed positive), a nice branchless implementation is possible, using the usual algorithm of right-shifting the exponent and multiplying the result if the current bit was set. (See https://eli.thegreenplace.net/2009/03/21/efficient-integer-exponentiation-algorithms for the math behind the algorithm, and code in Python.)

We can use shr to right shift and CMOV on the bit shifted out, and loop-branch on the remaining value.

This version passes args in the same registers that x86-64 System V would, but it will assemble in 32-bit mode just fine. You can of course adapt it to whatever calling convention you like; it needs 4 registers so you might need to save/restore a call-preserved reg in 32-bit calling conventions.

It's similar to but better than what you get from x86-64 C compilers for a straight port of the Python implementation. (https://godbolt.org/z/L9Kb98 gcc / clang structure the loop with a test sil,1/cmov` inside it, separate from the loop branch on the shr result.)

;; untested
; integer power
; args in EDI,ESI  (like the x86-64 System V calling convention)
; returns in EAX
; clobbers: EDX, EDI, ESI
ipown:   ; (int a (EDI), unsigned n (ESI))
    mov    eax, 1       ; res = 1

    test   edi,edi
    jz    .zero_exponent

.loop:
    mov    edx, eax      ; tmp = res
    imul   eax, edi      ; res *= a  (will be overwritten with old res if n&1 == 0)
    imul   edi, edi      ; a*=a

    shr    esi, 1        ; n>>=1  setting ZF according to result, and CF= bit shifted out (old_n&1)
    cmovnc  eax, edx     ; res = tmp if the bit was zero so we don't do res *= a this time
    jnz   .loop

.zero_exponent:
    ret

On a Broadwell or later Intel CPU, or AMD Ryzen, where we have 1 cycle CMOV and 3 cycle latency imul, this will hopefully run at 4 cycles per iteration (imul -> cmov dependency chain through EAX).

imul is fully pipelined on modern x86 (or at least sufficiently pipelined on AMD Bulldozer-family), but with a throughput of only 1 per clock, so there's a potential resource conflict between the two imul instructions that could both be waiting for edi to be ready. But the 3-cycle dep chain through EDI should get ahead of the 4 cycle imul/cmov chain, so on any cycle where both an imul eax,edi and an imul edi,edi are ready to start, the oldest-ready-first scheduling should make the right choice and start the imul eax,edi.

Notice that the mov edx,eax is off the critical path: it can run in parallel with imul. If I'd done tmp *= edi, the mov would be on the critical path and hurt latency on CPUs without mov-elimination for integer registers.


Of course, the max loop trip-count is only 32 (if the high bit is set in the exponent), so out-of-order execution can see through this to the end of the loop (and hopefully resolve the loop-exit mispredict before the multiplies get there).

This has few instructions in the loop (compared to its throughput) so it should be able to overlap significantly with independent instructions before/after.

The expected latency is approximately 4 cycles *trip_count = 4 * log2(n), i.e. 4 * the position of the highest set bit in the exponent.


For an FP version of this, x87 might actually be interesting for fcmov. Otherwise you could maybe use shifts and SSE4 blendvps to blend based on the high bit of another register. 0.0 is the additive identity, but not the multiplicative identity, so ANDing with compare results doesn't just work.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847