
I've been working on a few Project Euler exercises to improve my knowledge of C++.

I've written the following function:

// Find Pythagorean triples a < b < c with a + b + c == SUMTOTAL
int a = 0, b = 0, c = 0;

for (a = 1; a <= SUMTOTAL; a++)
{
    for (b = a+1; b <= SUMTOTAL-a; b++)
    {
        c = SUMTOTAL-(a+b);

        if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
        {
            std::cout << "a: " << a << " b: " << b << " c: "<< c << std::endl;
            std::cout << a * b * c << std::endl;
        }
    }
}

This computes in 17 milliseconds.

However, if I change the line

if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)

to

if (c == sqrt((a*a)+(b*b)) && b < c)

the computation takes place in 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I'm missing which makes the first expression compute so much slower?

Fang
  • `a*a` is probably 1 instruction. `pow` is at least a function call, plus whatever work the function does. – jtbandes Dec 10 '16 at 06:24
  • *This computes in 17 milliseconds.* -- First, `pow` is a floating point function. Second, posting how much time a function takes only makes sense if you're running an optimized build. If you're running an unoptimized or "debug" build, the time is meaningless. And last, but not least, [don't use pow if the exponent is an integer](http://stackoverflow.com/questions/25678481/why-does-pown-2-return-24-when-n-5-with-my-compiler-and-os) – PaulMcKenzie Dec 10 '16 at 06:33
  • This [review](https://codereview.stackexchange.com/questions/145221/disproving-euler-proposition-by-brute-force-in-c) might be interesting for you. It's both a library call, as well as an "overpowered" function as ringo said. – Zeta Dec 10 '16 at 08:35
  • It's probably faster if you use `c*c == a*a + b*b`: multiplication, especially integer multiplication, is faster than square root. But it's only correct if `c*c` doesn't overflow. – Roel Schroeven Dec 10 '16 at 10:51
  • @RoelSchroeven But if `c*c` overflows, then `a*a + b*b` would also overflow (assuming that they are in fact equal), so it probably should not matter much. – tobias_k Dec 10 '16 at 13:07
  • Another tip: you can speed up things if you reverse the order of the conditionals: `if (b < c && c*c == a*a + b*b)` instead of `if (c*c == a*a + b*b && b < c)`: `b < c` is a fast operation, and allows the program to skip the relatively slow calculation when it's not needed. – Roel Schroeven Dec 10 '16 at 16:58
  • @RoelSchroeven: Even better: put that check into the loop condition: `for(... ; b <= SUMTOTAL-a-b ; ...)`, because `c` decreases by one each iteration, so the inner-loop should just end once they cross. See my answer for a near-optimal version using the sqrt-elimination technique (although without loop unrolling or AVX). – Peter Cordes Dec 10 '16 at 17:54
  • @PaulMcKenzie: Actually, if the exponent is an integer, then the pow function will be slow (making the code run 8.5 times longer) instead of excessively slow - any decent implementation of the pow function will recognise that the exponent is just two and evaluate pow (x, 2.0) as x*x. So the overhead is just conversion to double and a function call. For general values of the exponent it's likely at least a logarithm and an exponential function. – gnasher729 Dec 10 '16 at 22:04
  • I might also hoist `a*a` out of the `b` loop. That subexpression is constant throughout that loop. Even better, since from a^2, one wishes to compute (a+1)^2, merely add 2a+1, rather than performing another multiply. Same for incremental b^2. Your compiler might be doing this hoisting and strength reduction for you. – Eric Towers Dec 11 '16 at 03:09
  • Did you use optimization e.g. `-O3` or `-O2`? If not, as @PaulMcKenzie said, your performance results are not interesting. – Z boson Dec 14 '16 at 13:32
  • Many developers prefer std::pow just because it is a function from the standard library, and unfortunately the standard library lacks something like square(x) or sqr(x), but one can easily implement them by oneself as { return x*x; } – Fedor May 03 '21 at 11:57

2 Answers

76

pow() works with real floating-point numbers and, under the hood, generally uses the formula

pow(x,y) = e^(y log(x))

to calculate x^y. The ints are converted to double before calling pow. (log is the natural logarithm, base e.)

x^2 using pow() is therefore slower than x*x.
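
For integer bases and exponents you can stay in integer arithmetic entirely. A minimal sketch (the `square` and `ipow` helpers are hypothetical names of mine, not standard-library functions):

#include <cstdint>

// Hypothetical helpers, assuming only integer powers are needed.
inline int64_t square(int64_t x) { return x * x; }   // what pow(x, 2) boils down to

// Exponentiation by squaring: O(log n) multiplications, exact as long as the
// result fits in int64_t (no conversion to double, no rounding).
inline int64_t ipow(int64_t base, unsigned exp)
{
    int64_t result = 1;
    for (;;) {
        if (exp & 1) result *= base;
        exp >>= 1;
        if (exp == 0) break;
        base *= base;
    }
    return result;
}

With this, ipow(5, 2) is exactly 25, sidestepping the rounding questions discussed in the comments.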

Edit based on relevant comments

  • Using pow even with integer exponents may yield incorrect results (PaulMcKenzie)
  • In addition to using a math function with double type, pow is a function call (while x*x isn't) (jtbandes)
  • Many modern compilers will in fact optimize out pow with constant integer arguments, but this should not be relied upon.
Déjà vu
  • Not only is it slower, you can get inexact answers, even for integer base and exponents. – PaulMcKenzie Dec 10 '16 at 06:38
  • This is mostly correct, except that usually an optimized implementation will use a logarithm with base 2, since there are fast ways to compute 2^n. @PaulMcKenzie pow(integer, integer) will always give an exact result as long as the result is representable as floating point. At least this is the case for most reputable libm implementations – Yan Zhou Dec 10 '16 at 06:42
  • @YanZhou -- It will not always give the exact results, else [this would never have been asked](http://stackoverflow.com/questions/25678481/why-does-pown-2-return-24-when-n-5-with-my-compiler-and-os?noredirect=1&lq=1) – PaulMcKenzie Dec 10 '16 at 06:44
  • @PaulMcKenzie As I said, it is the case with reputable libm. Not every libm gives exact results. As far as I know, AMD libm, Intel libimf, OpenLibm, BSD libm and its derivatives such as the one in macOS will all give you `pow(5, 2) == 25`, the example you cited. GNU libc is the most widely used on Linux, but that does not make it the gold standard – Yan Zhou Dec 10 '16 at 06:50
  • @PaulMcKenzie One correction: GNU libc should also give the same results. The post you cited is probably using GCC+Code Blocks on Windows; MS libm was the less reputable one. – Yan Zhou Dec 10 '16 at 06:52
  • My point is that nothing in the C++ standard guarantees that `pow` gives exact results, even with integer exponents. Make life happy and compute pow using a lookup table, or some other means that guarantees there is no round-off error, regardless of the library being used. – PaulMcKenzie Dec 10 '16 at 06:53
  • Are you sure that pow implements `pow(x,y) = e^(y log(x))` for `pow(double,int)`? For example pow(x,8) could be done in three multiplication operations as `return x = x*x, x = x*x, x = x*x`. Agner Fog in his Vector Class Library implements `pow(double, int)` using template meta-programming to achieve this for SSE/AVX/AVX512 which I think is likely much faster than using `pow(x,y) = e^(y log(x))`. – Z boson Dec 14 '16 at 10:25
  • I checked GCC and `pow(int,2)` generates `cvtsi2sd xmm0, edi` and `mulsd xmm0, xmm0`, so I think your answer is misleading because it gives the impression that the implementation has to use `e^(y log(x))` when in fact the only thing it has to do is convert to double. – Z boson Dec 14 '16 at 10:32
  • @Zboson Compilers may optimize depending on arguments, this is the last point of the answer. – Déjà vu Dec 14 '16 at 12:33
  • But you can't rely on `pow(x,y) = e^(y log(x))` either, obviously. The only thing you can rely on is your first statement that it works with floating point numbers. I doubt any good C++ compiler uses `e^(y log(x))` for `pow(x,2)`. I checked GCC 6.2, Clang 3.9, and ICC 17 (I did not check MSVC) and they all do `x*x` and double conversions. Your answer is misleading. – Z boson Dec 14 '16 at 12:50
  • @Zboson `pow` general function uses the logs - particular cases like this one may benefit from optimizations, but it is not recommended to rely on such optimizations to do exponentiation. – Déjà vu Dec 14 '16 at 13:01
  • The OP asked a specific question not a general one. The slow down has nothing to do with `pow(x,y) = e^(y log(x))` (unless the OP compiled without optimization). – Z boson Dec 14 '16 at 13:08
  • I checked higher powers such as 4 and 8 and unfortunately the compilers do call `pow` in this case rather than optimize it to `log2(power)` multiplication operations. There is clearly some room for optimization still as only the power of 2 is optimized. – Z boson Dec 14 '16 at 13:11
40

You've picked one of the slowest possible ways to check

c*c == a*a + b*b   // assuming c is non-negative

Checked directly, that's just three integer multiplications (one of which can be hoisted out of the loop). With your version, even without pow(), you're still converting to double and taking a square root, which is terrible for throughput. (And also for latency, but branch prediction + speculative execution on modern CPUs mean that latency isn't a factor here.)

Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4 times slower than what I got gcc to emit (below).


You can also optimize the loop condition to break out of the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.

void foo_optimized()
{ 
  for (int a = 1; a <= SUMTOTAL; a++) {
    for (int b = a+1; b < SUMTOTAL-a-b; b++) {
        // int c = SUMTOTAL-(a+b);   // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(
        int c = (SUMTOTAL-a) - b;
        // if (b >= c) break;  // just changed the loop condition instead

        // the compiler can hoist a*a out of the loop for us
        if (/* b < c && */ c*c == a*a + b*b) {
            // Just print a newline.  std::endl also flushes, which bloats the asm
            std::cout << "a: " << a << " b: " << b << " c: "<< c << '\n';
            std::cout << a * b * c << '\n';
        }
    }
  }
}

This compiles (with gcc6.2 -O3 -mtune=haswell) to code with this inner loop. See the full code on the Godbolt compiler explorer.

# a*a is hoisted out of the loop.  It's in r15d
.L6:
    add     ebp, 1    # b++
    sub     ebx, 1    # c--
    add     r12d, r14d        # ivtmp.36, ivtmp.43  # not sure what this is or why it's in the loop, would have to look again at the asm outside
    cmp     ebp, ebx  # b, _39
    jg      .L13    ## This is the loop-exit branch, not-taken until the end
                    ## .L13 is the rest of the outer loop.
                    ##  It sets up for the next entry to this inner loop.
.L8:
    mov     eax, ebp        # multiply a copy of the counters
    mov     edx, ebx
    imul    eax, ebp        # b*b
    imul    edx, ebx        # c*c
    add     eax, r15d       # a*a + b*b
    cmp     edx, eax  # tmp137, tmp139
    jne     .L6
 ## Fall-through into the cout print code when we find a match
 ## extremely rare, so should predict near-perfectly

On Intel Haswell, all these instructions are 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, which can issue at one iteration per 2.5 cycles.

Haswell runs imul r32, r32 with a throughput of one iteration per clock, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5c. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.

We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.

Loop-unrolling could help here to reduce the number of uops per check. e.g. use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without using a MOV to make it non-destructive.


A strength-reduction is also possible: Given b*b we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1, so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There are no addressing-modes that subtract instead of add. Hmm, maybe we could keep -b in a register, and count up towards zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX).

Unless you actually bottleneck on IMUL throughput, this might end up taking more uops and not be a win. It might be fun to see how well a compiler would do with this strength-reduction in the C++ source.
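
As a rough sketch only (untested, and the variable layout is my own), the same strength-reduction can be written at the C++ level, keeping running values of b*b and c*c and updating them with adds, so the compiler can judge whether it beats the IMULs:

// Untested sketch: incremental squares instead of re-multiplying each iteration.
// (b+1)^2 = b^2 + 2b + 1  and  (c-1)^2 = c^2 - 2c + 1.
void foo_strength_reduced()
{
  for (int a = 1; a <= SUMTOTAL; a++) {
    int aa = a * a;               // hoisted: constant over the inner loop
    int b = a + 1;
    int c = (SUMTOTAL - a) - b;
    int bb = b * b;               // running value of b*b
    int cc = c * c;               // running value of c*c
    for (; b < c; bb += 2*b + 1, ++b, cc -= 2*c - 1, --c) {
        if (cc == aa + bb) {
            std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
            std::cout << a * b * c << '\n';
        }
    }
  }
}

Whether that's a win depends on whether the extra loop-carried adds cost more than the two IMULs they replace; as noted above, IMUL throughput isn't the bottleneck here.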


You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check if any of the 8 had a hit and then sort out which one it was in the rare case that there was a match.
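
For what it's worth, here is a hypothetical (untested) SSE4.1 sketch of that idea, checking 4 consecutive b values per step; the function name and the scalar re-check on a hit are assumptions of mine, and AVX2 would widen this to 8 lanes:

#include <smmintrin.h>   // SSE4.1, for _mm_mullo_epi32

// True if any of the lanes b0, b0+1, b0+2, b0+3 satisfies c*c == a*a + b*b,
// with c = sum - a - b.  Hits are rare, so on a hit the caller re-checks the
// four candidates with scalar code (including the b < c condition).
static inline bool any_hit_4(int a, int b0, int sum)
{
    __m128i b   = _mm_setr_epi32(b0, b0 + 1, b0 + 2, b0 + 3);
    __m128i c   = _mm_sub_epi32(_mm_set1_epi32(sum - a), b);   // c = sum - a - b
    __m128i bb  = _mm_mullo_epi32(b, b);                       // b*b per lane
    __m128i cc  = _mm_mullo_epi32(c, c);                       // c*c per lane
    __m128i lhs = _mm_add_epi32(_mm_set1_epi32(a * a), bb);    // a*a + b*b
    __m128i eq  = _mm_cmpeq_epi32(cc, lhs);
    return _mm_movemask_epi8(eq) != 0;
}

A real loop would keep the b and c vectors in registers and step them by adding/subtracting a constant 4 each iteration instead of rebuilding them every call.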

See also the tag wiki for more optimization stuff.

Peter Cordes
  • How about: for (int a = 1...) { int aSquare = a * a; --- pull the multiplication out of the loop and save a cycle per loop. From the assembly you quote the compiler didn't see that one. – Loren Pechtel Dec 11 '16 at 05:48
  • @LorenPechtel: The compiler already hoists `a*a` (and keeps it in `r15d`). The two IMULs are for `b` and `c`, which both change inside the loop. I tried doing it manually, but the asm didn't change at all. I couldn't think of any clever transformations to do manually to remove the redundancy of doing two separate multiplies for two variables related by such a simple equation. – Peter Cordes Dec 11 '16 at 06:19
  • @LorenPechtel: Added hand-written comments to the asm, to save people time trying to figure out what the compiler did :) – Peter Cordes Dec 11 '16 at 06:26
  • About manually using SIMD instructions: doesn't the compiler do automatic vectorization on simple, embarrassingly parallel loops like these? – étale-cohomology Dec 11 '16 at 07:27
  • @étale-cohomology: auto-vectorization is only effective because the `if()` is very rarely taken, and compilers don't know that. If there were very few full vectors with no elements that took the `if()`, it could easily be a loss to do a vectorized computation. (Although instead of redoing the computation to see which elements to print, you could just scan a bitmap from PMOVMSKB...) Anyway, yes auto-vectorization can work, but no this problem is not simple for compilers because of the conditional behaviour. Compiler technology is not at the point where I had any hope that they'd do this. – Peter Cordes Dec 11 '16 at 08:30
  • @PeterCordes Oops, misread the code. I got misled by the c*c and thought it was doing a*a and b*b rather than b*b and c*c. – Loren Pechtel Dec 11 '16 at 21:08
  • @LorenPechtel: yeah, I found the OP's code confusing, too, with everything declared outside the loops even though it's C++. I assumed those were function inputs, especially since they were initialized! I definitely misunderstood the loop at first. – Peter Cordes Dec 11 '16 at 21:23
  • @PeterCordes: Our professor always told us that declaring variables outside the loop improves performance. Of course, I've read varying opinions on that, depending on compiler and programming language. – Fang Dec 12 '16 at 08:20
  • @Fang: It's definitely not true for C. All optimizing compilers just see the data-movement required to implement the code. If anything, limiting the scope of a variable (and using separate variables for separate loops) will help the compiler see that they're independent, and dead after the loop. – Peter Cordes Dec 12 '16 at 18:59
  • So `pow(double x, int n)` can be done in `log2(n)` multiplications e.g. `pow(x,4) { return x=x*x, x=x*x, x=x*x, x=x*x;}`. The problem I see with this method is the dependency chain i.e. it's latency bound. I don't see any way to make it throughput bound. I guess my main point is that `O(ln(n))` operations win for large n but for small n `O(n)` operations may be better due to latency. – Z boson Dec 15 '16 at 08:45
  • @Zboson: pow(x,4) only takes 2 steps, not 4, using the O(log2(n)) method. What exactly are you suggesting as an O(n) method that bottlenecks on throughput? Surely not doing the same multiplication more than once just to keep the multipliers busy? Also, did you mean to leave that comment in the other thread where `pow(double,int)` was being discussed? – Peter Cordes Dec 15 '16 at 14:58
  • I meant pow(x,16) can be done in log2(16)=4 multiplication. I suggested that the O(log2(n)) operation bottlenecks on latency but not the O(n) operation. Yes, this comment was meant for you. I am asking a general question. – Z boson Dec 15 '16 at 15:06
  • @Zboson: But what is the O(n) implementation? The only ones I can think of are just serial left-to-right which bottlenecks on latency, or `((x*x) * (x*x)) * ...` where it would be totally redundant to calculate `x*x` 8 times in parallel. I think one of us is going to kick themselves for missing something obvious, but I'm not sure who yet :P – Peter Cordes Dec 15 '16 at 15:17
  • Nevermind, I am clearly rusty as I have contributed very little to these forums in the last 3 months. I was thinking about a reduction with threads and how the O(n) method can be faster than the O(ln(n)) for a small number of threads. But anyway, my main point was just that even for O(ln(n)) that it bottlenecks on latency and that's unfortunate. – Z boson Dec 15 '16 at 19:33
  • BTW, you are a prolific poster and your posts are elaborate. Where do you find the time? I have no time with my new job and I mostly work with R now anyway. – Z boson Dec 15 '16 at 19:39
  • @Zboson: I don't have a job, so I get to spend as much time as I want writing SO answers that interest me, and/or contributing to open-source software. – Peter Cordes Dec 15 '16 at 19:52
  • What OSS projects are you involved with? I guess this is not the place to discuss this. You can see how rusty I am. – Z boson Dec 15 '16 at 19:58