154

Which of these two methods is more efficient in C? And how about:

pow(x,3)

vs.

x*x*x // etc?
jamylak

7 Answers

120

UPDATE 2021

I've modified the benchmark code as follows:

  • std::chrono used for timing measurements instead of boost
  • C++11 <random> used instead of rand()
  • Avoid repeated operations that can get hoisted out. The base parameter is ever-changing.

I get the following results with GCC 10 -O2 (in seconds):

exp     c++ pow     c pow       x*x*x...
2       0.204243    1.39962     0.0902527   
3       1.36162     1.38291     0.107679    
4       1.37717     1.38197     0.106103    
5       1.3815      1.39139     0.117097

GCC 10 -O3 is almost identical to GCC 10 -O2.

With GCC 10 -O2 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2       0.203625    1.4056      0.0913414   
3       0.11094     1.39938     0.108027    
4       0.201593    1.38618     0.101585    
5       0.102141    1.38212     0.10662

With GCC 10 -O3 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2       0.0451995   1.175       0.0450497   
3       0.0470842   1.20226     0.051399    
4       0.0475239   1.18033     0.0473844   
5       0.0522424   1.16817     0.0522291

With Clang 12 -O2:

exp     c++ pow     c pow       x*x*x...
2       0.106242    0.105435    0.105533    
3       1.45909     1.4425      0.102235    
4       1.45629     1.44262     0.108861    
5       1.45837     1.44483     0.1116

Clang 12 -O3 is almost identical to Clang 12 -O2.

With Clang 12 -O2 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2       0.0233731   0.0232457   0.0231076   
3       0.0271074   0.0266663   0.0278415   
4       0.026897    0.0270698   0.0268115   
5       0.0312481   0.0296402   0.029811    

Clang 12 -O3 -ffast-math is almost identical to Clang 12 -O2 -ffast-math.

Machine is Intel Core i7-7700K on Linux 5.4.0-73-generic x86_64.

Conclusions:

  • With GCC 10 (no -ffast-math), x*x*x... is always faster
  • With GCC 10 -O2 -ffast-math, std::pow is as fast as x*x*x... for odd exponents
  • With GCC 10 -O3 -ffast-math, std::pow is as fast as x*x*x... for all test cases, and is around twice as fast as -O2.
  • With GCC 10, C's pow(double, double) is always much slower
  • With Clang 12 (no -ffast-math), x*x*x... is faster for exponents greater than 2
  • With Clang 12 -ffast-math, all methods produce similar results
  • With Clang 12, pow(double, double) is as fast as std::pow for integral exponents
  • Writing benchmarks without having the compiler outsmart you is hard.

I'll eventually get around to installing a more recent version of GCC on my machine and will update my results when I do so.

Here's the updated benchmark code:

#include <cmath>
#include <chrono>
#include <iostream>
#include <random>

using Moment = std::chrono::high_resolution_clock::time_point;
using FloatSecs = std::chrono::duration<double>;

inline Moment now()
{
    return std::chrono::high_resolution_clock::now();
}

#define TEST(num, expression) \
double test##num(double b, long loops) \
{ \
    double x = 0.0; \
\
    auto startTime = now(); \
    for (long i=0; i<loops; ++i) \
    { \
        x += expression; \
        b += 1.0; \
    } \
    auto elapsed = now() - startTime; \
    auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed); \
    std::cout << seconds.count() << "\t"; \
    return x; \
}

TEST(2, b*b)
TEST(3, b*b*b)
TEST(4, b*b*b*b)
TEST(5, b*b*b*b*b)

template <int exponent>
double testCppPow(double base, long loops)
{
    double x = 0.0;

    auto startTime = now();
    for (long i=0; i<loops; ++i)
    {
        x += std::pow(base, exponent);
        base += 1.0;
    }
    auto elapsed = now() - startTime;

    auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed);
    std::cout << seconds.count() << "\t";

    return x;
}

double testCPow(double base, double exponent, long loops)
{
    double x = 0.0;

    auto startTime = now();
    for (long i=0; i<loops; ++i)
    {
        x += ::pow(base, exponent);
        base += 1.0;
    }
    auto elapsed = now() - startTime;

    auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed);
    std::cout << seconds.count() << "\t";

    return x;
}

int main()
{
    using std::cout;
    long loops = 100000000l;
    double x = 0;
    std::random_device rd;
    std::default_random_engine re(rd());
    std::uniform_real_distribution<double> dist(1.1, 1.2);
    cout << "exp\tc++ pow\tc pow\tx*x*x...";

    cout << "\n2\t";
    double b = dist(re);
    x += testCppPow<2>(b, loops);
    x += testCPow(b, 2.0, loops);
    x += test2(b, loops);

    cout << "\n3\t";
    b = dist(re);
    x += testCppPow<3>(b, loops);
    x += testCPow(b, 3.0, loops);
    x += test3(b, loops);

    cout << "\n4\t";
    b = dist(re);
    x += testCppPow<4>(b, loops);
    x += testCPow(b, 4.0, loops);
    x += test4(b, loops);

    cout << "\n5\t";
    b = dist(re);
    x += testCppPow<5>(b, loops);
    x += testCPow(b, 5.0, loops);
    x += test5(b, loops);

    std::cout << "\n" << x << "\n";
}

Old Answer, 2010

I tested the performance difference between x*x*... vs pow(x,i) for small i using this code:

#include <cstdlib>
#include <cmath>
#include <boost/date_time/posix_time/posix_time.hpp>

inline boost::posix_time::ptime now()
{
    return boost::posix_time::microsec_clock::local_time();
}

#define TEST(num, expression) \
double test##num(double b, long loops) \
{ \
    double x = 0.0; \
\
    boost::posix_time::ptime startTime = now(); \
    for (long i=0; i<loops; ++i) \
    { \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
        x += expression; \
    } \
    boost::posix_time::time_duration elapsed = now() - startTime; \
\
    std::cout << elapsed << " "; \
\
    return x; \
}

TEST(1, b)
TEST(2, b*b)
TEST(3, b*b*b)
TEST(4, b*b*b*b)
TEST(5, b*b*b*b*b)

template <int exponent>
double testpow(double base, long loops)
{
    double x = 0.0;

    boost::posix_time::ptime startTime = now();
    for (long i=0; i<loops; ++i)
    {
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
        x += std::pow(base, exponent);
    }
    boost::posix_time::time_duration elapsed = now() - startTime;

    std::cout << elapsed << " ";

    return x;
}

int main()
{
    using std::cout;
    long loops = 100000000l;
    double x = 0.0;
    cout << "1 ";
    x += testpow<1>(rand(), loops);
    x += test1(rand(), loops);

    cout << "\n2 ";
    x += testpow<2>(rand(), loops);
    x += test2(rand(), loops);

    cout << "\n3 ";
    x += testpow<3>(rand(), loops);
    x += test3(rand(), loops);

    cout << "\n4 ";
    x += testpow<4>(rand(), loops);
    x += test4(rand(), loops);

    cout << "\n5 ";
    x += testpow<5>(rand(), loops);
    x += test5(rand(), loops);
    cout << "\n" << x << "\n";
}

Results are:

1 00:00:01.126008 00:00:01.128338 
2 00:00:01.125832 00:00:01.127227 
3 00:00:01.125563 00:00:01.126590 
4 00:00:01.126289 00:00:01.126086 
5 00:00:01.126570 00:00:01.125930 
2.45829e+54

Note that I accumulate the result of every pow calculation to make sure the compiler doesn't optimize it away.
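To illustrate why that matters, here's a hypothetical sketch (not the benchmark code itself): if the result is never consumed, the compiler is entitled to throw the whole loop away, particularly with -fno-math-errno or -ffast-math (see Peter Cordes's comments below).

#include <cmath>
#include <iostream>

int main()
{
    const long loops = 1000000;
    double base = 1.1;
    double sum = 0.0;

    // Result discarded: an optimizing compiler may delete this loop entirely.
    for (long i = 0; i < loops; ++i)
        std::pow(base, 3);

    // Accumulating and printing the result keeps the work observable.
    for (long i = 0; i < loops; ++i)
        sum += std::pow(base, 3);

    std::cout << sum << "\n";
}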

If I use the std::pow(double, double) version, and loops = 1000000l, I get:

1 00:00:00.011339 00:00:00.011262 
2 00:00:00.011259 00:00:00.011254 
3 00:00:00.975658 00:00:00.011254 
4 00:00:00.976427 00:00:00.011254 
5 00:00:00.973029 00:00:00.011254 
2.45829e+52

This is on an Intel Core Duo running Ubuntu 9.10 64bit. Compiled using gcc 4.4.1 with -O2 optimization.

So in C, yes, x*x*x will be faster than pow(x, 3), because there is no pow(double, int) overload. In C++, it will be roughly the same. (Assuming the methodology in my testing is correct.)


This is in response to the comment made by An Markm:

Even if a using namespace std directive is issued, if the second parameter to pow is an int, then the std::pow(double, int) overload from <cmath> will be called instead of ::pow(double, double) from <math.h>.

This test code confirms that behavior:

#include <iostream>

namespace foo
{

    double bar(double x, int i)
    {
        std::cout << "foo::bar\n";
        return x*i;
    }


}

double bar(double x, double y)
{
    std::cout << "::bar\n";
    return x*y;
}

using namespace foo;

int main()
{
    double a = bar(1.2, 3); // Prints "foo::bar"
    std::cout << a << "\n";
    return 0;
}
Emile Cormier
  • does this mean that inserting a "using namespace std" chooses the C option and this will be detrimental to the runtime? – Andreas Mar 19 '13 at 18:13
  • In both your timing loops, the pow calculation probably only happens once. gcc -O2 should have no problem hoisting the loop-invariant expression out of the loop. So you're just testing how well the compiler does at turning an add-constant loop into a multiply, or just optimizing an add-constant loop. There's a reason your loops are the same speed with exponent = 1 vs. exponent = 5, even for the written-out version. – Peter Cordes Oct 02 '15 at 19:32
  • I tried it on [godbolt](https://goo.gl/ibtNL8) (with timing commented out, since godbolt doesn't have Boost installed). It surprisingly does actually call `std::pow` 8*loops times (for exponent > 2), unless you use `-fno-math-errno`. Then it can pull the pow call out of the loop, as I thought it would. I guess since errno is a global, thread safety requires that it call pow to possibly set errno multiple times... exp=1 and exp=2 are fast because the pow call is hoisted out of the loop with just `-O3`.. ([with -ffast-math](https://goo.gl/HeLEf1), it does the sum-of-8 outside the loop, too.) – Peter Cordes Oct 02 '15 at 20:21
  • I downvoted before I realized that I had -ffast-math on in the godbolt session that I was using. Even without that, testpow<1> and testpow<2> are broken, because they inline with the `pow` call hoisted out of the loop, so there's a big flaw there. Also, it looks like you're mostly testing the latency of FP addition, since all the tests run in the same amount of time. You'd expect `test5` to be slower than `test1`, but it isn't. Using multiple accumulators would split up the dependency chain and hide the latency. – Peter Cordes Oct 02 '15 at 20:30
  • @PeterCordes, where were you 5 years ago? :-) I'll try fixing my benchmark by applying `pow` to an ever-changing value (to prevent the repeated pow expression from being hoisted out). – Emile Cormier Oct 02 '15 at 21:21
  • @EmileCormier: cool. I linked here from a comment on http://stackoverflow.com/questions/32907258/performance-when-implementing-long-equations-in-c. – Peter Cordes Oct 02 '15 at 21:35
  • For what it's worth, I was trying to figure out how to speed up some performance-critical areas of my code and found that using std::pow increased the running time by over a factor of 10x. YMMV. – vmrob Apr 02 '16 at 00:43
  • @PeterCordes vote++ for **godbolt** (new one for me) and just for being Peter Cordes. – Orwellophile Jan 31 '17 at 07:50
  • @PeterCordes I finally got around to updating the benchmark code to avoid repeated calculations in the loops. I'm sure you'll find flaws in this version as well. ;-) – Emile Cormier Jun 05 '21 at 00:11
  • @vmrob My updated benchmark confirms your x10 slower experience for exponents greater than 2 and without the `-ffast-math` compiler option. I'm glad you were able to find your bottleneck. – Emile Cormier Jun 05 '21 at 00:17
  • Yes, this looks sane, although of course the loop still bottlenecks on FP-add latency unless there's an even slower throughput bottleneck involved (like an actual call to the `pow` library function, not inlined and taking advantage of the exponent being a small integer instead of the usual log/exp algorithm.) – Peter Cordes Jun 05 '21 at 00:32
  • *That's* why `clang -O3 -ffast-math` can go so much faster: it's allowed to (`-ffast-math`) and chooses to (`clang -O3`) vectorize (two multiplies / adds per mulpd / addpd) and *unroll with multiple accumulators*. (And unroll the `b += 1.0` chain with `b += 8.0`, and `b + 2.0` / `b + 4.0` forking off from that each iteration). Skylake has 4 cycle addpd latency, and 2/clock throughput for add/mul/fma operations. (You didn't use `-march=native` so no FMAs are used.) GCC does vectorize with fast-math, but keeps the loops "rolled up": https://godbolt.org/z/E7fcYfasf – Peter Cordes Jun 05 '21 at 00:34
  • I'm not sure why `gcc -O3 -ffast-math` isn't twice as fast as `gcc -O3`; it is doing 2 mul/add per add latency, instead of 1 for the pure scalar way. (Re: dep chains, see also [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) and [Simd matmul program gives different numerical results](https://stackoverflow.com/a/55478102) (multiple accumulators gives *different* rounding than the source, not worse.) – Peter Cordes Jun 05 '21 at 00:36
  • Anyway, I think the asm is actually doing all the work the C source looks like it's doing, not some algorithmic transformation. At least the inner-most loops of `test2` / `test3` etc look normal; and I think also how they inlined into `main`. Using `-fno-unroll-loops` slows down clang -O3 -ffast-math (on my system) to about 0.06 instead of 0.03. Oh, another fast-math optimization is that `b*b*b*b*b` can be evaluated as `tmp = b*b; tmp*=tmp; tmp*=b;` so only 3 multiplies for test5, not 4. And two multiplies for test4. That's exactly the kind of thing you *want* -ffast-math to allow. – Peter Cordes Jun 05 '21 at 00:47
  • @PeterCordes: I haven't needed to do assembly since university, so it's difficult for me to judge speed just by looking at the asm (unless it's plainly obvious). Even if FP-add is a bottleneck, wouldn't it be the same bottleneck for all 3 methods? I realize a good chunk of the elapsed time is loop/addition overhead, but wouldn't the results still indicate which method is fastest, even if they're not "proportional"? – Emile Cormier Jun 05 '21 at 00:52
  • Re: `gcc -O3 -ffast-math` not being a factor of 2 speedup: you didn't test that, only `-O2 -ffast-math`, and gcc only enables `-ftree-vectorize` as part of `-O3`, while clang does it at `-O2`. So your clang fast-math has a factor of 2 advantage over your GCC fast-math built-in, and clang's unrolling to hide FP latency is sufficient to explain the rest, hitting back-end or front-end throughput bottlenecks instead of latency. (And that explains why gcc didn't get much faster with -ffast-math for the `b*b*...` expression.) – Peter Cordes Jun 05 '21 at 01:01
  • @PeterCordes I thought there was no speedup from -O2 to -O3 with -ffast-math, but it appears I was mistaken. I'll check again and will update my answer. – Emile Cormier Jun 05 '21 at 01:05
  • For me to judge why the asm runs the speed it does, I'm not looking *super* carefully at the asm and fully checking everything, mostly just counting instructions, and explaining any surprised by looking more closely (e.g. to confirm that it's squaring a temporary). The other part is knowing what optimizations I expect it to do, from experience with optimization in general and looking at compiler output in lots of other cases. And knowing how fast it runs is a good guide to what optimizations it probably did.) – Peter Cordes Jun 05 '21 at 01:06
  • (-O2 vs. -O3 speedup with -ffast-math: probably you only looked at clang, which already vectorizes at -O2. GCC doesn't.) – Peter Cordes Jun 05 '21 at 01:07
  • *I realize a good chunk of the elapsed time is loop/addition overhead, but wouldn't the results still indicate which method is fastest, even if they're not "proportional"?* - maybe sort of. In this case, anything that gets down to the ~0.1 sec time proves that it's efficient. The fact that test5 is about the same speed as test2 shows that out-of-order exec is doing its job and hiding work (one, two, or three multiplies) under the FP add latency. Times like 1.4s show libm `pow` got called, doing lots of work (but with data-independent performance), with its throughput being bottleneck. – Peter Cordes Jun 05 '21 at 01:12
  • `gcc -O2`'s time bouncing around from 0.2 to 0.1 depending on the exponent is interesting. It seems GCC tripped over itself when inlining, and somehow failed to keep `x` in a register during that `testCppPow<2>(b, loops);`. So the critical path dependency chain includes a store/reload (~ 5 cycle store-forwarding latency) as well as the 4 cycle `addsd` latency. I had been wondering if a quirk of imperfect out-of-order scheduling could have been the cause, but no it's just poor code-gen. I don't think you'd expect that to happen when using `std::pow(foo, 4)` normally though. – Peter Cordes Jun 05 '21 at 01:19
  • The loop is `.L45:` on line 375 of the `gcc -O2` asm output: https://godbolt.org/z/eT3zhq6Mc. clang doesn't have any of those spill/reload oddities. (Except around calling `pow()`, because there are no call-preserved xmm registers in the x86-64 SysV calling convention, so it has to store/reload for that. But latency isn't the bottleneck there because pow is so slow.) – Peter Cordes Jun 05 '21 at 01:24
  • I've added the `gcc -O3 -ffast-math` benchmarks. It's much faster than `gcc -O2 -ffast-math` like you said it would be. :-) It's also comparable to `clang -ffast-math`. – Emile Cormier Jun 05 '21 at 01:25
25

That's the wrong kind of question. The right question would be: "Which one is easier to understand for human readers of my code?"

If speed matters (later), don't ask, but measure. (And before that, measure whether optimizing this actually will make any noticeable difference.) Until then, write the code so that it is easiest to read.

Edit
Just to make this clear (although it already should have been): breakthrough speedups usually come from things like using better algorithms, improving locality of data, reducing the use of dynamic memory, pre-computing results, etc. They rarely ever come from micro-optimizing single function calls, and where they do, they do so in very few places, which can only be found by careful (and time-consuming) profiling. More often than not, those places can only be sped up by doing very non-intuitive things (like inserting noop statements), and what's an optimization for one platform is sometimes a pessimization for another. (This is why you need to measure instead of asking: we don't fully know/have your environment.)

Let me underline this again: Even in the few applications where such things matter, they don't matter in most places they're used, and it is very unlikely that you will find the places where they matter by looking at the code. You really do need to identify the hot spots first, because otherwise optimizing code is just a waste of time.

Even if a single operation (like computing the square of some value) takes up 10% of the application's execution time (which IME is quite rare), and even if optimizing it saves 50% of the time necessary for that operation (which IME is even much, much rarer), you still only made the application take 5% less time.
Your users will need a stopwatch to even notice that. (I guess in most cases anything under 20% speedup goes unnoticed by most users, and reaching even that would mean finding four such spots.)
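Spelled out, that arithmetic is just Amdahl's law: if a fraction p = 0.10 of the run time goes to the operation and the operation becomes s = 2 times faster, the new run time is (1 - p) + p/s = 0.90 + 0.05 = 0.95 of the old one, i.e. a 5% saving.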

sbi
  • It might be the right kind of question. Maybe he is not thinking about his own practical project, but is merely interested in how the language/compiler works... – Andreas Rejbrand May 30 '10 at 21:26
  • @Andreas: The answer would still be to measure in the concrete environment he's interested in. – sbi May 30 '10 at 21:30
  • I do not think readability is an issue here. Writing x*x versus pow(x,2): both seem quite clear. – KillianDS May 30 '10 at 23:05
  • @KillianDS: If readability isn't an issue, then gnol needs to get on with coding and forget about such micro optimizations until it's time for profiling. – sbi May 30 '10 at 23:38
  • I don't entirely agree with this answer. It's a valid question to ask about performance. The best performance you can achieve is a valid requirement sometimes, and often the reason someone has used C++ rather than another language. And measuring isn't always a good idea. I might measure bubble sort and quicksort and find bubble sort faster with my 10 items because I didn't have the background to know that the number of items matters hugely, and find later on with my 1,000,000 items it was a very bad choice. – jcoder May 01 '13 at 13:46
  • This is exactly the kind of question to ask. Requesting the knowledge of an experienced developer can reveal a great deal of insight regarding compiler optimization/unrolling, macros, C/C++ standards, etc. The OP, as well as most of the viewers, already know the answer to "what is easier to read". They are both easy to read. Math operators, like the power function, are used intensely in scientific programming and will take up a very large amount of your CPU time for loops with large N. This is a standard case in programming. – Greg Nov 01 '15 at 16:39
  • @Greg No, even experienced developers do not know. They have to measure. (FWIW, I have written C++ code professionally for 20 years.) – sbi Nov 02 '15 at 08:53
23

x*x or x*x*x will be faster than pow, since pow must deal with the general case, whereas x*x is specific. Also, the function call and its overhead can be elided.

However, if you find yourself micro-optimizing like this, you need to get a profiler and do some serious profiling. The overwhelming probability is that you would never notice any difference between the two.
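To make the first point concrete, here's a minimal sketch (the function names are mine, purely for illustration):

#include <cmath>

// Specific: compiles down to a single floating-point multiply.
double square(double x)
{
    return x * x;
}

// General: with a runtime exponent, the compiler must call the library pow,
// which handles arbitrary real exponents (plus errno and edge cases).
double general(double x, double e)
{
    return std::pow(x, e);
}

With a small constant integer exponent, a good compiler can often reduce the pow call to multiplies itself, which is exactly what the benchmarks elsewhere on this page probe.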

Toby Speight
Puppy
  • I was thinking the same thing until I decided to test it. I just tested `x*x*x` vs double `std::pow(double base, int exponent)` in a timed loop and cannot see a statistically meaningful performance difference. – Emile Cormier May 30 '10 at 23:39
  • Make sure it's not getting optimized away by the compiler. – Ponkadoodle May 30 '10 at 23:50
  • @Emile: Check the code generated by the compiler. Optimizers do some tricky (and inobvious) things sometimes. Also check the performance at various optimization levels: -O0, -O1, -O2 and -O3 for example. – JUST MY correct OPINION May 31 '10 at 02:07
  • You cannot assume that generalized functions are slower. Sometimes the opposite is true because simpler code is easier for the compiler to optimize. – cambunctious Apr 21 '20 at 20:18
8

I was also wondering about the performance issue, and was hoping this would be optimised out by the compiler, based on the answer from @EmileCormier. However, I was worried that the test code he showed would still allow the compiler to optimise away the std::pow() call, since the same values were used in the call every time, which would allow the compiler to store the result and re-use it across the loop - this would explain the almost identical run-times for all cases. So I had a look into it too.

Here's the code I used (test_pow.cpp):

#include <iostream>                                                                                                                                                                                                                       
#include <cmath>
#include <chrono>

class Timer {
  public:
    explicit Timer () : from (std::chrono::high_resolution_clock::now()) { }

    void start () {
      from = std::chrono::high_resolution_clock::now();
    }

    double elapsed() const {
      // Convert the microsecond tick count to seconds.
      return std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - from).count() * 1.0e-6;
    }

  private:
    std::chrono::high_resolution_clock::time_point from;
};

int main (int argc, char* argv[])
{
  double total;
  Timer timer;



  total = 0.0;
  timer.start();
  for (double i = 0.0; i < 1.0; i += 1e-8)
    total += std::pow (i,2);
  std::cout << "std::pow(i,2): " << timer.elapsed() << "s (result = " << total << ")\n";

  total = 0.0;
  timer.start();
  for (double i = 0.0; i < 1.0; i += 1e-8)
    total += i*i;
  std::cout << "i*i: " << timer.elapsed() << "s (result = " << total << ")\n";

  std::cout << "\n";

  total = 0.0;
  timer.start();
  for (double i = 0.0; i < 1.0; i += 1e-8)
    total += std::pow (i,3);
  std::cout << "std::pow(i,3): " << timer.elapsed() << "s (result = " << total << ")\n";

  total = 0.0;
  timer.start();
  for (double i = 0.0; i < 1.0; i += 1e-8)
    total += i*i*i;
  std::cout << "i*i*i: " << timer.elapsed() << "s (result = " << total << ")\n";


  return 0;
}

This was compiled using:

g++ -std=c++11 [-O2] test_pow.cpp -o test_pow

Basically, the difference is that the argument to std::pow() is the loop counter. As I feared, the difference in performance is pronounced. Without the -O2 flag, the results on my system (Arch Linux 64-bit, g++ 4.9.1, Intel i7-4930) were:

std::pow(i,2): 0.001105s (result = 3.33333e+07)
i*i: 0.000352s (result = 3.33333e+07)

std::pow(i,3): 0.006034s (result = 2.5e+07)
i*i*i: 0.000328s (result = 2.5e+07)

With optimisation, the results were equally striking:

std::pow(i,2): 0.000155s (result = 3.33333e+07)
i*i: 0.000106s (result = 3.33333e+07)

std::pow(i,3): 0.006066s (result = 2.5e+07)
i*i*i: 9.7e-05s (result = 2.5e+07)

So it looks like the compiler does at least try to optimise the std::pow(x,2) case, but not the std::pow(x,3) case (it takes ~40 times longer than the std::pow(x,2) case). In all cases, manual expansion performed better - but particularly for the power 3 case (60 times quicker). This is definitely worth bearing in mind if running std::pow() with integer powers greater than 2 in a tight loop...
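If you want to see what the compiler actually did in cases like this (a quick sketch using the same compiler flags as above; -S just stops after generating assembly), inspecting the output is more direct than timing:

g++ -std=c++11 -O2 -S -o - test_pow.cpp | less

A call to pow inside the loop body means the library routine is being used; bare multiply instructions (mulsd on x86-64) mean the call was expanded inline.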

jdtournier
5

The most efficient way is exponentiation by squaring: the repeated squaring grows the power exponentially, so only O(log q) multiplications are needed. Check this code for p^q:

template <typename T>
T expt(T p, unsigned q){
    T r = 1;
    while (q != 0) {
        if (q % 2 == 1) {    // if q is odd, fold one factor of p into the result
            r *= p;
            q--;
        }
        p *= p;              // square the base
        q /= 2;              // halve the exponent
    }
    return r;
}
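For example, a quick sanity check (this assumes the expt template above is in scope):

#include <iostream>

int main()
{
    std::cout << expt(2.0, 10) << "\n";  // 1024
    std::cout << expt(3, 5u) << "\n";    // 243
}

For q = 10 this needs about six multiplications, versus nine for the naive x*x*...*x.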
mohaghighat
2

If the exponent is constant and small, expand it out, minimizing the number of multiplications. (For example, x^4 is not optimally x*x*x*x, but y*y where y=x*x. And x^5 is y*y*x where y=x*x. And so on.) For constant integer exponents, just write out the optimized form directly; with small exponents, this is a standard optimization that should be performed whether the code has been profiled or not. The optimized form will be quicker in such a large percentage of cases that it's basically always worth doing.
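A minimal sketch of those expansions (the helper names are mine, purely for illustration):

inline double pow4(double x) { double y = x * x; return y * y; }      // 2 multiplies instead of 3
inline double pow5(double x) { double y = x * x; return y * y * x; }  // 3 multiplies instead of 4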

(If you use Visual C++, std::pow(float,int) performs the optimization I allude to, whereby the sequence of operations is related to the bit pattern of the exponent. I make no guarantee that the compiler will unroll the loop for you, though, so it's still worth doing it by hand.)
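(For instance, the exponent 13 is 1101 in binary, so x^13 = x^8 * x^4 * x can be built from the successive squarings x, x^2, x^4, x^8 using five multiplies instead of twelve.)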

[edit] BTW, pow has an (un)surprising tendency to crop up in profiler results. If you don't absolutely need it (i.e., the exponent is large or not a constant), and you're at all concerned about performance, then it's best to write out the optimal code and wait for the profiler to tell you it's (surprisingly) wasting time before thinking further. (The alternative is to call pow and have the profiler tell you it's (unsurprisingly) wasting time -- you're cutting out this step by doing it intelligently.)

caf
1

I have been busy with a similar problem, and I'm quite puzzled by the results. I was calculating x⁻³/² for Newtonian gravitation in an n-bodies situation (the acceleration undergone from another body of mass M situated at a distance vector d): a = M G d (d²)⁻³/² (where d² is the dot (scalar) product of d with itself), and I thought calculating M*G*pow(d2, -1.5) would be simpler than M*G/d2/sqrt(d2).

The trick is that this is true for small systems, but as systems grow in size, M*G/d2/sqrt(d2) becomes more efficient, and I don't understand why the size of the system impacts this result, because repeating the operation on different data does not. It is as if there were possible optimizations as the system grows, but which are not possible with pow.

[Figure: run time of both expressions as the system size grows]

Camion
  • Numpy acts on whole arrays at a time, so each separate operation generates a whole intermediate array (with a lot of memory allocation). Hence `M*G` generates `tmp1`, `tmp1/d2` generates `tmp2`, `sqrt(d2)` generates `tmp3`, `tmp2/tmp3` generates the result. Just calling `pow` skips one of those temps. Thus it really isn't relevant to a question about C functions. Try using `numba` or `numexpr` to help avoid that overhead. – DavidW Oct 28 '21 at 11:35