
My program spends 90% of CPU time in the std::pow(double,int) function. Accuracy is not a primary concern here, so I was wondering if there were any faster alternatives. One thing I was thinking of trying is casting to float, performing the operation and then back to double (haven't tried this yet); I am concerned that this is not a portable way of improving performance (don't most CPUs operate on doubles intrinsically anyway?)

Cheers

quant
  • Kinda depends what powers you are calculating -- please show some code and/or describe your data. – paddy May 28 '13 at 01:57
  • Hardware is faster than software in the general case, that's kind of the point of `pow`... you can't beat it unless you can place additional restrictions on what you're doing. – user541686 May 28 '13 at 01:58
  • This article may be useful: http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/ – Shafik Yaghmour May 28 '13 at 01:58
  • What about `2^(y*log2(x))`? See: http://stackoverflow.com/questions/4638473/how-to-powreal-real-in-x86 – Derek May 28 '13 at 02:01
  • Thanks. This will keep me busy for a while. Will do some profiling and report back. – quant May 28 '13 at 02:06
  • This question has already been asked [here](http://stackoverflow.com/q/9652549/1277769) – SirGuy May 28 '13 at 02:07
  • @GuyGreer: at a glance - that question seems more focused on just how to do it for its own sake, rather than doing it to outperform `std::pow`. – Tony Delroy May 28 '13 at 02:21
  • Ah yes, you're right. My memory was a little faulty on that question, however my [answer](http://stackoverflow.com/a/9824880/1277769) to that question resulted in an approximation faster than std::pow. – SirGuy May 28 '13 at 02:25
  • Casting to float and back certainly won't help. The CPU does all FP in double precision. – user207421 May 28 '13 at 02:36
  • If you know the range of numbers you intend to raise to a power, and if you are using a limited number of powers, then you can get excellent optimization by storing the pre-calculated values into an array and simply fetching them by index using the exponent as the index. – Scott Lahteine Aug 19 '16 at 22:01
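The lookup-table suggestion in the last comment above can be sketched like this (the `PowTable` name and the [-12, 12] range are illustrative, not from the comment):

```cpp
#include <array>
#include <cmath>

// Sketch of the precomputed-table idea: if the base is fixed and the
// exponents fall in a known small range, pay for pow() once per entry
// at construction and do an O(1) array fetch thereafter.
struct PowTable {
  static constexpr int kMin = -12;
  static constexpr int kMax = 12;
  std::array<double, kMax - kMin + 1> table;

  explicit PowTable(double base) {
    for (int n = kMin; n <= kMax; ++n)
      table[n - kMin] = std::pow(base, n);  // one-time cost
  }
  // Fetch base^n by indexing with the (shifted) exponent.
  double operator()(int n) const { return table[n - kMin]; }
};
```

Whether this wins depends on the table staying hot in cache and the exponents genuinely being bounded.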

4 Answers


It looks like Martin Ankerl has a few articles on this. Optimized Approximative pow() in C / C++ is one, and it has two fast versions; one is as follows:

inline double fastPow(double a, double b) {
  union {
    double d;
    int x[2];
  } u = { a };
  // Manipulate the high 32 bits of the double directly; the magic
  // constant is derived from the IEEE-754 exponent bias. Assumes
  // little-endian, IEEE-754 doubles.
  u.x[1] = (int)(b * (u.x[1] - 1072632447) + 1072632447);
  u.x[0] = 0;  // zero the low word
  return u.d;
}

which relies on type punning through a union, which is undefined behavior in C++; from the draft standard, section 9.5 [class.union]:

In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time. [...]

but most compilers, including gcc, support this with well-defined behavior:

The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type

but this is not universal, as this article points out; and, as I point out in my answer here, using memcpy should generate identical code without invoking undefined behavior.
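As a sketch of that memcpy variant (the `fastPowSafe` name is mine, not from the article; it keeps the same magic constant and the same little-endian, IEEE-754 assumption as the union version):

```cpp
#include <cstdint>
#include <cstring>

// Same bit-manipulation trick as fastPow, but the type punning is done
// through memcpy, which is well defined in both C and C++. Optimizing
// compilers typically lower these memcpy calls to plain register moves.
inline double fastPowSafe(double a, double b) {
  std::int32_t x[2];
  std::memcpy(x, &a, sizeof x);     // read the bits of 'a'
  x[1] = (std::int32_t)(b * (x[1] - 1072632447) + 1072632447);
  x[0] = 0;                         // zero the low word, as before
  double d;
  std::memcpy(&d, x, sizeof d);     // write the bits back
  return d;
}
```

Like the original, this is an approximation: expect errors of several percent, e.g. `fastPowSafe(2.0, 2.0)` lands near 4 but not exactly on it.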

He also links to a second one Optimized pow() approximation for Java, C / C++, and C#.

The first article also links to his microbenchmarks here.

Shafik Yaghmour

Depending on what you need to do, operating in the log domain might work — that is, you replace all of your values with their logarithms; multiplication becomes addition, division becomes subtraction, and exponentiation becomes multiplication. But now addition and subtraction become expensive and somewhat error-prone operations.
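A minimal sketch of the idea (the `LogDomain` name is illustrative, not from the answer): store log(x) once, and exponentiation becomes a single multiply plus one exp() at the end.

```cpp
#include <cmath>

// Log-domain arithmetic sketch: a value is represented by its natural
// logarithm, so pow becomes multiply and multiply becomes add.
struct LogDomain {
  double lg;  // natural log of the represented value
  explicit LogDomain(double v) : lg(std::log(v)) {}

  // x^e in the log domain: scale the stored logarithm.
  LogDomain pow(double e) const {
    LogDomain r(*this);
    r.lg = e * lg;
    return r;
  }
  // Multiplication becomes addition of logs.
  LogDomain operator*(const LogDomain& o) const {
    LogDomain r(*this);
    r.lg = lg + o.lg;
    return r;
  }
  // Leave the log domain; this is where the exp() cost is paid.
  double value() const { return std::exp(lg); }
};
```

The trade-off the answer mentions is real: plain addition of two log-domain values requires leaving the domain (or a log-sum-exp style trick), which is both slower and a source of rounding error.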

hobbs

How big are your integers? Are they known at compile time? It's far better to compute x^2 as x*x as opposed to pow(x,2). Note: Almost all applications of pow() to an integer power involve raising some number to the second or third power (or the multiplicative inverse in the case of negative exponents). Using pow() is overkill in such cases. Use a template for these small integer powers, or just use x*x.

If the integers are small, but not known at compile time, say between -12 and +12, multiplication will still beat pow() and won't lose accuracy. You don't need eleven multiplications to compute x^12. Four will do. Use the fact that x^(2n) = (x^n)^2 and x^(2n+1) = x*((x^n)^2). For example, x^12 is ((x*x*x)^2)^2. Two multiplications to compute x^3 (x*x*x), one more to compute x^6, and one final one to compute x^12.
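The square-and-multiply scheme described above can be sketched for a runtime integer exponent (`ipow` is an illustrative name, not from the answer):

```cpp
// Exponentiation by squaring, using the identities from the answer:
// x^(2n) = (x^n)^2 and x^(2n+1) = x * (x^n)^2.
// For x^12 this performs four multiplications instead of eleven.
double ipow(double x, int n) {
  if (n < 0)
    return 1.0 / ipow(x, -n);  // negative exponent: take the reciprocal
  double result = 1.0;
  while (n > 0) {
    if (n & 1)
      result *= x;  // this bit of the exponent is set
    x *= x;         // square the base for the next bit
    n >>= 1;
  }
  return result;
}
```

For exponents known at compile time, the same recurrence can be unrolled by hand (or via a template) so the compiler sees straight-line multiplies.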

David Hammen
  • Of course, this assumes that ausairman is working with integers. It's unclear whether that's the case. – jamesdlin May 28 '13 at 02:22
  • @jamesdlin: Of course he is. *My program spends 90% of CPU time in the std::pow(double,int) function.* – David Hammen May 28 '13 at 06:26
  • Oops, sorry. You're right; I think my brain was on holiday today. >_ – jamesdlin May 28 '13 at 08:59
  • This answer is wrong with recent optimizers. std::pow(x, {2, 3, 4})... compiles to exactly the same code as the corresponding multiplications. – Jean-Michaël Celerier Oct 18 '16 at 14:36
  • @Jean-MichaëlCelerier [ICC on godbolt](https://godbolt.org/g/NniqWc) does this quite well. But GCC and Clang only do it for 2 and call pow for bigger exponents if you don't specify `-Ofast`. – Christoph Diegelmann Feb 21 '17 at 07:46
  • If using `-O3 -ffast-math` in GCC (4.1+, probably earlier), even `powf(x, 100.f)` optimizes to multiply instructions. – Vortico Jun 01 '19 at 21:59
  • @Vortico: `-O3 -ffast-math` does lots of bad things as well as that one good thing. I want that one good thing without the bad things. Fortran, an ancient language, has done this one good thing before it was called Fortran (it previously went by FORTRAN). The problem with `pow` and `exp` is that for some values, it can take a surprisingly huge amount of time to chase down that last half an ULP of precision. And sometimes, those surprising values that take a long time involve raising a value `x` to a small integral exponent. – David Hammen Jun 01 '19 at 23:03
  • @DavidHammen I would have said to use `-fassociative-math`, but that doesn't seem to be enough. The most specific I can find is `-funsafe-math-optimizations`, which is very reasonable for most applications. For the "last half an ULP" issue, you should enable flush-to-zero and denormals-are-zero mode. https://scc.ustc.edu.cn/zlsc/tc4600/intel/2017.0.098/compiler_c/common/core/GUID-1659EAE1-583E-44EE-BDEA-7C68C46061C7.html – Vortico Jun 04 '19 at 14:15

Yes! This is very fast if you only need the exponent 'n' as a long/int, which lets you avoid the slow FPU FSCALE instruction. It is Agner Fog's hand-optimized x86 version for integer exponents. I upgraded it to __fastcall/__declspec(naked) for speed/size and used ECX to pass 'n' (floats are always passed on the stack in 32-bit MSVC++), so my tweaks are minor; it is mostly Agner's work. It was tested/debugged/compiled on MS Visual C++ 2005 Express/Pro, so it should slip into newer versions without trouble. Accuracy against the universal CRT pow() function is very good.

extern double __fastcall fs_power(double x, long n);

// Raise 'x' to the power 'n' (INT-only) in ASM by the great Agner Fog!

__declspec(naked) double __fastcall fs_power(double x, long n) { __asm {
    MOV  EAX, ECX     ;// Move 'n' to eax
;// abs(n) is calculated by inverting all bits and adding 1 if n < 0:
    CDQ               ;// Get sign bit into all bits of edx
    XOR  EAX, EDX     ;// Invert bits if negative
    SUB  EAX, EDX     ;// Add 1 if negative. Now eax = abs(n)
    JZ   RETZERO      ;// End if n = 0
    FLD1              ;// ST(0) = 1.0 (FPU push1)
    FLD  QWORD PTR [ESP+4] ;// Load 'x' : ST(0) = 'x', ST(1) = 1.0 (FPU push2)
    JMP  L2           ;// Jump into loop
L1: ;// Top of loop
    FMUL ST(0), ST(0) ;// Square x
L2: ;// Loop entered here
    SHR  EAX, 1       ;// Get each bit of n into carry flag
    JNC  L1           ;// No carry. Skip multiplication, goto next
    FMUL ST(1), ST(0) ;// Multiply by x squared i times for bit # i
    JNZ  L1           ;// End of loop. Stop when eax = 0
    FSTP ST(0)        ;// Discard ST(0) (FPU Pop1)
    TEST EDX, EDX     ;// Test if 'n' was negative
    JNS  RETPOS       ;// Finish if 'n' was positive
    FLD1              ;// ST(0) = 1.0, ST(1) = x^abs(n)
    FDIVR             ;// Reciprocal
RETPOS:    ;// Finish, success!
    RET  4            ;// (FPU Pop2 occurs in compiler-generated code on assignment)

RETZERO:
    FLDZ   ;// Ret 0.0, fail, if n was 0
    RET  4
}}
John Doe