8

Is there a fast way to take 2.0 to some floating-point power x? I mean something faster than pow(2.0, x), and preferably something that vectorizes well with AVX2.

The counterpart for integers is 1<<n, but it works for integer n only.

Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
  • 1
    Just a quick thought, how about doing `1< – Malice Aug 23 '17 at 10:34
  • I doubt there's a faster way for `2 powered to x` in floating point. The integer version makes use of the way it's stored to speed up pow(2,x), and I doubt you will get such advantages on floating-point representations – Malice Aug 23 '17 at 10:36
  • @Malice, the problem is that `x` is not integer, it's floating-point. – Serge Rogatch Aug 23 '17 at 10:36
  • 1
    For integer, this is just writing the exponent term. The fractional part needs a log-mapping (or inverse) to the non-exponent part. But really, if a large speedup were possible, x^y = 2^(lg x * y); thus the most speedup you can hope for is eliminating one base-2 log and a multiplication. How many powers of 2 are you hoping to do? Millions? – Yakk - Adam Nevraumont Aug 23 '17 at 10:37
  • And, how much accuracy do you need (or are willing to give up for speed) – Yakk - Adam Nevraumont Aug 23 '17 at 10:38
  • @Yakk, 0.4 billion per second using 4-at-once vectorization. The accuracy should be 40 bits of mantissa at least. – Serge Rogatch Aug 23 '17 at 10:38
  • 1
    Knowing the IEEE-754 you could do the bit manipulation yourself... https://www.h-schmidt.net/FloatConverter/IEEE754.html – Tommy Andersen Aug 23 '17 at 10:38
  • @SergeRogatch Aah , now it makes sense – Malice Aug 23 '17 at 10:38
  • A naive approach using `float` can be found here: https://ideone.com/Bv0Y6V I do not know if it is faster though. But obviously that is for integer powers. – Tommy Andersen Aug 23 '17 at 10:52
  • Have you read https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures ? – Yakk - Adam Nevraumont Aug 23 '17 at 11:48
  • @SergeRogatch: Vectorization might be a bit optimistic. The chief problem I see are the IEEE754 edge cases, like `pow(2.0, -INF)`. – MSalters Aug 23 '17 at 13:26
  • You can use the implementation from the VCL but with a slightly shorter polynomial and still have 40 bits left over. – harold Aug 23 '17 at 14:19
  • Handling edge cases can be vectorized. If one is working with a known range of input data, certain edge cases may not apply and **NOT** checking for them will definitely be faster than the generic implementation of the runtime library. – Khouri Giordano Aug 23 '17 at 16:11

3 Answers

12

There is a standard std::exp2(double n)

Computes 2 raised to the given power n

It is possible that exp2(x) is not faster than pow(2.0, x) in a particular environment, but it is more specific than the general pow.

DAle
  • 8,990
  • 2
  • 26
  • 45
  • 5
    On a recent linux where glibc includes libmvec, `g++ -Ofast` manages to vectorize `std::exp(x*std::log(2))` but not `std::exp2(x)`, strangely. – Marc Glisse Aug 23 '17 at 14:04
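Putting the answer and that comment together as a compilable sketch (the `std::exp(x * ln2)` rewrite is the comment's observation; its result can differ from `exp2` in the last ulp):

```cpp
#include <cassert>
#include <cmath>

// 2^x via the dedicated standard function.
double pow2(double x) { return std::exp2(x); }

// Mathematically equivalent rewrite that, per the comment above, g++ -Ofast
// can vectorize through libmvec where std::exp2 is missed; the result may
// differ from exp2 in the last ulp.
double pow2_vec(double x) { return std::exp(x * 0.6931471805599453); }
```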
3

For integer powers, you can use std::ldexp:

 double x = std::ldexp(1.0, k); // 2 to the power of k

This is better than 1<<k and casting as it won't have intermediate overflow, and supports negative powers as well.
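A quick illustration of both points:

```cpp
#include <cassert>
#include <cmath>

// ldexp scales its first argument by an exact power of two: no intermediate
// overflow for large k (unlike 1 << k), and negative k works too.
double big   = std::ldexp(1.0, 100);  // 2^100 -- would overflow a 64-bit shift
double small = std::ldexp(1.0, -3);   // 2^-3 = 0.125
```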

Simon Byrne
  • 7,694
  • 1
  • 26
  • 50
  • The challenge is to take 2 to a floating-point power `x`, which is not an integer `k`. – Serge Rogatch Aug 23 '17 at 18:32
  • If `k` was an integer and you don't need to check for out-of-range, you'd want to manually stuff the biased exponent into an IEEE 754 `double` with a bit-shift and an addition (to bias the exponent and set the mantissa with the same operation). That would auto-vectorize well. – Peter Cordes Aug 26 '17 at 01:00
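A scalar sketch of what that comment describes (my illustration, not the comment's code; it assumes -1022 <= k <= 1023 and does no range checking):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// 2^k for integer k by writing the biased IEEE 754 exponent field directly.
// Sketch only: assumes k stays in the normal-exponent range, no checks.
double pow2_int(int64_t k) {
    uint64_t bits = (uint64_t)(k + 1023) << 52;  // bias, shift into exponent
    double d;
    std::memcpy(&d, &bits, sizeof d);            // reinterpret the bits
    return d;
}
```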
2

TL;DR: split the integer and fraction parts, use a Taylor expansion to calculate the 2^fraction part, add that back onto the integer part, then use the convert-to-int magic trick many have mentioned. The Taylor expansion is slow because of a long dependency chain calculating the successive powers, but the process only requires 3 registers, meaning you can do 4 vectors in parallel to allow pipelining.

I know this is an old question, but it never got a good answer (or even any answer that actually satisfied the original questioner's requirements). The best lead was a link in the comments to a paper that explains the Taylor series and the convert-to-int trick, and details a way to use a polynomial to adjust the fraction part you already have into the 2^fract you need, but it was complicated and didn't include the actual numbers to make it work.

The original questioner wanted to do 400,000,000 2^x's per second with at least 40 bits of precision. My requirements are a bit different, mainly that I don't need that much precision, so I haven't tried it with doubles yet. But I am getting 1,730,000,000 2^x's per second with 16-bit minimum precision in a single thread on the 10-year-old 3rd-gen i7 mobile chip in my laptop, and I'm confident that a quarter of that rate would be doable for double precision and twice the terms in the series, on the same hardware.

The Taylor series calculates e^x with greater accuracy each step, and looks like this:

1 + x + x²/2! + x³/3! + x⁴/4! + ...

To get it to calculate any other base's power, multiply the exponent by ln(base). In other words:

N^x = e^(x*ln(N))

or for us,

2^x = e^(x*ln(2))

If you use values of x that are large, it takes a long time to converge, but since we have a trick for the integer part we will never be dealing with x ≥ 1.0. We get 16 bits of precision in 6 terms (up to x⁶) and 40 bits in 12 terms, with more precision the closer to 0 we go.
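As a sanity check of that claim (my own grid test, not part of the original answer), the worst relative error of the 6-term series for e^(x·ln 2) against exp2 over x in [0, 1) does land around 2⁻¹⁶:

```cpp
#include <cassert>
#include <cmath>

// Worst relative error of the 6-term Taylor series for e^(x*ln2), measured
// against std::exp2 on a grid over [0, 1).
double worst_rel_error() {
    double worst = 0.0;
    for (int i = 0; i < 1000; ++i) {
        double x = i / 1000.0;
        double t = x * 0.6931471805599453;  // ln(2)
        double s = 1 + t + t*t/2 + t*t*t/6 + t*t*t*t/24
                 + t*t*t*t*t/120 + t*t*t*t*t*t/720;
        double err = std::fabs(s - std::exp2(x)) / std::exp2(x);
        if (err > worst) worst = err;
    }
    return worst;
}
```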

The basic process looks like this:

vector register x, y, z;   /// x is our input, y will become our output
y = floor(x);    /// 3 round toward -infinity
x -= y;          /// 3 split integer and fraction
x *= ln(2);      /// 5 ln(2) = 0.693147180559945...
y += x;          /// summing the terms direct into the integer part
z = x * x;       /// 5
y += z * (1/2);  /// this is a single fma instruction but ...
z *= x;          /// 5
y += z * (1/6);  /// if your processor lacks it, the working register ...
z *= x;          /// 5
y += z * (1/24); /// can be shared between the 4 parallel calculation chains
z *= x;          /// 5
y += z * (1/120); /// the broadcasted constants can be shared too
z *= x;           /// 5
y += z * (1/720); /// 8 I stopped here but you could easily carry on.
y *= 2^(mantissa bits);  /// 5        8388608.0 for float, 
                         /// 4503599627370496.0 for double   
y = convert-float-to-int(y); /// 3 creates an integer where the original
                             /// integer part appears where the exponent bits
                             /// of a float value would be, and our added
                             /// fraction part becomes the mantissa, gaining
                             /// the 1.0 it was missing in the process
y = integer-addition(y, interpret-bits-as-integer(1.0)); /// 3
                             /// add the exponent bias to complete our result
                             /// this can be done as a float addition before
                             /// the convert-to-int, but bits of the precision
                             /// of the result get lost, because all of the
                             /// mantissa and exponent bits have to fit into the
                             /// mantissa before conversion

Each line converts to one vector instruction (or two on older processors without FMA), but many have to wait for the result of an earlier step to complete (instruction latency; the numbers after some lines indicate they are part of the chain and how many cycles must be waited for their result), and this chain limits the minimum processing time for this 6-term version to 55 cycles. But the floating-point hardware can actually work much faster than this: the 5 stages of addition that make up a 24-bit multiply can all be working on different values. With 16 registers in the AVX register file, 4 of these calculation chains can be coded to run together in a single thread, allowing 4 vectors of results to be calculated in the same time ... about 55 clock cycles.
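Translated into compilable scalar C++ (my sketch, not the answer's code: float precision, six terms, real constants like `1.0f / 6` rather than the pseudocode's integer ratios, and no range or edge-case checks, so it assumes roughly -126 < x < 128):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the pipeline above; each line maps onto one vector
// instruction in an AVX2 version.
float fast_exp2(float x) {
    float y = std::floor(x);            // round toward -infinity
    float f = x - y;                    // fraction part, in [0, 1)
    float t = f * 0.693147180559945f;   // scale so that e^t == 2^f
    float z = t;
    y += t;                             // sum the series onto the integer part
    z *= t; y += z * (1.0f / 2);
    z *= t; y += z * (1.0f / 6);
    z *= t; y += z * (1.0f / 24);
    z *= t; y += z * (1.0f / 120);
    z *= t; y += z * (1.0f / 720);
    // Convert-to-int trick: after scaling by 2^23 the integer part lands in
    // the exponent field and the series sum becomes the mantissa.
    int32_t bits = (int32_t)(y * 8388608.0f);
    bits += 0x3F800000;                 // bit pattern of 1.0f: adds the bias
    float r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}
```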

Using doubles would halve the number of results per chain, but doubling the number of terms doesn't double the time taken, because the starting steps and finishing steps are the same. It results in a latency chain of about 85 cycles, so in theory 560,000,000 40-bit-precise 2^double results should be possible in a single thread on a third-gen, 3 GHz Intel processor (including reading the input and writing the output). Newer processors have decreased the latency of multiplication a little, but the latency of rounding and float-to-int conversion has gone up, so it remains about the same overall.

Jeremy
  • 21
  • 1
  • Yup, polynomial approximation over the mantissa range, and stuffing the integer part into the exponent, is generally the way to go. But for polynomial evaluation, you normally want Horner's rule so it's just a chain of FMAs. See [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/q/48863719) for a working implementation in actual C with intrinsics. In your pseudocode, `1/6` is `0` as a C expression, unlike `1.f / 6`. – Peter Cordes Apr 15 '22 at 13:22