61

I have hot spots in my code where pow() is taking up around 10-20% of my execution time.

My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:

  • I have two constant exponents: 2.4 and 1/2.4.
  • When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
  • When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
  • I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.

I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

phuclv
Cory Nelson
  • How accurate does it have to be? – In silico Jun 25 '11 at 02:20
  • Is `pow` a native CPU instruction? Otherwise I could only imagine that `pow(a,b)` computes `exp(b * log(a))`, and so your exponent isn't all that constant after all. Well, it'll be great to see the answers! – Kerrek SB Jun 25 '11 at 02:26
  • @Kerrek: `pow` is not generally a native CPU instruction, but generally a library function built from `exp` and `log` with a ton of special cases for `x^2` and `x^-1` etc. – Dietrich Epp Jun 25 '11 at 03:07
  • @silico 16 bits is probably enough for me. – Cory Nelson Jun 25 '11 at 03:33
  • Given that you're clearly doing a gamma correction, I assume you also know something a priori about the range of values that the first argument can take on; what is it? Also, what platforms are you targeting? – Stephen Canon Jun 27 '11 at 02:28
  • @Stephen Thanks for the tips. Question updated. – Cory Nelson Jun 27 '11 at 10:39
  • @CoryNelson Could you post your final solution for (1/2.4 and 2.4) as an answer? – Lilith River Mar 23 '16 at 20:51

10 Answers

34

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebyshev terms. It's best to keep the order odd, as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.

#include <stdlib.h>
#include <math.h>

// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
   double x)
{
   static const int N = 8;

   // Chebychev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const double Cn[N] = { 
       1.1758200232996901923,
       0.16665763094889061230,
      -0.0083154894939042125035,
       0.00075187976780420279038,
      // Wolfram alpha doesn't want to compute the remaining terms
      // to more precision (it times out).
      -0.0000832402,
       0.0000102292,
      -1.3401e-6,
       1.83334e-7};

   double Tn[N];

   double u = 2.0*x - 3.0;

   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }   

   double y = 0.0;
   for (int ii = N-1; ii >= 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }   

   return y;
}


// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
   double x)
{
   static const double pow2_512[12] = {
      1.0,
      pow(2.0, 5.0/12.0),
      pow(4.0, 5.0/12.0),
      pow(8.0, 5.0/12.0),
      pow(16.0, 5.0/12.0),
      pow(32.0, 5.0/12.0),
      pow(64.0, 5.0/12.0),
      pow(128.0, 5.0/12.0),
      pow(256.0, 5.0/12.0),
      pow(512.0, 5.0/12.0),
      pow(1024.0, 5.0/12.0),
      pow(2048.0, 5.0/12.0)
   };

   double s;
   int iexp;

   s = frexp (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 12);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 12;
   }

   return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}

Addendum: What's going on here?
Per request, the following explains how the above code works.

Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)

The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:

  1. x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
  2. s^(5/12) is calculated via pow512norm(s).
  3. (2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
  4. 2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
  5. (2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
  6. (2^r)^(5/12) is calculated by the lookup table pow2_512.
  7. Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
  8. That is exactly what the math library function ldexp does.
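
To make the reduction concrete, here is a small hand-worked check (my own illustration, not part of the original answer), assuming the two functions above are in the same translation unit:

// Hand-worked trace for x = 0.25:
//   frexp(0.25) -> s = 0.5, iexp = -1;  after s *= 2, iexp -= 1:  s = 1.0, iexp = -2
//   div(-2, 12) -> quot = 0, rem = -2;  after the fix-up:          quot = -1, rem = 10
//   result = pow512norm(1.0) * pow2_512[10] * 2^(5 * -1)
//          = 1 * 2^(50/12) * 2^(-60/12) = 2^(-10/12) = 0.25^(5/12)
#include <stdio.h>

int main ()
{
   printf ("pow512(0.25) = %.9f, pow(0.25, 5./12.) = %.9f\n",
           pow512 (0.25), pow (0.25, 5.0/12.0));
   return 0;
}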

Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.

I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young), you need to invest a bit more conservatively. Those downswings can wreak havoc on your retirement portfolio.

Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.

Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.

I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)

pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.

I used Wolfram Alpha to find the Chebyshev coefficients Cn for me. For example, entering the integral above with n=1 calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram Alpha is free, but there is a limit on how much computation it will do, and it hits that limit on the higher order terms. (If you buy or have access to Mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
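
If you would rather not lean on Wolfram Alpha, the same coefficients can also be computed offline with Gauss-Chebyshev quadrature: substituting u = cos(theta) turns the integral above into a plain cosine sum. A rough sketch (my own addition; 64 nodes is more than enough for this smooth integrand):

// Computes Cn[n] ~ (2/pi) * integral of f(u)*T_n(u)/sqrt(1-u^2) over [-1,1]
// for f(u) = ((u+3)/2)^(5/12), via the substitution u = cos(theta).
#include <stdio.h>
#include <math.h>

int main ()
{
   const int N = 8, M = 64;              // 8 coefficients, 64 quadrature nodes
   const double pi = acos (-1.0);
   double Cn[N] = {0};
   for (int k = 0; k < M; ++k) {
      double theta = pi * (k + 0.5) / M; // Chebyshev nodes u = cos(theta)
      double f = pow ((cos (theta) + 3.0) / 2.0, 5.0 / 12.0);
      for (int n = 0; n < N; ++n) {
         Cn[n] += (2.0 / M) * f * cos (n * theta);
      }
   }
   Cn[0] *= 0.5;                         // zeroth term uses 1/pi rather than 2/pi
   for (int n = 0; n < N; ++n) {
      printf ("Cn[%d] = %.17g\n", n, Cn[n]);
   }
   return 0;
}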

The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)

pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

David Hammen
  • Man, I wish I understood the math that went into this. Would this also work for x^2.4? – Cory Nelson Jun 25 '11 at 22:53
  • @Cory: Regarding the math, I'll add some details tomorrow. Regarding x^2.4: Conceptually, yes. The Chebyshev polynomials span the one-dimensional real function space over R[-1,1] (a Hilbert space). That said, the number of terms needed to represent x^2.4 is going to be considerably more than the number needed to represent x^(5/12). However, there is no need to compute x^2.4 directly. All that is needed is the ability to extract the fifth root, x^(1/5), and this will require even fewer Chebyshev terms than does x^(5/12). With x^(1/5) at hand, x^2.4 = (x*x^(1/5))^2. – David Hammen Jun 26 '11 at 05:07
  • A little note: there has been some work done at ENS Lyon regarding the computation of the best float/double coefficients for the polynomial knowing that the polynomial is evaluated with float/double arithmetics. Chebyshev coefficients are only "best" for real computations. I do not know how usable this stuff is. http://perso.ens-lyon.fr/nicolas.brisebarre/Publi/final-bmt-toms.ps – Pascal Cuoq Jun 26 '11 at 21:03
  • @Pascal: To be completely pedantic, Chebyshev coefficients aren't "best" for real computations; the minimax polynomial is. That said, as David noted in his answer, for well-behaved functions the two are typically very close. – Stephen Canon Jun 26 '11 at 21:30
  • Thanks for the update. Very cool. I'm implementing this with SSE to see how it fairs compared to my current code, will post back when I'm done. – Cory Nelson Jun 26 '11 at 23:47
  • I dont think pow512 properly handles the case x=0. It would be trivial to special case it though. – Michael Anderson Jun 27 '11 at 03:30
  • @Michael: Good point. To make this a general purpose function, it would need to handle x=0 (result is 0), x<0 (result is NaN and throw domain error or raise FPE) as special cases. It would also have to handle NaNs and Infs. – David Hammen Jun 27 '11 at 10:11
  • Computing Chebyshev coefficients can be done quickly via FFT, instead of numerical integration, so this process can be automated for other values than 5/12. [@Pascal: great read. Glad to see people still do great work in ENS Lyon, where I graduated :)] – Alexandre C. Jun 30 '11 at 19:46
  • More helpful information is to be found in ["Numerical Recipes: The Art of Scientific Computing, 3rd ed, p233"](http://books.google.ca/books?id=1aAOdzK3FegC&lpg=PA241&ots=3iOmFaHrjf&dq=Polynomial%20Approximation%20from%20Chebyshev%20Coefficients&pg=PA233#v=onepage&q&f=true). – mrkj Jan 12 '12 at 03:18
24

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.

It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )

exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
        __m128 ret = arg;
//      std::printf( "arg = %,vg\n", ret );
        // Apply a constant pre-correction factor.
        ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
                * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//      std::printf( "scaled = %,vg\n", ret );
        // Reinterpret arg as integer to obtain logarithm.
        asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "log = %,vg\n", ret );
        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//      std::printf( "powered = %,vg\n", ret );
        // Convert back to "integer" to exponentiate.
        asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "result = %,vg\n", ret );
        return ret;
}
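
The two inline-asm conversions can also be written with the intrinsic forms that one of the comments below points out, which is friendlier to compilers that handle asm constraints poorly. A sketch of the same function using intrinsics (functionally equivalent, assuming SSE2; the name fastpow_intrin is mine):

#include <emmintrin.h>
#include <cmath>

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow_intrin( __m128 arg ) {
        // Apply the same constant pre-correction factor as above.
        __m128 ret = _mm_mul_ps( arg, _mm_set1_ps( std::exp2( 127. * expden / expnum - 127. )
                * std::pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
        // Reinterpret the float bits as integers, then convert: approximate log.
        ret = _mm_cvtepi32_ps( _mm_castps_si128( ret ) );
        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
        // Convert to integers, then reinterpret the bits as float: approximate exp.
        return _mm_castsi128_ps( _mm_cvtps_epi32( ret ) );
}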

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

The key is to match the overestimate with an underestimate, and take the average.

  • Compute x^0.8: four instructions, error ~ +3%.
  • Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
  • Compute x^0.4: one mulps.
  • Compute x^-0.2: one rsqrtps.
  • Compute x^2: one mulps.
  • Compute x^3: one mulps.
  • x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
  • x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
  • Average the above: one addps, one mulps.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
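
As a quick sanity check of that arithmetic (my own, not part of the original post): a^0.5 = 1.5 * a^-0.75 implies a^1.25 = 1.5, i.e. a = 1.5^0.8.

#include <cstdio>
#include <cmath>

int main() {
        // a^0.5 = 1.5 * a^-0.75  =>  a^1.25 = 1.5  =>  a = 1.5^(1/1.25) = 1.5^0.8
        std::printf( "a = %.8f\n", std::pow( 1.5, 0.8 ) );  // prints 1.38316186
}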

The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

int main() {
        __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
        std::printf( "Input: %,vg\n", x0 );

        // Approx 5% accuracy from one call. Always an overestimate.
        __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
        std::printf( "Direct x^2.4: %,vg\n", x1 );

        // Lower exponents provide lower initial error, but too low causes overflow.
        __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
        std::printf( "1.38 x^0.8: %,vg\n", xf );

        // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
        __m128 xfm4 = _mm_rsqrt_ps( xf );
        __m128 xf4 = _mm_mul_ps( xf, xfm4 );

        // Precisely calculate x^2 and x^3
        __m128 x2 = _mm_mul_ps( x0, x0 );
        __m128 x3 = _mm_mul_ps( x2, x0 );

        // Overestimate of x^2 * x^0.4
        x2 = _mm_mul_ps( x2, xf4 );

        // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
        __m128 xfm2 = _mm_rsqrt_ps( xf4 );
        x3 = _mm_mul_ps( x3, xfm4 );
        x3 = _mm_mul_ps( x3, xfm2 );

        std::printf( "x^2 * x^0.4: %,vg\n", x2 );
        std::printf( "x^3 / x^0.6: %,vg\n", x3 );
        x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
        // Final accuracy about 0.015%, 200x better than x^0.8 calculation.
        std::printf( "average = %,vg\n", x2 );
}

Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .


Update with stats

I implemented a little test harness and two x^(5/12) cases corresponding to the above.

#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
    __m128 ret = arg;
//  std::printf( "arg = %,vg\n", ret );
    // Apply a constant pre-correction factor.
    ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
        * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//  std::printf( "scaled = %,vg\n", ret );
    // Reinterpret arg as integer to obtain logarithm.
    asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "log = %,vg\n", ret );
    // Multiply logarithm by power.
    ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//  std::printf( "powered = %,vg\n", ret );
    // Convert back to "integer" to exponentiate.
    asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "result = %,vg\n", ret );
    return ret;
}

__m128 pow125_4( __m128 arg ) {
    // Lower exponents provide lower initial error, but too low causes overflow.
    __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );

    // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
    __m128 xfm4 = _mm_rsqrt_ps( xf );
    __m128 xf4 = _mm_mul_ps( xf, xfm4 );

    // Precisely calculate x^2 and x^3
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 x3 = _mm_mul_ps( x2, arg );

    // Overestimate of x^2 * x^0.4
    x2 = _mm_mul_ps( x2, xf4 );

    // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
    __m128 xfm2 = _mm_rsqrt_ps( xf4 );
    x3 = _mm_mul_ps( x3, xfm4 );
    x3 = _mm_mul_ps( x3, xfm2 );

    return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}

__m128 pow512_2( __m128 arg ) {
    // 5/12 is too small, so compute the sqrt of 10/12 instead.
    __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
    return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}

__m128 pow512_4( __m128 arg ) {
    // 5/12 is too small, so compute the 4th root of 20/12 instead.
    // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
    // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
    __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
    __m128 xover = _mm_mul_ps( arg, xf );

    __m128 xfm1 = _mm_rsqrt_ps( xf );
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 xunder = _mm_mul_ps( x2, xfm1 );

    // sqrt2 * over + 2 * sqrt2 * under
    __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
                                _mm_add_ps( xover, xunder ) );

    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    return xavg;
}

__m128 mm_succ_ps( __m128 arg ) {
    return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}

void test_pow( double p, __m128 (*f)( __m128 ) ) {
    __m128 arg;

    for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
            ! isfinite( _mm_cvtss_f32( f( arg ) ) );
            arg = mm_succ_ps( arg ) ) ;

    for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
            arg = mm_succ_ps( arg ) ) ;

    std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );

    int n;
    int const bucket_size = 1 << 25;
    do {
        float max_error = 0;
        double total_error = 0, cum_error = 0;
        for ( n = 0; n != bucket_size; ++ n ) {
            float result = _mm_cvtss_f32( f( arg ) );

            if ( ! isfinite( result ) ) break;

            float actual = ::powf( _mm_cvtss_f32( arg ), p );

            float error = ( result - actual ) / actual;
            cum_error += error;
            error = std::abs( error );
            max_error = std::max( max_error, error );
            total_error += error;

            arg = mm_succ_ps( arg );
        }

        std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
                    max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
    } while ( n == bucket_size );
}

int main() {
    std::printf( "4 insn x^12/5:\n" );
    test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
    std::printf( "14 insn x^12/5:\n" );
    test_pow( 12./5, & pow125_4 );
    std::printf( "6 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_2 );
    std::printf( "14 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_4 );
}

Output:

4 insn x^12/5:
Domain from 1.36909e-23
error max =      inf    avg =      inf  |avg| =      inf    to 8.97249e-19
error max =  2267.14    avg =  139.175  |avg| =  139.193    to 5.88021e-14
error max = 0.123606    avg = -0.000102963  |avg| = 0.0371122   to 3.85365e-09
error max = 0.123607    avg = -0.000108978  |avg| = 0.0368548   to 0.000252553
error max =  0.12361    avg = 7.28909e-05   |avg| = 0.037507    to  16.5513
error max = 0.123612    avg = -0.000258619  |avg| = 0.0365618   to 1.08471e+06
error max = 0.123611    avg = 8.70966e-05   |avg| = 0.0374369   to 7.10874e+10
error max =  0.12361    avg = -0.000103047  |avg| = 0.0371122   to 4.65878e+15
error max = 0.123609    avg =      nan  |avg| =      nan    to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max =      inf    avg =      nan  |avg| =      nan    to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05    |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05   |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06    |avg| = 0.000129923 to  2.63411
error max = 0.000787589 avg = 1.20745e-05   |avg| = 0.000129347 to   172629
error max = 0.000786553 avg = 1.62351e-05   |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06   |avg| = 0.00013037  to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339   avg = 0.000441158   |avg| = 0.00967327  to 6.46235e-27
error max = 0.0284342   avg = -5.79938e-06  |avg| = 0.00897913  to 4.23516e-22
error max = 0.0284341   avg = -0.000140706  |avg| = 0.00897084  to 2.77556e-17
error max = 0.028434    avg = 0.000440504   |avg| = 0.00967325  to 1.81899e-12
error max = 0.0284339   avg = -6.11153e-06  |avg| = 0.00897915  to 1.19209e-07
error max = 0.0284298   avg = -0.000140597  |avg| = 0.00897084  to 0.0078125
error max = 0.0284371   avg = 0.000439748   |avg| = 0.00967319  to      512
error max = 0.028437    avg = -7.74294e-06  |avg| = 0.00897924  to 3.35544e+07
error max = 0.0284369   avg = -0.000142036  |avg| = 0.00897089  to 2.19902e+12
error max = 0.0284368   avg = 0.000439183   |avg| = 0.0096732   to 1.44115e+17
error max = 0.0284367   avg = -7.41244e-06  |avg| = 0.00897923  to 9.44473e+21
error max = 0.0284366   avg = -0.000141706  |avg| = 0.00897088  to 6.1897e+26
error max = 0.485129    avg = -0.0401671    |avg| = 0.048422    to 4.05648e+31
error max = 0.994932    avg = -0.891494 |avg| = 0.891494    to 2.65846e+36
error max = 0.999329    avg =      nan  |avg| =      nan    to       -0
14 insn x^5/12:
Domain from 2.64698e-23
error max =  0.13556    avg = 0.00125936    |avg| = 0.00354677  to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06   |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06  |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06    |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06   |avg| = 0.000113713 to       32
error max = 0.000565453 avg = -1.61276e-06  |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06   |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06   |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06  |avg| = 0.000112415 to 1.84467e+19

I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.

Potatoswatter
  • Going to give this a try as well. Looks like it should be faster than David's, and a ~.015% error is probably acceptable. – Cory Nelson Jun 26 '11 at 23:55
  • @Cory: Let me know how it turns out. What you see here is the entirety of the testing I did. – Potatoswatter Jun 27 '11 at 00:35
  • @Cory: I added a little timing info and fixed a bug with `-O3` in the inline assembly. (The source and dest operands were reversed… I guess.) – Potatoswatter Jun 27 '11 at 01:22
  • @Cory: If a 0.015% error is acceptable, just reduce N in my solution from 8 to 4 (and delete the last four terms of Cn). Chebyshev polynomials have a very deep mathematical basis. This solution has an ad-hoc basis. To be blunt, this solution looks like crap. – David Hammen Jun 27 '11 at 01:30
  • @David: To each his own. Really I'm only trying to do better than the other "FP-hack" answer, which truly is crap. But I don't see how ad-hoc usage of the tools at hand is worse than enlightened usage of the deep fundamentals… it's just cleverness vs. beauty. – Potatoswatter Jun 27 '11 at 01:46
  • @David Will do. I appreciate all the answers and will be updating my question with a comparison when I finish testing them. – Cory Nelson Jun 27 '11 at 10:51
  • @Cory: Updated with the other exponent, a test harness, and tweaked coefficients. – Potatoswatter Jun 28 '11 at 05:57
  • I had rolled my own x^(1/2.4), but yours ended up having higher precision. Could you explain how you determine the overflow errors, weighting coefficients, and average multipliers? – Cory Nelson Jun 28 '11 at 11:06
  • The cvtdq2ps/cvtps2dq can be implemented with intrinsics too btw, `_mm_cvtepi32_ps(_mm_castps_si128(x))` and `_mm_castsi128_ps(_mm_cvtps_epi32(x))`. – Cory Nelson Jun 28 '11 at 11:08
  • @Cory: Ah, in `emmintrin.h`. Nice. I'm not sure what you mean by determine the overflow errors. The constant `exp2` calculation will overflow single-precision if its argument is >127, i.e. the power is less than 1/2. You could compensate by making the coefficient very small. I haven't tried that. I simply assumed the relative amount of error is proportional to the power, i.e. each `rsqrt` halves it. This might not be the best approximation, hmmm… The weighting coefficient "a" is set to plug equal-but-opposite errors into the average, a^(p-q) = -p/q, avg mult = 1/(a^p + a^q). – Potatoswatter Jun 28 '11 at 18:56
  • Oops, that should be a^(p-q) = -q/p. And I added another fudge coefficient to make the average error close to zero, but on second thought it would be better to tweak the weight instead, and to minimize the maximum error instead of the average. – Potatoswatter Jun 28 '11 at 19:06
20

Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:

Pow is basically implemented using logs: pow(a,b) = x^(log_x(a)*b). So we need a fast log and fast exponent - it doesn't matter what x is, so we use 2. The trick is that a floating point number is already in a log style format:

a = M * 2^E

Taking the log of both sides gives:

log2(a)=log2(M)+E

or more simply:

log2(a)~=E

In other words if we take the floating point representation of a number, and extract the Exponent we've got something that's a good starting point as its log. It turns out that when we do this by massaging the bit patterns, the Mantissa ends up giving a good approximation to the error, and it works pretty well.

This should be good enough for simple lighting calculations, but if you need something better, you can then extract the Mantissa, and use that to calculate a quadratic correction factor which is pretty accurate.
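
A minimal sketch of that idea (my own illustration, not Ian Stephenson's code): reinterpret the float's bits as an integer to get a cheap log2, multiply by the power, and reinterpret back. It assumes IEEE 754 single precision and positive finite inputs, and it omits the quadratic mantissa correction mentioned at the end:

#include <cstdint>
#include <cstring>

float fastpow_bits(float a, float b) {
    std::int32_t i;
    std::memcpy(&i, &a, sizeof i);
    // bits / 2^23 - 127 is a crude, piecewise-linear approximation of log2(a).
    float log2a = i * (1.0f / (1 << 23)) - 127.0f;
    float e = b * log2a;
    // Invert the approximation: bits = (e + 127) * 2^23.
    std::int32_t r = static_cast<std::int32_t>((e + 127.0f) * (1 << 23));
    float out;
    std::memcpy(&out, &r, sizeof out);
    return out;
}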

PengOne
  • Neat. A fine idea to use the internal format of floats! Does the standard library by any chance already take advantage of this? – Kerrek SB Jun 25 '11 at 02:37
  • Nice idea. Might be prettier to use the standard [frexp function](http://pubs.opengroup.org/onlinepubs/9699919799/functions/frexp.html) to extract the exponent and mantissa, though... Any decent compiler should be able to implement frexp as a bit extraction (plus offset), so it would gain portability without losing performance. – Nemo Jun 25 '11 at 02:43
  • Cool! I know I'd seen this before, but have since forgotten about it. That is a very nice trick for getting a fast approximation of log2. – Mikola Jun 25 '11 at 03:08
  • This kind of optimization doesn't account for the constant argument, so it doesn't really answer the question. This particular implementation isn't very good… aside from failing to use `frexp`, the usual way of doing this is to count-leading-zeroes and shift the number, to move the exponent into the mantissa and then use the zero count to create a new exponent. The bits shifted within the mantissa approximate the result using `2^x ≈ x, 1 < x < 2`. – Potatoswatter Jun 25 '11 at 03:11
  • +1 for a nice link. I might be able to optimize generic pow with this. Like @potatoswatter says though, it doesn't exactly answer the question. – Cory Nelson Jun 25 '11 at 03:36
  • Aha! I was halfway down this very path -- squaring the mantissa to improve the piecewise approximation that float makes of log and exp -- when I punked out and decided to search for something more straightforward. Other answers in here made me realise that my own solution wasn't so bad, and now I don't have to figure out the magic numbers! – sh1 Dec 31 '14 at 07:30
17

First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).

#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
  double cbrtx = cbrt(x);
  return cbrtx*sqrt(sqrt(cbrtx));
}
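
For a single-precision SIMD pipeline like the one in the question, the same identity carries over to the float math functions. A sketch of the obvious float analogue (cbrtf and sqrtf are C99/C++11), which is the kind of version the comments below compare against:

#include <math.h>
float xpow512f (float x) {
  float cbrtx = cbrtf(x);             // x^(1/3)
  return cbrtx*sqrtf(sqrtf(cbrtx));   // x^(1/3) * x^(1/12) = x^(5/12)
}
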
David Hammen
  • float/double is often the same speed (well, except for div/sqrt) for a single value, but I'm actually doing this in SIMD -- a case where float will generally have twice the throughput. Very interesting answer though, will definitely be trying this. – Cory Nelson Jun 25 '11 at 03:41
  • +1 Because this is both correct and actually captures the value of "2.4" better than the float literal. (But this one only works for the `pow(a, 1/2.4f)` version...). – Kerrek SB Jun 25 '11 at 10:36
  • For an alternative technique, see my other answer. That uses Chebychev polynomials. It is blazingly fast, but it is a bit more verbose than this two-liner. – David Hammen Jun 25 '11 at 17:01
  • Using `float` is actually significantly faster than using `double` on a platform that has a well-written math library. Just to provide some hard numbers, I ran a small experiment: calling single-precision `powf` on OSX 10.6 / Core2 Duo is nearly twice as fast as the implementation you provide in this answer. – Stephen Canon Jun 27 '11 at 02:23
  • @Stephen: That's a bit of a red herring. A single primitive float operation is not significantly faster than the corresponding double operation. What makes powf so much faster than pow is that powf is only going after about 6-7 decimal digits of precision while pow goes after 15-16. If you wanted to make a fair comparison you would have made a float version of this function that calls cbrtf instead of cbrt, sqrtf instead of sqrt. – David Hammen Jun 27 '11 at 03:14
  • @David: Of course, but that doesn't make it a red herring. The questioner is asking how to implement a fairly complex function, not a single primitive operation. Even for primitive operations, one gets twice as much data into cache, and potentially twice as much SIMD parallelism (assuming hand-tuned code) using float instead of double. – Stephen Canon Jun 27 '11 at 03:26
  • I've tested the code in the answer and unfortunately it happened to be 1.3 slower than std::pow version both in Clang and GCC. https://quick-bench.com/q/Q2Mw6vv0JERYDvnZZwKlcNoly0g – fdermishin Jan 10 '23 at 10:07
  • Also, cbrt is now available in cmath as of c++11 and gcc 4.7.1 – fdermishin Jan 10 '23 at 10:07
15

This might not answer your question.

The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.

If this is the case, try using a lookup table. Something like:

__attribute__((aligned(64)))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64)))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };

void apply_lut(const unsigned short lut[256], unsigned char *src, ...

If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.

(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
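
For illustration, here is one way such a table could be filled offline (my own sketch using the standard sRGB transfer function; the table name is hypothetical, and the original answer deliberately leaves the contents elided):

#include <stdint.h>
#include <math.h>

static uint16_t srgb_to_linear_lut[256];

void init_srgb_to_linear(void) {
    for (int i = 0; i < 256; ++i) {
        double s = i / 255.0;
        // Standard sRGB decoding: linear segment below 0.04045, power curve above.
        double lin = (s <= 0.04045) ? s / 12.92 : pow((s + 0.055) / 1.055, 2.4);
        srgb_to_linear_lut[i] = (uint16_t)lround(lin * 65535.0);
    }
}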

Dietrich Epp
  • You caught me ;), I'm doing sRGB gamma compression/decompression. My input/output needs to be float, or I'd definitely be using a LUT. – Cory Nelson Jun 25 '11 at 03:44
  • Hm, interesting... do you mind explaining why it needs to be float? – Dietrich Epp Jun 25 '11 at 03:50
  • These conversions are smack in the middle of a larger floating-point pipeline. I suppose I could scale the input to an integer for an index into a float LUT of desired granularity, but large LUTs don't play well with SIMD. (plus I'd like to see what could be done without LUT first, just for future reference and because it's neat ;) – Cory Nelson Jun 25 '11 at 06:17
  • Heck, small LUTs don't play well with SIMD (except "ultra-small" tables which fit in a word). I just have a hard time imagining what you'd do with floating point sRGB data, since you can't composite it, you can't do HDR, you can't do filters, etc. – Dietrich Epp Jun 25 '11 at 06:31
  • You convert it into Y'CbCr :) – Cory Nelson Jun 25 '11 at 06:37
  • Well, Y'V12 specifically. Requires yet more scaling due to chroma being half-size, and unfortunately the scaling must be on gamma-compressed values. – Cory Nelson Jun 25 '11 at 06:51
  • yup. linear sRGB -> gamma-compressed sRGB -> Y'CbCr -> 2x downsize CbCr -> integer YV12. – Cory Nelson Jun 25 '11 at 07:32
  • Yeah, typically 16 bits is seen as enough for the whole pipeline there (if 8 isn't ok). – Dietrich Epp Jun 25 '11 at 08:07
  • Would this SIMD happen to be an NVidia GPU? In which case I remember seeing a way to use the texturing unit as a LUT, using a float as an index. Free interpolation, even. – MSalters Jun 25 '11 at 13:41
  • @MSalters: For some definition of "free"... doesn't interpolation come with increased memory access on GPUs? – Dietrich Epp Jun 25 '11 at 20:00
  • Obviously, yes, but that's unavoidable - you need at least two adjacent values for interpolation. It's free in the sense that you don't need to write code. – MSalters Jun 27 '11 at 07:47
  • May I ask why you're aligning the look-up table? What benefit does this have? – rdb Nov 10 '14 at 11:07
  • @rdb: The look-up table is (1) simple (2) accurate (3) fast. – Dietrich Epp Nov 10 '14 at 12:17
  • I understand that, but my question was why the code in this answer explicitly aligns the look-up table to 64 bytes. – rdb Nov 10 '14 at 12:19
  • @rdb: 64 bytes is a common size for a cache line. – Dietrich Epp Nov 10 '14 at 12:34
4

I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.

inline double poly7(double x, double a, double b, double c, double d,
                              double e, double f, double g, double h) {
    double ab, cd, ef, gh, abcd, efgh, x2, x4;
    x2 = x*x; x4 = x2*x2;
    ab = a*x + b; cd = c*x + d;
    ef = e*x + f; gh = g*x + h;
    abcd = ab*x2 + cd; efgh = ef*x2 + gh;
    return abcd*x4 + efgh;
}

inline double srgb_to_linear(double x) {
    if (x <= 0.04045) return x / 12.92;

    // Polynomial approximation of ((x+0.055)/1.055)^2.4.
    return poly7(x, 0.15237971711927983387,
                   -0.57235993072870072762,
                    0.92097986411523535821,
                   -0.90208229831912012386,
                    0.88348956209696805075,
                    0.48110797889132134175,
                    0.03563925285274562038,
                    0.00084585397227064120);
}

inline double linear_to_srgb(double x) {
    if (x <= 0.0031308) return x * 12.92;

    // Piecewise polynomial approximation (divided by x^3)
    // of 1.055 * x^(1/2.4) - 0.055.
    if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
                                      1224.97114922729451791383,
                                      -100.23413743425112443219,
                                         6.60361150127077944916,
                                         0.06114808961060447245,
                                        -0.00022244138470139442,
                                         0.00000041231840827815,
                                        -0.00000000035133685895) / (x*x*x);

    return poly7(x, -0.18730034115395793881,
                     0.64677431008037400417,
                    -0.99032868647877825286,
                     1.20939072663263713636,
                     0.33433459165487383613,
                    -0.01345095746411287783,
                     0.00044351684288719036,
                    -0.00000664263587520855) / (x*x*x);
}

And the sollya input used to generate the polynomials:

suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));

s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;

p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));

p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
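
A quick usage sketch (mine, not part of the answer; it assumes the functions above are in scope): spot-check linear_to_srgb against the exact formula, where the relative error should stay well under the quoted 0.0144% worst case.

#include <cstdio>
#include <cmath>

int main() {
    const double xs[] = {0.004, 0.05, 0.18, 0.5, 1.0};
    for (double x : xs) {
        double approx = linear_to_srgb(x);
        double exact  = (x <= 0.0031308) ? x * 12.92
                                         : 1.055 * std::pow(x, 1.0 / 2.4) - 0.055;
        std::printf("x = %-5g  relative error = %.3g\n",
                    x, std::fabs(approx - exact) / exact);
    }
}
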
orlp
3

The binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a.) You'll have to do some analysis to decide how many terms you need for your desired accuracy.
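
A minimal sketch of the idea (my own, not part of this answer): a truncated binomial series for (1+t)^a with a fixed exponent a, valid for |t| < 1. Convergence slows as |t| approaches 1, which matches the observation in the comments below that small inputs need many terms.

// Truncated binomial series: (1 + t)^a = sum over k of C(a,k) * t^k, for |t| < 1.
double binom_pow(double t, double a, int terms) {
    double sum = 1.0, coef = 1.0, tk = 1.0;
    for (int k = 1; k < terms; ++k) {
        coef *= (a - (k - 1)) / k;  // generalized binomial coefficient, built incrementally
        tk   *= t;
        sum  += coef * tk;
    }
    return sum;
}

For example, binom_pow(x - 1.0, 2.4, 12) approximates x^2.4 for x near 1.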

zvrba
  • Now this one is interesting. The accuracy depends heavily on how small _x_ is. While it performs fantastically for larger numbers (>0.1), for the small numbers I sometimes need to work with (0.04 and 0.003) I needed so many iterations that the `pow()` approximation I linked to in the question is significantly faster. – Cory Nelson Jun 25 '11 at 10:18
  • You might also want to look at function approximation using chebyshev polynomials: http://en.wikipedia.org/wiki/Chebyshev_polynomials – zvrba Jun 25 '11 at 10:48
  • @Cory Nelson: The Binomial series is just the special case of a Taylor series around x=1. You can also calculate the Taylor series around x=0. This looks far less pretty on paper with lots of α terms, but since you _know_ α that's no big deal. – MSalters Jun 25 '11 at 13:37
  • I did exactly that in my second answer (see below). This second result blazingly fast, very accurate throughout. @Cory: Chebychev polynomials are an incredibly powerful tool for function approximation, often much better than Taylor series. – David Hammen Jun 25 '11 at 16:34
1

So traditionally powf(x, p) = x^p is computed by rewriting x as x = 2^(log2(x)), making powf(x, p) = 2^(p*log2(x)) and transforming the problem into two approximations, exp2() and log2(). This has the advantage of working with larger powers p; however, the downside is that it is not the optimal solution for a constant power p over a specified input bound 0 ≤ x ≤ 1.

When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:

float pow12_5(float x){
    float mp;
    // Minimax Horner polynomials for x^(12/5). Note: choose the accuracy required, then implement with fma() [Fused Multiply Accumulates]
    // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
    mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4))));  // 2.6079002e-4
    // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4))));  // 8.61377e-5
    // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4))))));  // 3.524655e-5
    return(mp);
}

However when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m where m=1,2,3 is a positive integer, making the new power approximation p > 1 but this introduces division which is inherently slower.

There's however another option which is to decompose the input x as its floating point exponent and mantissa form:

x = mx * 2^(ex) where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
  = mx^(5/12) * 2^(k/12) * 2^(ey)

The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:

#include <stdint.h>  // uint32_t
#include <math.h>    // floorf

float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0,  0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
    union{float f; uint32_t u;} v, e2;
    float poff, m, e, ei;
    int xe;

    v.f = x;
    xe = ((v.u >> 23) - 127);

    if(xe < -127) return(0.0f);

    // Calculate remainder k in 2^(k/12) to find LUT
    e = xe * (5.0f/12.0f);
    ei = floorf(e);
    poff = powk_12LUT[(int)(12.0f * (e - ei))];

    e2.u = ((int)ei + 127) << 23;   // Calculate the exponent
    v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero

    // Approximate mx^(5/12) on [1,2), with appropriate degree minimax
    // m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4));    // 7.6125e-4
    // m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8));  // 8.4522726e-5
    m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8))));   // 1.04091259e-5
    // m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8))));    // 1.367401e-6

    return(m * poff * e2.f);
}
nimig18
1

For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lerp, or perhaps use a higher-order function to fill in the in-between values if the table wasn't accurate enough (basically a huge log table).

Or, compute value squared * value to the 2/5ths: take the initial squared value from the first half of the expression and then 5th root it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, you're probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.

Michael Dorgan
1

The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
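
A minimal sketch of that incremental scheme (my own illustration; the names are made up): given y = x^n, the derivative n*x^(n-1) equals n*y/x, so the cheap update needs no extra pow call.

#include <cmath>

// Tracks y = x^n across a stream of nearby x values using a tangent-line step.
struct PowTracker {
    double n, x, y;  // exponent, last base, last result y = x^n
    PowTracker(double exponent, double x0)
        : n(exponent), x(x0), y(std::pow(x0, exponent)) {}
    double next(double x_new) {
        double delta = x_new - x;
        y += delta * n * y / x;  // first-order update: d(x^n)/dx = n*x^(n-1) = n*y/x
        x = x_new;
        return y;
    }
};

Since the error grows with the accumulated delta, in practice you would re-seed with an exact pow every so often, which is exactly the amortization the answer describes.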

Permaquid