
Suppose that correctly rounded standard library functions such as found in CRlibm are available. Then how would one compute the correctly rounded cubic root of a double-precision input?

This question is not an “actual problem that [I] face”, to quote the FAQ. It is a little bit like homework this way. But the cubic root is a frequently found operation and one could imagine this question being an actual problem that someone faces.

Since “best Stack Overflow questions have a bit of source code in them”, here is a bit of source code:

  y = pow(x, 1. / 3.);

The above does not compute a correctly rounded cubic root because 1/3 is not representable exactly as a double.


ADDITIONAL NOTES:

An article describes how to compute a floating-point cubic root, but the last iteration(s) of the recommended Newton-Raphson algorithm would have to be done to higher precision for the algorithm to compute a correctly rounded double-precision cubic root. That may be the best way to compute it, but I am still looking for a shortcut that would take advantage of existing correctly rounded standardized functions.
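As a sketch of such a Newton-Raphson scheme in ordinary double precision (so, per the above, the final iterations would still need higher precision to guarantee correct rounding; the function name and iteration count are mine):

```c
#include <math.h>

/* Newton-Raphson for f(x) = x^3 - a: x := x - (x^3 - a) / (3*x^2).
   Quadratic convergence: the number of valid digits roughly doubles
   per iteration. Sketch only; not correctly rounded. */
double newton_cbrt(double a, double x0, int iters)
{
    double x = x0;
    for (int i = 0; i < iters; i++)
        x = x - (x * x * x - a) / (3.0 * x * x);
    return x;
}
```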

C99 includes a cbrt() function, but it cannot be expected to be correctly rounded or even faithful for all compilers. CRlibm's designers could have chosen to include cbrt() in the list of provided functions, but they didn't. References to implementations available in other libraries of correctly rounded math functions are welcome.

Pascal Cuoq
  • Like square root, a correctly-rounded cube root does not require very much extra precision. (The first time I implemented `cbrtf` it turned out to be correctly-rounded by accident). – Stephen Canon Aug 05 '13 at 17:12
  • @StephenCanon … but I have no hope of exhaustively testing a double-precision cubic root function I would have written to check whether I did not make it correctly rounded by accident. It's 3*2^52 inputs after argument reduction just for normal inputs, if I understand Ken Turkowski's argument correctly. – Pascal Cuoq Aug 05 '13 at 17:18
  • Yeah, more of an anecdote than an answer =). – Stephen Canon Aug 05 '13 at 17:20
  • check the third method: http://en.wikipedia.org/wiki/Cube_root#Numerical_methods – Khaled.K Nov 23 '14 at 07:05

4 Answers


I am afraid I do not know how to guarantee a correctly rounded double-precision cube root, but I can offer one that is very nearly correctly rounded, as also mentioned in the question. In other words, the maximum error appears to be very close to 0.5 ulp.

Peter Markstein, "IA-64 and Elementary Functions: Speed and Precision" (Prentice-Hall 2000)

presents efficient FMA-based techniques for correctly rounding reciprocal, square root, and reciprocal square root, but it does not cover the cube root in this regard. In general Markstein's approach requires a preliminary result that is accurate to within 1 ulp prior to the final rounding sequence. I do not have the mathematical wherewithal to extend his technique to the rounding of cube roots, but it seems to me that this should be possible in principle, being a challenge somewhat similar to the reciprocal square root.

Bit-wise algorithms lend themselves easily to the computation of roots with correct rounding. Since tie cases for the IEEE-754 rounding mode to-nearest-or-even cannot occur, one simply needs to carry the computation until it has produced all mantissa bits plus one round bit. The bit-wise algorithm for square root, based on the binomial theorem, is well known in both non-restoring and restoring variants and has been the basis of hardware implementations. The same approach via binomial theorem works for the cube root, and there is a little-known paper that lays out the details of a non-restoring implementation:

H. Peng, “Algorithms for Extracting Square Roots and Cube Roots,” Proceedings 5th IEEE International Symposium on Computer Arithmetic, pp. 121-126, 1981.

Best I can tell from experimenting with it, this works well enough for the extraction of cube roots from integers. Since each iteration produces only a single result bit, it is not exactly fast. For applications in floating-point arithmetic it has the drawback of using a couple of bookkeeping variables that require roughly twice the number of bits of the final result. This means one would need 128-bit integer arithmetic to implement a double-precision cube root.
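To illustrate the digit-by-digit idea on integers, here is a restoring one-bit-per-iteration variant for 32-bit operands (essentially the classic routine from Hacker's Delight, not Peng's non-restoring scheme): each iteration commits one result bit by testing whether the next candidate cube still fits.

```c
#include <stdint.h>

/* Restoring bit-wise cube root of a 32-bit unsigned integer,
   truncated toward zero; one result bit per iteration. Uses the
   binomial expansion (2y+1)^3 = 8y^3 + 12y^2 + 6y + 1 implicitly:
   b is the amount the cube grows if the new bit is set. */
uint32_t icbrt(uint32_t x)
{
    uint32_t y = 0;
    for (int s = 30; s >= 0; s -= 3) {
        y = 2 * y;
        uint32_t b = (3 * y * (y + 1) + 1) << s;
        if (x >= b) {
            x -= b;
            y = y + 1;
        }
    }
    return y;
}
```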

My C99 code below is based on Halley's rational method for the cube root, which has cubic convergence, meaning that the initial approximation does not have to be very accurate, as the number of valid digits triples in every iteration. The computation can be arranged in various ways. Generally it is numerically advantageous to arrange iterative schemes as

new_guess := old_guess + correction

since for a sufficiently close initial guess, correction is significantly smaller than old_guess. This leads to the following iteration scheme for the cube root:

x := x - x * (x^3 - a) / (2*x^3 + a)

This particular arrangement is also listed in Kahan's notes on cube root. It has the further advantage of lending itself naturally to the use of FMA (fused multiply-add). One drawback is that the computation of 2*x^3 could lead to overflow, therefore an argument reduction scheme is required for at least part of the double-precision input domain. In my code I simply apply the argument reduction to all non-exceptional inputs, based on straightforward manipulation of the exponents of IEEE-754 double-precision operands.

The interval [0.125,1) serves as the primary approximation interval. A polynomial minimax approximation is used that returns an initial guess in [0.5,1]. The narrow range facilitates the use of single-precision arithmetic for the low-accuracy portions of the computation.

I cannot prove anything about the error bounds of my implementation, however, testing with 200 million random test vectors against a reference implementation (accurate to about 200 bits) found a total of 277 incorrectly rounded results (so an error rate of roughly 1.4 ppm) with a maximum error of 0.500012 ulps.

double my_cbrt (double a)
{
    double b, u, v, r;
    float bb, uu, vv;
    int e, f, s;

    if ((a == 0.0) || isinf(a) || isnan(a)) {
        /* handle special cases */
        r = a + a;
    } else {
        /* strip off sign-bit */
        b = fabs (a);
        /* compute exponent adjustments */
        b = frexp (b, &e);
        s = e - 3*342;
        f = s / 3;
        s = s - 3 * f;
        f = f + 342;
        /* map argument into the primary approximation interval [0.125,1) */
        b = ldexp (b, s);
        bb = (float)b;
        /* approximate cube root in [0.125,1) with relative error 5.22e-3 */
        uu =                0x1.2f32c0p-1f;
        uu = fmaf (uu, bb, -0x1.62cc2ap+0f);
        uu = fmaf (uu, bb,  0x1.7546e0p+0f);
        uu = fmaf (uu, bb,  0x1.5d0590p-2f);
        /* refine cube root using two Halley iterations w/ cubic convergence */
        vv = uu * uu;
        uu = fmaf (fmaf (vv, uu, -bb) / fmaf (vv, 2.0f*uu, bb), -uu, uu);
        u = (double)uu;
        v = u * u; // this product is exact
        r = fma (fma (v, u, -b) / fma (v, 2.0*u, b), -u, u);
        /* map back from primary approximation interval by jamming exponent */
        r = ldexp (r, f);
        /* restore sign bit */
        r = copysign (r, a);
    }
    return r;
}
njuffa

Given that there are lots of easily computable rational points on the curve x = y^3, I’m tempted to reduce such that s^3 ~ x, with s rational and just a few bits wide. Then you have:

cbrt(x) = s * cbrt(1 + (x - s^3)/s^3)

Obvious thing then is to evaluate the correction term using your favorite series approximation, and compute a residual via head-tail FMA arithmetic to bump the result up or down by an ulp if needed (you won’t need the full computation most of the time, obviously).

This isn’t totally in the spirit of the question, but can definitely be made to work, and it’s very easy to prove the necessary bounds this way. Hopefully someone else can suggest something more clever (I used up my cleverness for the month already).

Stephen Canon
  • “This isn’t totally in the spirit of the question” -> I asked this question because I could not find a way to take advantage of a correctly-rounded `pow()`, even with `pow(candidate, 3.)` to compute correctly-rounded cubes of candidates. If the answer is “no, there is no way to make `pow()` useful”, that's the answer. – Pascal Cuoq Aug 05 '13 at 17:52
  • I’m definitely not prepared to say that there’s no way. But I’m not seeing one right away either. – Stephen Canon Aug 05 '13 at 18:20
  • It's frustrating, because just one additional bit of precision (and assuming “exact cases” are taken care of beforehand) would allow to use pow_ru to pick between two candidates distant of one ULP for cbrt_rn(x). Compute pow_ru(midpoint between the candidates, 3) and compare to x. – Pascal Cuoq Aug 05 '13 at 22:37

I have recently written a correctly-rounded cube root, implemented in https://github.com/mockingbirdnest/Principia/blob/master/numerics/cbrt.cpp and documented in https://github.com/mockingbirdnest/Principia/blob/master/documentation/cbrt.pdf. That implementation is in C++ and uses Intel intrinsics, but it should be straightforward to adapt that to the language and libraries of one’s choice.

Brief summary of the actual computation below; the references here use bibliographic codes from that document because I am apparently too disreputable to link many things at this juncture. References to “part I+” or “appendix [A-Z]” are to cbrt.pdf.

Correct rounding.

This can be done by using a nearly-correct faithful method and, if the result is too close to a tie, following that up with one round of the classical digit-by-digit algorithm (the one that is like long division) to get the 54th bit; since that algorithm computes bits of the result rounded toward 0, and since there are no halfway cases for the cube root, this bit suffices to determine rounding to nearest.

It is possible to get the nearly-correct method to be both faster than most existing implementations of the cube root, and correct enough that the average cost is essentially unaffected by the correction step; see appendices D and F.

Remarks on the faithful method without FMA.

The « nearly-correct faithful method » part is perhaps the more interesting; without FMA the basic outline is the same as in Kahan’s [KB01]:

  1. integer arithmetic on the representation of the given floating-point number;
  2. root-finding to a third of the precision;
  3. rounding to a third of the precision, to get an exact cube;
  4. one step of a high-order method using that exact cube.

Using a clever root-finding method in step 2, one can lower the misrounding rate while retaining or improving performance. In this case I dug up an irrational method,

x ↦ x/2 + √(x²/4 + b/(3x)) to approximate ∛(x³ + b),

from a 17th century book by Thomas Fantet de Lagny, [Fan92]; perhaps surprisingly since it has both a division and a square root, its performance when rewritten appropriately (so as to schedule the division as early as possible, and avoiding a serial dependency between the square root and the division, see appendix D) is similar to that of the better known rational method (nowadays often known as Halley’s, but both methods are considered by Halley, and both are due to Lagny when applied to the cube root; see the discussion in part I). This is because the rational method has a divisor of higher degree, so that its division cannot be scheduled early. The error of that irrational method is half that of the rational method.

Optimizing the coefficients to minimize the error at the end of step 2, in a manner inspired by tweets by Stephen Canon (two twitter threads, [Can18a] and [Can18b]) gains another two bits on the error of that irrational method, at no cost.

With a rational method of order 5 in step 4, the method achieves a misrounding rate of 4.564(68) per million, which is easily corrected at essentially no average cost (the rate of passage into the slow “potential misrounding” path is 2.6480(52)·10⁻⁴, and the latency of the slow path is less than ten times that of the fast path).

Remarks on the faithful method with FMA.

The key idea is in njuffa’s answer to this question: step 4 can be performed with only an exact square, without requiring an exact cube, so that step 2 should aim for, and step 3 should round to, half of the precision rather than a third.

Contrary to njuffa, I used double throughout. In step 2, I went for an irrational method of order 5 (obtained by generalizing Lagny’s methods; see part I and appendix B), and in step 4, for a rational method of order 4 rather than 3.

The resulting method has a misrounding rate of 6.10(25) per billion, and a rate of passage into the “potential misrounding” path of 3.05(18)·10⁻⁷.

Robin Leroy
  • Very interesting, thanks. Appreciate the references to historical publications. Are you planning to formally publish this somewhere? Based on the linked PDF I think this could be a good fit for ACM TOMS. – njuffa Jul 18 '21 at 07:33

The code below does not compute a proven correctly rounded double precision cube root (see Robin's answer/work for that case). However, after extensive testing I wasn't able to find any incorrectly rounded results. Therefore the results of function cbrt_ac() might be accidentally correctly rounded.

Halley's method

Indeed Halley's method is very suitable to improve the accuracy of an approximate cube root, see also njuffa's answer. Here we rewrite the Halley update formula slightly:

x := x - x * (x^3 - a) / (2*x^3 + a) = x + x * (a - x^3) / (2*x^3 + a)

If we define d = a - x^3, then x^3 = a - d and

x := x + x * d / (2*(a - d) + a)

:= x + x * ( d / (3*a - 2*d) )

This form of the denominator (3*a - 2*d) is suitable for both efficient and accurate evaluation. In order to compute d = a - x^3 accurately we first compute x^2 as an exact sum of 2 doubles (as suggested by Stephen Canon), using the fma instruction: x^2 = x2_h + x2_l exactly, see e.g. Karp and Markstein, High precision division and square root.

x2_h := x * x

x2_l := fma(x, x, -x2_h)

Now we can compute d = a - x^3 with a relatively high accuracy:

d_h := fma(x, -x2_h, a)

d := fma(x, -x2_l, d_h)

Function cbrt_ac()

In the code below an initial approximation to y = a^(-2/3) is computed in a "fast inverse square root" way. Three division-free (inexpensive) Newton iterations improve the accuracy. r = a^(1/3) is approximated with r = a * y, and the accuracy of this cube root is improved with a pseudo Newton step. This leads to an almost correctly rounded, division-free cube root method with about 0.50002 ulp accuracy.

Finally the accuracy is further improved with a single Halley iteration. After more than 10^10 test calls to cbrt_ac() the maximum observed error was 4.9999999999404426e-01 ulp. Here the double precision results of cbrt_ac() were compared with gcc's quad precision ( __float128 ) function cbrtq. This suggests that the double precision function cbrt_ac() might be correctly rounded.

Please let me know if you find an incorrectly rounded result.

Double precision code

/* Accurate double precision cube root function cbrt_ac() and surrounding test code */
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include <quadmath.h>

/* gcc -O3 -Wall -m64 -std=c99 -march=skylake  cbrt_tst.c -lm -lquadmath */
/* Assumptions:
                - rounding mode: round nearest 
                - safe math: with gcc don't use -ffast-math, with icc enable -fp-model=precise
                - ieee-754 floating point math, higher intermediate floating point precision should not be used in the calculations, i.e. FLT_EVAL_METHOD == 0, which is default with the gcc x86-64 compiler.
*/    

uint64_t d2i (double x);
double i2d (uint64_t i);

double cbrt_ac(double z){                            // Accurate cube root (double)
    double a, y, r, r2_h, r2_l, y_a2y4, ayy, diff, diff3, denom;
    uint64_t ai, ai23, aim23;
    int issmall;
    
    a = fabs(z);
    issmall = (a <  0.015625);                       // Scale large, small and/or subnormal numbers to avoid underflow, overflow or subnormal numbers
    a = issmall ? (a * 0x1.0p+210) : (a * 0.125);
    ai = d2i(a);
    if ((ai >= 0x7FF0000000000000ull) || (z == 0.0)){    // Inf, 0.0 and NaN
        r = z + z;
    }
    else
    { 
        ai23 = 2 * (ai/3);                           // Integer division. The compiler, with suitable optimization level, should generate a much more efficient multiplication by 0xAAAAAAAAAAAAAAAB
        aim23 = 0x6A8EB53800000000ull - ai23;        // This uses a similar idea as the "fast inverse square root" approximation, see https://en.wikipedia.org/wiki/Fast_inverse_square_root
        y = i2d(aim23);                              // y is an approximation of a^(-2/3)

        ayy = (a * y) * y;                           // First Newton iteration for f(y)=a^2-y^-3 to calculate a better approximation y=a^(-2/3)
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33333333333333333333, y);

        ayy = (a * y) * y;                           // Second Newton iteration
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33523333333, y);           // This is a small modification to the exact Newton parameter 1/3 which gives slightly better results

        ayy = (a * y) * y;                           // Third Newton iteration
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33333333333333333333, y);

        r = y * a;                                   // Now r = y * a is an approximation of a^(1/3), because y approximates a^(-2/3).
        r2_h = r * r;                                // Compute one pseudo Newton step with g(r)=a-r^3, but instead of dividing by f'(r)=3r^2 we multiply with 
                                                     // the approximation 0.3333...*y (division is usually a relatively expensive operation)
        r2_l = fma(r, r, -r2_h);                     // For better accuracy we split r*r=r^2 as r^2=r2_h+r2_l exactly.
        diff = fma(r2_h, -r, a);                     // Compute diff=a-r^3 accurately: diff=(a-r*r2_h)-r*r2_l with two fma instructions
        diff = fma(r2_l, -r, diff);               
        diff3 = diff * 0.33333333333333333333; 
        r = fma(diff3, y, r);                        // Now r approximates a^(1/3) within about 0.50002 ulp
                                                     
        r2_h = r * r;                                // One final Halley iteration
        r2_l = fma(r, r, -r2_h);
        diff = fma(r2_h, -r, a);
        diff = fma(r2_l, -r, diff);
        denom = fma(a, 3.0, -2.0 * diff);     
        r = fma(diff/denom, r, r);

        r = issmall ? (r * 0x1.0p-70) : (r * 2.0);   // Undo scaling
        r = copysign(r, z);
    }
    return r;                                     
}



int main(){
    uint64_t i, ai;
    ai=0;
    double x, r;
    double ulp, err, maxerr;
    __float128 rq;
    maxerr=0.0;
    ai=0x7FEFFFFFFFFFFFFFull;
    x = i2d (ai);                                               // To test negative numbers start with x = -i2d (ai);
    for(i=0; (i<12000000000ull)&&(x!=0.0); i=i+1){
        r = cbrt_ac(x);
        
        rq = cbrtq((__float128)x);                              // Compare with quad precision result rounded to double precision
        if ( (r-(double)rq) != 0.0 ){
            printf("not equal %12lu %24.16e   %24.16e   %24.16e   \n", i, x, r, (r-(double)rq)/(double)rq);
        }
        ulp = (nextafter(r, 1.0/0.0)-r);                        // Compare double precision with quad precision and estimate ulp error
        err = fabs(  (double)((__float128)r - rq)  )/ulp;
        if (err>maxerr){
            maxerr=err;
            printf("ulp err %12lu %24.16e   %24.16e   %24.16e    \n", i, x, r, err);
        }
        x = x * 0.9998596;                                      // Multiply with approx 10^(-610/n_test)  10^(-610/1e6)=0.9985964
        x = nextafter(x, 0);                                    // 1e7 -> 0.99985955  1e8 -> 0.999985954  1e9 -> 0.999998595
    }                                                           // After more than 10^10 test calls to cbrt_ac() the  
    printf("i = %12lu  x = %24.16e \n", i, x);                  // maximum observed ulp error is 4.9999999999404426e-01
   
    return 0;
}
union dbl_uint64{
    double d;
    uint64_t i;
};

uint64_t d2i (double x){                 // Bitwise copy (type-punning) from double to uint64_t (no conversion)
    union dbl_uint64 tmp;                // With C++ use memcpy instead of this "union trick"
    tmp.d = x;
    return tmp.i;
}

double i2d (uint64_t i){                 // Bitwise copy (type-punning) from uint64_t to double (no conversion)
    union dbl_uint64 tmp;                // With C++ use memcpy instead of this "union trick"
    tmp.i = i;
    return tmp.d;
}



  
 

Single precision code

Numerical tests with all possible input values of cbrtf_ac(), see code below, indicate that the error of this single precision cube root function is less than 0.5 ulp, hence the result is a correctly rounded cube root.

/* Accurate float cube root function cbrtf_ac() and surrounding test code */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* gcc -O3 -Wall -m64 -std=c99 -march=skylake  cbrtf_tst.c -lm */
/* Assumptions:
                - rounding mode: round nearest 
                - safe math: with gcc don't use -ffast-math, with icc enable -fp-model=precise
                - ieee-754 floating point math, higher intermediate floating point precision should not be used in the calculations, i.e. FLT_EVAL_METHOD == 0, which is default with the gcc x86-64 compiler.
*/    

uint32_t f2i (float x);
float i2f (uint32_t i);
uint64_t d2i (double x);
double i2d (uint64_t i);    
double cbrt_ac(double z);    

float cbrtf_ac(float z){                          // Accurate cubic root (float)
    float a, y, r, r2_h, r2_l, y_a2y4, ayy, diff, diff3, denom;
    uint32_t ai, ai23, aim23;
    int issmall;
    
    a = fabsf(z);
    issmall = (a <  0.015625f);                   // Scale large, small and/or subnormal numbers to avoid underflow, overflow or subnormal numbers
    a = issmall ? (a * 0x1.0p+81f) : (a * 0.125f);
    ai = f2i(a);
    if ((ai > 0x7f7fffffu) || (z == 0.0f)){       // Inf, 0.0 and NaN
        r = z + z;
    }
    else
    {
        ai23 = 2 * (ai/3);                        // Integer division. The compiler, with suitable optimization level, should generate a much more efficient multiplication by 0xAAAAAAAB
        aim23 = 1774904040u - ai23;               // This uses a similar idea as the "fast inverse square root" approximation, see https://en.wikipedia.org/wiki/Fast_inverse_square_root
        y = i2f(aim23);                           // y is an approximation of a^(-2/3)

        ayy = (a * y) * y;                        // Newton iteration for f(y)=a^2-y^-3 to calculate a better approximation y=a^(-2/3)
        y_a2y4 = fmaf(ayy, -ayy, y);
        y = fmaf(y_a2y4, 0.33333333333333333f, y);

        ayy = (a * y) * y;                        // Second Newton iteration
        y_a2y4 = fmaf(ayy, -ayy, y);
        y = fmaf(y_a2y4, 0.3351667f, y);          // This is a small modification to the exact Newton parameter 1/3 which gives slightly better results

        r = y * a;                                // Now r = y * a is an approximation of a^(1/3), because y approximates a^(-2/3).
        r2_h = r * r;                             // Compute one pseudo Newton step on g(r)=a-r^3, but instead of dividing by f'(r)=3r^2 we multiply with 
                                                  // the approximation 0.3333...*y (division is usually a relatively expensive operation)
        r2_l = fmaf(r, r, -r2_h);                 // For better accuracy we split r*r=r^2 as r^2=r2_h+r2_l exactly.
        diff = fmaf(r2_h, -r, a);                 // Compute diff=a-r^3 accurately: diff=(a-r*r2_h)-r*r2_l with two fmaf instructions
        diff = fmaf(r2_l, -r, diff);              // 
        diff3 = diff * 0.333333333333333333f; 
        r = fmaf(diff3, y, r);                    // Now r is a near 0.5 ulp approximation to cbrtf(a)
                                                  
        r2_h = r * r;                             // One final Halley iteration
        r2_l = fmaf(r, r, -r2_h);
        diff = fmaf(r2_h, -r, a);
        diff = fmaf(r2_l, -r, diff);
        denom = fmaf(a, 3.0f, -2.0f * diff);     
        r = fmaf(diff/denom, r, r);

        r = issmall ? (r * 0x1.0p-27f) : (r * 2.0f); // Undo scaling
        r = copysignf(r, z);
    }
    return r;                                     
}



int main(){
    uint32_t i, ai;
    ai=0;
    float x, r;
    double ulp, err, maxerr, rd;
    maxerr=0.0;
    for(i=0; i<0x7f800002u; i=i+1){
        ai=i;
        x = i2f (ai);                                   // To test negative numbers: x = -i2f (ai);
        r = cbrtf_ac(x);
        rd = cbrt_ac( (double)x);                       // Compare with double precision result rounded to single precision
//        rd = nextafter(rd, 1e50);                     // Uncomment one of these lines for 2 alternative tests 
//        rd = nextafter(rd, -1e50);                    // to compensate for a cbrt which might be 1 ulp inaccurate
        if ( (r-(float)rd) != 0.0f ){
            printf("not equal %u %18.10e   %18.10e   %18.10e   %18.10e \n", i, x, r, rd, r-(float)rd);
        }
        ulp = (double)(nextafterf(r, 1.0f/0.0f)-r);     // Compare single precision with double precision and estimate ulp error
        err = fabs((double)r - rd)/ulp;
        if (err>maxerr){
            maxerr=err;
            printf("ulp err %11u %18.10e   %18.10e   %18.10e   \n", i, x, r, err);
        }
    }
    
    return 0;
}
union flt_uint32{
    float f;
    uint32_t i;
};

uint32_t f2i (float x){                 // Bitwise copy (type-punning) from float to int (no conversion)
    union flt_uint32 tmp;               // With C++ use memcpy instead of this "union trick"
    tmp.f = x;
    return tmp.i;
}

float i2f (uint32_t i){                 // Bitwise copy (type-punning) from int to float (no conversion)
    union flt_uint32 tmp;               // With C++ use memcpy instead of this "union trick"
    tmp.i = i;
    return tmp.f;
}


 double cbrt_ac(double z){                            // Accurate cube root (double)
    double a, y, r, r2_h, r2_l, y_a2y4, ayy, diff, diff3, denom;
    uint64_t ai, ai23, aim23;
    int issmall;
    
    a = fabs(z);
    issmall = (a <  0.015625);                       // Scale large, small and/or subnormal numbers to avoid underflow, overflow or subnormal numbers
    a = issmall ? (a * 0x1.0p+210) : (a * 0.125);
    ai = d2i(a);
    if ((ai >= 0x7FF0000000000000ull) || (z == 0.0)){    // Inf, 0.0 and NaN
        r = z + z;
    }
    else
    { 
        ai23 = 2 * (ai/3);                           // Integer division. The compiler, with suitable optimization level, should generate a much more efficient multiplication by 0xAAAAAAAAAAAAAAAB
        aim23 = 0x6A8EB53800000000ull - ai23;        // This uses a similar idea as the "fast inverse square root" approximation, see https://en.wikipedia.org/wiki/Fast_inverse_square_root
        y = i2d(aim23);                              // y is an approximation of a^(-2/3)

        ayy = (a * y) * y;                           // First Newton iteration for f(y)=a^2-y^-3 to calculate a better approximation y=a^(-2/3)
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33333333333333333333, y);

        ayy = (a * y) * y;                           // Second Newton iteration
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33523333333, y);           // This is a small modification to the exact Newton parameter 1/3 which gives slightly better results

        ayy = (a * y) * y;                           // Third Newton iteration
        y_a2y4 = fma(ayy, -ayy, y);
        y = fma(y_a2y4, 0.33333333333333333333, y);

        r = y * a;                                   // Now r = y * a is an approximation of a^(1/3), because y approximates a^(-2/3).
        r2_h = r * r;                                // Compute one pseudo Newton step with g(r)=a-r^3, but instead of dividing by f'(r)=3r^2 we multiply with 
                                                     // the approximation 0.3333...*y (division is usually a relatively expensive operation)
        r2_l = fma(r, r, -r2_h);                     // For better accuracy we split r*r=r^2 as r^2=r2_h+r2_l exactly.
        diff = fma(r2_h, -r, a);                     // Compute diff=a-r^3 accurately: diff=(a-r*r2_h)-r*r2_l with two fma instructions
        diff = fma(r2_l, -r, diff);               
        diff3 = diff * 0.33333333333333333333; 
        r = fma(diff3, y, r);                        // Now r approximates a^(1/3) within about 0.50002 ulp
                                                     
        r2_h = r * r;                                // One final Halley iteration
        r2_l = fma(r, r, -r2_h);
        diff = fma(r2_h, -r, a);
        diff = fma(r2_l, -r, diff);
        denom = fma(a, 3.0, -2.0 * diff);     
        r = fma(diff/denom, r, r);

        r = issmall ? (r * 0x1.0p-70) : (r * 2.0);   // Undo scaling
        r = copysign(r, z);
    }
    return r;                                     
}


union dbl_uint64{
    double d;
    uint64_t i;
};

uint64_t d2i (double x){                // Bitwise copy (type-punning) from double to uint64_t (no conversion)
    union dbl_uint64 tmp;               // With C++ use memcpy instead of this "union trick"
    tmp.d = x;
    return tmp.i;
}

double i2d (uint64_t i){                // Bitwise copy (type-punning) from uint64_t to double (no conversion)
    union dbl_uint64 tmp;               // With C++ use memcpy instead of this "union trick"
    tmp.i = i;
    return tmp.d;
}
wim
  • Nice work on the `cbrt_ac()` implementation. I ran 16.06 billion random test vectors against an arbitrary-precision library configured for 250 bits of precision. The maximum error encountered was 0.4999999999900527 ulps with a function argument of `0x1.9744ddf209acfp+9`. – njuffa Sep 05 '22 at 09:26
  • @njuffa Thanks for your comment and for testing the function against an arbitrary precision library! The error of 4.99999999994044e-01 ulp I found was with a function argument of 0x1.0cf9ef34d1abcp+2. – wim Sep 07 '22 at 20:49