28

I have my own, very fast sine/cosine functions:

// Needs <math.h> for fabsf
static const float pi = 3.14159265f;

float sine(float x)
{
    const float B = 4.0f / pi;
    const float C = -4.0f / (pi * pi);

    // Parabolic approximation of sin(x), valid for x in [-pi, pi]
    float y = B * x + C * x * fabsf(x);

    // Extra-precision step: y = Q * y + P * y * |y|, with Q = 0.775, P = 0.225
    const float P = 0.225f;
    y = P * (y * fabsf(y) - y) + y;

    return y;
}

float cosine(float x)
{
    return sine(x + (pi / 2));
}

But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)? Thanks.

jmasterx
  • 52,639
  • 96
  • 311
  • 557
  • 9
    Your very fast function has a mean error of 16% in [-pi,pi] and is entirely unusable outside that interval. The standard `sinf` from `math.h` on my system takes only about 2.5x as much time as your approximation. Considering your function is inlined and the lib call is not, this is really not much of a difference. My guess is that if you added range reduction so it was usable in the same way as the standard function, you would have exactly the same speed. – Damon Jul 05 '12 at 14:50
  • 6
    No, the maximum error is 0.001 (1/10th %). Did you forget to apply the correction? (y = P * bla...) Look at the original source and discussion here: http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ Second, sin and cos pre-bounded by +-pi is a VERY common case, especially in graphics and simulation, both of which often require a fast approximate sin/cos. – jcwenger Oct 18 '12 at 20:15
  • This is a really intriguing problem, thanks for asking! – JosephDoggie Jul 30 '19 at 19:56

10 Answers

41

A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:

double acos(double x) {
    return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}

It has a maximum error of about 0.18 rad.
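
A quick way to check that bound (a sketch; the approximation is renamed acos_approx so it doesn't clash with math.h, and the error is scanned on a grid over [-1, 1]):

#include <stdio.h>
#include <math.h>

double acos_approx(double x) {
    return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}

int main(void) {
    double maxerr = 0.0;
    for (int i = 0; i <= 200000; i++) {
        double x = -1.0 + 2.0 * i / 200000.0;        // sweep [-1, 1]
        double err = fabs(acos_approx(x) - acos(x));
        if (err > maxerr) maxerr = err;
    }
    printf("max abs error = %f rad\n", maxerr);      // about 0.18
    return 0;
}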

dan04
  • 87,747
  • 23
  • 163
  • 198
  • 2
    Maximum error is 10.31 degrees. Rather big, but in some solutions it may be enough. Suitable where computational speed is more important than precision. Maybe a quartic approximation would produce more precision and still be faster than native acos? – Timo Kähkönen Feb 02 '13 at 17:22
  • 1
    Sure there isn't a mistake in this formula? Just tried it with Wolfram Alpha and it doesn't look right: http://www.wolframalpha.com/input/?i=y%3D%282%2F9*pi*x*x-5*pi%2F18%29*x%2Bpi%2F2 – miho Apr 26 '13 at 14:29
27

Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.
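
A minimal C sketch of that idea, following the indexing scheme Ben Voigt outlines in the comments below (the table size and names here are mine):

#include <math.h>

#define TABLE_SIZE 128                    // number of intervals over [-1, 1]

static float acos_table[TABLE_SIZE + 1];

void acos_table_init(void) {              // call once at startup
    for (int i = 0; i <= TABLE_SIZE; i++)
        acos_table[i] = acosf(-1.0f + 2.0f * i / TABLE_SIZE);
}

float acos_lut(float x) {                 // assumes -1 <= x <= 1
    float f = (x + 1.0f) * (TABLE_SIZE / 2.0f);  // map to [0, TABLE_SIZE]
    int   i = (int)f;                     // integer part indexes the table
    if (i >= TABLE_SIZE) i = TABLE_SIZE - 1;
    float t = f - i;                      // fractional part interpolates
    return acos_table[i] + t * (acos_table[i + 1] - acos_table[i]);
}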

spender
  • 117,338
  • 33
  • 229
  • 351
  • 1
    How could I implement this as a C function? – jmasterx Aug 01 '10 at 03:42
  • 7
    @Jex: bounds-check your argument (it must be between -1 and 1). Then multiply by a nice power of 2, say 64, yielding the range (-64, 64). Add 64 to make it non-negative (0, 128). Use the integer part to index a lookup table, if desired use the fractional part for interpolation between the two closest entries. If you don't want interpolation, try adding 64.5 and take the floor, this is the same as round-to-nearest. – Ben Voigt Aug 01 '10 at 05:42
  • 4
    Lookup tables require an index, which is going to require a float to int conversion, which will probably kill performance. – phkahler Aug 02 '10 at 13:15
  • 2
    @phkahler: float-to-int conversion is very cheap on x86, almost as cheap as an FP add, as you can see [in Agner Fog's latency/throughput/uop tables](http://agner.org/optimize). Range-checking the index to make sure it doesn't index outside the table is probably about as expensive. `int idx = x * 4096.0` will have ~ 9 cycle latency on Intel Haswell. By far the most expensive part of this is the cache miss from a decent-sized table. If there isn't a bunch of parallel computation that doesn't depend on the acos result, a large table will probably be slower (esp. with cache competition). – Peter Cordes Mar 27 '16 at 11:58
25

nVidia has some great resources that show how to approximate otherwise very expensive math functions such as acos, asin, atan2, and so on.

These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:

// Absolute error <= 6.7e-5
float acos(float x) {
  float negate = float(x < 0);
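  // (float(x < 0) is Cg/HLSL-style: 1.0 when x < 0, else 0.0.
  //  In plain C, use (x < 0.0f) ? 1.0f : 0.0f, fabsf, and sqrtf.)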
  x = abs(x);
  float ret = -0.0187293;
  ret = ret * x;
  ret = ret + 0.0742610;
  ret = ret * x;
  ret = ret - 0.2121144;
  ret = ret * x;
  ret = ret + 1.5707288;
  ret = ret * sqrt(1.0-x);
  ret = ret - 2 * negate * ret;
  return negate * 3.14159265358979 + ret;
}

And here are the results for when calculating acos(0.5):

nVidia:   result: 1.0471513828611643
math.h:   result: 1.0471975511965976

That's pretty close! Depending on your required degree of precision, this might be a good option for you.
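
For instance, a quick C harness to reproduce that comparison (a sketch; the approximation is translated to plain C and renamed nv_acos so it does not clash with math.h):

#include <stdio.h>
#include <math.h>

float nv_acos(float x) {                  // nVidia's approximation, C-ified
  float negate = (x < 0.0f) ? 1.0f : 0.0f;
  x = fabsf(x);
  float ret = -0.0187293f;
  ret = ret * x + 0.0742610f;
  ret = ret * x - 0.2121144f;
  ret = ret * x + 1.5707288f;
  ret = ret * sqrtf(1.0f - x);
  ret = ret - 2.0f * negate * ret;
  return negate * 3.14159265358979f + ret;
}

int main(void) {
  printf("nVidia: %.16f\n", (double)nv_acos(0.5f));
  printf("math.h: %.16f\n", acos(0.5));
  return 0;
}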

Fnord
  • 5,365
  • 4
  • 31
  • 48
  • 2
    In contrast to what the comment in the "reference implementation" on the nvidia site says, the absolute error is not <= 6.7e-5: I was able to observe an error of 6.759167e-05. Also, how sure are you that this function is actually faster than plain acos? On a 4th generation Core i5 (Haswell), the nvidia function was consistently 25% slower than acos from math.h. – josch Oct 02 '16 at 09:30
  • 3
    Any reason to use `3.14159265358979` instead of `math.pi` ? – ideasman42 Jun 27 '17 at 05:31
  • 1
    @ideasman42 Python was irrelevant to my answer. My goal was just to point to nVidia's docs on approximation methods as a resource. So I edited my answer for clarity. But to answer your question: my guess is that these numbers were chosen to work well together, so using math.pi might not seem to make much of a difference in most cases, until you hit the edge case where the error threshold worsens. – Fnord Jun 29 '17 at 04:44
  • 2
    The reason for asking is there is a very small difference leading to the question: When porting this to any other language - should the constant for pi be used? Or is this an intentionally slightly modified pi, tuned to work better with the approximation? *(can be tested without too much trouble of course)* – ideasman42 Jun 30 '17 at 00:06
  • 1
    what does float(x < 0) mean? How can I turn it to scala? – mingzhao.pro Jul 01 '17 at 10:05
    @mingzhao.pro it converts the true/false result of the x<0 expression into a float. So if x<0 is true it converts it to 1.0, if false: 0.0 – Fnord Jul 01 '17 at 18:33
  • This approximation is within 10e-5 of the built-in acos() in a desktop opengl implementation I tried. Quite good! But it's also basically the same speed as the built-in acos() on a low end Android device I tested on. – Steve Dec 04 '17 at 23:40
11

I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4

Here's my arccos code:

function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end

A lot of that is just square root approximation: each `a = (a + n/a)/2` step is one Newton-Raphson iteration refining a linear seed toward √n. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x = 0.99 to 1) of 0.0003. The problem is that at 0.99 it starts to break down, and at x = 1 the error grows to 0.05. That could be solved by doing more iterations on the square roots, or by switching to a different set of square-root linearizations when x > 0.99, but that makes the code long and ugly.

If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep the error somewhere around 0.0162:

function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return 8/3*c-b/3
end

If you're okay with it, you can use pre-existing square root code. That gets rid of the equation going a bit crazy at x = 1:

function acos(x)
    local a = math.sqrt(2+2*x)
    local b = math.sqrt(2-2*x)
    local c = math.sqrt(2-a)
    return 8/3*c - b/3
end

Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 1.57079 - 1.57079x and just do:

function acos(x)
    return 1.57079-1.57079*x
end

Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l. I know that my approximations aren't the best for certain things, but if you're doing something where they would be useful, please use them, and try to give me credit.

Trey Reynolds
  • 693
  • 8
  • 11
  • 1
    Is this what you meant for the last one? 1.57079-1.57079*x. – FocusedWolf Jun 30 '15 at 16:26
  • 1
    Also for anyone using c# this might be a good firstline: if (x < -1D || x > 1D || Double.IsNaN(x)) return Double.NaN; to be consistent with the .net framework acos function: https://msdn.microsoft.com/en-us/library/system.math.acos(v=vs.110).aspx – FocusedWolf Jun 30 '15 at 19:26
  • 1
    Your first implementation is way off at x = -1, like by 0.5rad. – Steve Dec 01 '17 at 03:56
9

You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.

acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)

where

a = -0.939115566365855
b =  0.9217841528914573
c = -1.2845906244690837
d =  0.295624144969963174

has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:

Plot of acos(x) (black), a cubic polynomial approximation (red), and the function above (blue)

The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.

A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
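
For reference, a direct C translation of the formula above might look like this (a sketch; the function name is mine, and the fmaf calls correspond to the fused multiply-adds mentioned above):

#include <math.h>

float acos_rational(float x) {                   // for x in (-1, 1)
    const float a = -0.939115566365855f;
    const float b =  0.9217841528914573f;
    const float c = -1.2845906244690837f;
    const float d =  0.295624144969963174f;
    float x2  = x * x;
    float num = x * fmaf(b, x2, a);              // ax + bx^3
    float den = fmaf(x2, fmaf(d, x2, c), 1.0f);  // 1 + cx^2 + dx^4
    return 1.5707963267948966f + num / den;      // pi/2 + num/den
}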

Ruud
  • 3,118
  • 3
  • 39
  • 51
  • Is there a source for these coefficients? – Gokul Jun 30 '21 at 09:20
  • @Gokul they are computed by this script: https://github.com/ruuda/convector/blob/2f5f2428fa6c54002bd2ee8ce3d0f2188aab49f8/tools/approx_acos.py – Ruud Jul 01 '21 at 18:39
6

Another approach you could take is to use complex numbers. From de Moivre's formula,

ⅈ^x = cos(π/2*x) + ⅈ*sin(π/2*x)

Let θ = π/2*x. Then x = 2θ/π, so

  • sin(θ) = ℑ(ⅈ^(2θ/π))
  • cos(θ) = ℜ(ⅈ^(2θ/π))

How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:

  • ⅈ^4 = 1
  • ⅈ^2 = -1
  • ⅈ^1 = ⅈ
  • ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ
  • ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ
  • ⅈ^(1/8) = 0.9807852804032304 + 0.19509032201612825*ⅈ
  • ⅈ^(1/16) = 0.9951847266721969 + 0.0980171403295606*ⅈ
  • ⅈ^(1/32) = 0.9987954562051724 + 0.049067674327418015*ⅈ
  • ⅈ^(1/64) = 0.9996988186962042 + 0.024541228522912288*ⅈ
  • ⅈ^(1/128) = 0.9999247018391445 + 0.012271538285719925*ⅈ
  • ⅈ^(1/256) = 0.9999811752826011 + 0.006135884649154475*ⅈ

To calculate arbitrary values of ⅈ^x, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.

For example, to find sin and cos of 72° = 0.8π/2:

ⅈ^0.8 ≈ ⅈ^(205/256) = ⅈ^0b0.11001101 = ⅈ^(1/2) * ⅈ^(1/4) * ⅈ^(1/32) * ⅈ^(1/64) * ⅈ^(1/256)
= 0.3078496400415349 + 0.9514350209690084*ⅈ

  • sin(72°) ≈ 0.9514350209690084 ("exact" value is 0.9510565162951535)
  • cos(72°) ≈ 0.3078496400415349 ("exact" value is 0.30901699437494745).

To find asin and acos, you can use this table with the Bisection Method:

For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):

  • ⅈ^0 = 1 + 0*ⅈ. The sin is too small, so increase x by 1/2.
  • ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ. The sin is too big, so decrease x by 1/4.
  • ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ. The sin is too small, so increase x by 1/8.
  • ⅈ^(3/8) = 0.8314696123025452 + 0.5555702330196022*ⅈ. The sin is still too small, so increase x by 1/16.
  • ⅈ^(7/16) = 0.773010453362737 + 0.6343932841636455*ⅈ. The sin is too big, so decrease x by 1/32.
  • ⅈ^(13/32) = 0.8032075314806449 + 0.5956993044924334*ⅈ.

Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.

If we stop here, we obtain asin(0.6) ≈ 13/32 * π/2 = 0.6381360077604268. (The "exact" value is 0.6435011087932844.)

The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.
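
A compact C sketch of that bisection (names are mine; for brevity the table of powers is filled with cexp at startup rather than hardcoded as above):

#include <complex.h>
#include <math.h>   // M_PI assumes a POSIX-style math.h

// Binary-search x in [0, 1] so that Im(i^x) = s; then asin(s) = x * pi/2.
double bisect_asin(double s, int iters) {
    double complex powi[60];                  // powi[k] = i^(1/2^(k+1))
    if (iters > 60) iters = 60;
    for (int k = 0; k < iters; k++)
        powi[k] = cexp(I * (M_PI / 2.0) * ldexp(1.0, -(k + 1)));

    double complex z = 1.0;                   // z tracks i^x, starting at x = 0
    double x = 0.0, step = 0.5;
    for (int k = 0; k < iters; k++) {
        if (cimag(z) < s) { z *= powi[k]; x += step; }  // sin too small
        else              { z /= powi[k]; x -= step; }  // sin too big
        step *= 0.5;
    }
    return x * (M_PI / 2.0);
}

With 5 iterations, bisect_asin(0.6, 5) lands on 13/32 * π/2, matching the walkthrough above.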

dan04
  • 87,747
  • 23
  • 163
  • 198
5

A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
    const float PI = 3.14159265f;
    const float C  = 0.10501094f;
    float r, s, t, u;
    t = (a < 0) ? (-a) : a;  // handle negative arguments
    u = 1.0f - t;
    s = sqrtf (u + u);
    r = C * u * s + s;  // or fmaf (C * u, s, s) if FMA support in hardware
    if (a < 0) r = PI - r;  // handle negative arguments
    return r;
}

float uint_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof(r));
    return r;
}

int main (void)
{
    double maxrelerr = 0.0;
    uint32_t a = 0;
    do {
        float x = uint_as_float (a);
        float r = fast_acos (x);
        double xx = (double)x;
        double res = (double)r;
        double ref = acos (xx);
        double relerr = (res - ref) / ref;
        if (fabs (relerr) > maxrelerr) {
            maxrelerr = fabs (relerr);
            printf ("xx=% 15.8e  res=% 15.8e  ref=% 15.8e  rel.err=% 15.8e\n",
                    xx, res, ref, relerr);
        }
        a++;
    } while (a);
    printf ("maximum relative error = %15.8e\n", maxrelerr);
    return EXIT_SUCCESS;
}

The output of the above test scaffold should look similar to this:

xx= 0.00000000e+000  res= 1.56272149e+000  ref= 1.57079633e+000  rel.err=-5.14060021e-003
xx= 2.98023259e-008  res= 1.56272137e+000  ref= 1.57079630e+000  rel.err=-5.14065723e-003
xx= 8.94069672e-008  res= 1.56272125e+000  ref= 1.57079624e+000  rel.err=-5.14069537e-003
xx=-2.98023259e-008  res= 1.57887137e+000  ref= 1.57079636e+000  rel.err= 5.14071269e-003
xx=-8.94069672e-008  res= 1.57887149e+000  ref= 1.57079642e+000  rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003
njuffa
  • 23,970
  • 4
  • 78
  • 130
3

Here is a great website with many options: https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html

Personally I went with the Chebyshev-Padé quotient approximation, using the following code:

double arccos(double x) {
    const double pi = 3.141592653;
    const double u = 2*x - 1;
    return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*u*u
                     + .1475622352*u*u*u)
                  / (2.006022274 - 2.343685222*x + .3316406750*u*u
                     + .02607135626*u*u*u);
}
2

If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc. Unlike the earliest classic ASM code you can find, it uses FMUL instead of the slower FDIV. It compiles and works with Microsoft VC++ 2005 Express/Pro, which is what I always stick with for various reasons.

It's a little tricky to set up a function with "__declspec(naked)/__fastcall", pull parameters correctly, and handle the stack, so it's not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, and I can rewrite it in a slightly friendlier __asm{} block. If this were a critical part of a function in a loop, I would manually inline it for further performance gains.

extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);

// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927

__declspec(naked) float __fastcall fs_acos(float x) { __asm {
    FLD   DWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
    FLD1            ;// Load 1.0
    FADD  ST, ST(1) ;// Compute 1.0 + 'x'
    FLD1            ;// Load 1.0
    FSUB  ST, ST(2) ;// Compute 1.0 - 'x'
    FMULP ST(1), ST ;// Compute (1-x) * (1+x)
    FSQRT           ;// Compute sqrt(result)
    FXCH  ST(1)
    FPATAN          ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET 4
}}

__declspec(naked) double __fastcall fs_Acos(double x) { __asm { //
    FLD   QWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
    FLD1            ;// Load 1.0
    FADD  ST, ST(1) ;// Compute (1.0 + 'x')
    FLD1            ;// Load 1.0
    FSUB  ST, ST(2) ;// Compute (1.0 - 'x')
    FMULP ST(1), ST ;// Compute (1-x) * (1+x)
    FSQRT           ;// Compute sqrt((1-x) * (1+x))
    FXCH  ST(1) 
    FPATAN          ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET 8
}}
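
For reference, the identity the assembly implements, written as portable C (a sketch; it leans on the library atan2f rather than raw FPATAN):

#include <math.h>

// acos(x) = atan2(sqrt((1 - x) * (1 + x)), x), for -1 <= x <= 1
float fs_acos_c(float x)
{
    return atan2f(sqrtf((1.0f - x) * (1.0f + x)), x);
}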
John Doe
  • 303
  • 1
  • 8
  • 2
    I doubt that the FPU will be faster than SSE instructions; also, it's unusable for x64 targets since MSVC won't allow inline asm blocks for those targets – Iris Technologies Nov 16 '20 at 23:44
1

Unfortunately I do not have enough reputation to comment. Here is a small modification of Nvidia's function that handles inputs which should be <= 1.0 while preserving performance as much as possible.

This may be important, since rounding errors can leave a number that should be 1.0 (oh so slightly) larger than 1.0, making sqrt(1.0-x) produce a NaN.


#include <cmath>

double safer_acos(double x) {
  double negate = double(x < 0);
  x = std::fabs(x);
  x -= double(x > 1.0) * (x - 1.0); // <- equivalent to min(1.0, x), but faster
  double ret = -0.0187293;
  ret = ret * x;
  ret = ret + 0.0742610;
  ret = ret * x;
  ret = ret - 0.2121144;
  ret = ret * x;
  ret = ret + 1.5707288;
  ret = ret * std::sqrt(1.0 - x);
  ret = ret - 2 * negate * ret;
  return negate * 3.14159265358979 + ret;

  // In a single line (no gain using gcc):
  // return negate * 3.14159265358979 + (((((-0.0187293*x) + 0.0742610)*x - 0.2121144)*x + 1.5707288) * std::sqrt(1.0-x)) * (1.0 - 2.0*negate);
}
SergeD
  • 44
  • 9