CORDIC Arcsine implementation fails

Question

I have recently implemented a library of CORDIC functions to reduce the required computational power (my project is based on a PowerPC and is extremely strict in its execution time specifications). The language is ANSI-C.

The other functions (sin/cos/atan) work within accuracy limits both in 32 and in 64 bit implementations.

Unfortunately, the asin() function fails systematically for certain inputs.

For testing purposes I have implemented an .h file to be used in a simulink S-Function. (This is only for my convenience, you can compile the following as a standalone .exe with minimal changes)

Note: I have forced 32 iterations because I am working in 32 bit precision and the maximum possible accuracy is required.

Cordic.h:

#include <stdio.h>
#include <stdlib.h>

#define FLOAT32 float
#define INT32 signed long int
#define BIT_XOR ^

#define CORDIC_1K_32 0x26DD3B6A                  
#define MUL_32       1073741824.0F     /*needed to scale float -> int*/            
#define INV_MUL_32   9.313225746E-10F  /*needed to scale int -> float*/

INT32 CORDIC_CTAB_32 [] = {0x3243f6a8, 0x1dac6705, 0x0fadbafc, 0x07f56ea6, 0x03feab76, 0x01ffd55b, 0x00fffaaa, 0x007fff55, 
                           0x003fffea, 0x001ffffd, 0x000fffff, 0x0007ffff, 0x0003ffff, 0x0001ffff, 0x0000ffff, 0x00007fff, 
                           0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f, 
                           0x0000003f, 0x0000001f, 0x0000000f, 0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000};

/* CORDIC Arcsine Core: vectoring mode */
INT32 CORDIC_asin(INT32 arc_in)
{
  INT32 k;
  INT32 d;
  INT32 tx;
  INT32 ty;
  INT32 x;
  INT32 y;
  INT32 z;

  x=CORDIC_1K_32;
  y=0;
  z=0;

  for (k=0; k<32; ++k)
  {
    d = (arc_in - y)>>(31);
    tx = x - (((y>>k) BIT_XOR d) - d);
    ty = y + (((x>>k) BIT_XOR d) - d);
    z += ((CORDIC_CTAB_32[k] BIT_XOR d) - d);

    x = tx; 
    y = ty; 
  }  
  return z; 
}

/* Wrapper function for scaling in-out of cordic core*/
FLOAT32 asin_wrap(FLOAT32 arc)
{
  return ((FLOAT32)(CORDIC_asin((INT32)(arc*MUL_32))*INV_MUL_32));
}

This can be called in a manner similar to:

#include "Cordic.h"
#include "math.h"

void main()
{
    y1 = asin_wrap(value_32); /*my implementation*/
    y2 = asinf(value_32);     /*standard math.h for comparison*/
}

The results are as shown: enter image description here

Top left shows the [-1;1] input over 2000 steps (0.001 increments), bottom left the output of my function, bottom right the standard output and top right the difference of the two outputs.

It is immediate to see that the error is not within 32 bit accuracy.

I have analysed the steps performed (and the intermediate results) by my code and it seems to me that at a certain point the value of y is "close enough" to the initial value of arc_in and what could be related to a bit-shift causes the solution to diverge.

My questions:

I am at a loss, is this error inherent in the CORDIC implementation or have I made a mistake in the implementation? I was expecting the decrease of accuracy near the extremes, but those spikes in the middle are quite unexpected. (the most notable ones are just beyond +/- 0.6, but even removed these there are more at smaller values, albeit not as pronounced)
If it is something part of the CORDIC implementation, are there known workarounds?

EDIT:

Since some comment mention it, yes, I tested the definition of INT32, even writing #define INT32 int32_T does not change the results by the slightest amount.
The computation time on the target hardware has been measured by hundreds of repetitions of block of 10.000 iterations of the function with random input in the validity range. The observed mean results (for one call of the function) are as follows: math.h asinf() 100.00 microseconds CORDIC asin() 5.15 microseconds
(apparently the previous test had been faulty, a new cross-test has obtained no better than an average of 100 microseconds across the validity range)
I apparently found a better implementation. It can be downloaded in matlab version here and in C here. I will analyse more its inner workings and report later.

One place where you might be losing some precision would be where you're rigthshifting by 31 (i.e. dividing by 2^^31). In integer aritmetics, you should usually do something like `d = (arc_in - y); d += 1 << 30; d = d >> 31;`. — vgru, Sep 22 '14 at 14:38
@Groo I will check, but that step is used only to extract the sign bit. — Federico, Sep 22 '14 at 14:39
Oh, that might be true, I merely went over it quickly. Are you sure that right shift is implemented as an *arithmetic* right shift on your platform? OTOH, why didn't you just use an [existing implementation](http://www.voidware.com/cordic.htm), maybe just ported it to PowerPC if needed? — vgru, Sep 22 '14 at 14:42
@Groo thanks, I will test them and compare the results. Yes, the bit shifting has been tested. — Federico, Sep 22 '14 at 14:44
@chux thanks, but that's an extract of the platform API copied here to make a self-contained MWE, not my responsibility. — Federico, Sep 22 '14 at 14:46
The implementation linked by Groo exhibits the same odd behaviour. Strangely, the result of arcsin from 0.60726 to 0.68514 is fixed at 0.754805. I would think that if this is a known property of CORDIC it would be documented somewhere but there doesn't seem to be much available about the algorithm's accuracy. — uesp, Sep 22 '14 at 17:12
Note: I was testing with MSVC 2010 but get identical results on IdeOne (http://ideone.com/XkUmf9). — uesp, Sep 22 '14 at 17:24
@uesp thanks! I indeed forgot to mention the compiler, MSVC 2010 also for me, but thanks for checking! — Federico, Sep 22 '14 at 17:28
I did find this (http://cordic-bibliography.blogspot.ca/p/double-iterations-in-cordic.html) which mentions that double-iterations are needed for some functions (including arcsin) to reduce errors for some points but haven't been able to get an actual implementation working. Have you tried any non-CORDIC methods for computing asin? — uesp, Sep 22 '14 at 17:58
are you sure `signed long int` is 32bit on your platform? You could use `int32_t` to remove any doubt here — M.M, Sep 22 '14 at 20:20
@MattMcNabb please tell me which part of `that's an extract of the platform API copied here to make a self-contained MWE, not my responsibility` is not clear. — Federico, Sep 22 '14 at 20:33
@Federico perhaps that is the source of the bug, doesn't matter whose responsibility it is — M.M, Sep 22 '14 at 20:34

score 5 · Accepted Answer · edited May 23 '17 at 12:06

To review a few things mentioned in the comments:

The given code outputs values identical to another CORDIC implementation. This includes the stated inaccuracies.
The largest error is as you approach arcsin(1).
The second largest error is that the values of arcsin(0.60726) to arcsin(0.68514) all return 0.754805.
There are some vague references to inaccuracies in the CORDIC method for some functions including arcsin. The given solution is to perform "double-iterations" although I have been unable to get this to work (all values give a large amount of error).
The alternate CORDIC implemention has a comment /* |a| < 0.98 */ in the arcsin() implementation which would seem to reinforce that there is known inaccuracies close to 1.

As a rough comparison of a few different methods consider the following results (all tests performed on a desktop, Windows7 computer using MSVC++ 2010, benchmarks timed using 10M iterations over the arcsin() range 0-1):

Question CORDIC Code: 1050 ms, 0.008 avg error, 0.173 max error
Alternate CORDIC Code (ref): 2600 ms, 0.008 avg error, 0.173 max error
atan() CORDIC Code: 2900 ms, 0.21 avg error, 0.28 max error
CORDIC Using Double-Iterations: 4700 ms, 0.26 avg error, 0.917 max error (???)
Math Built-in asin(): 200 ms, 0 avg error, 0 max error
Rational Approximation (ref): 250 ms, 0.21 avg error, 0.26 max error
Linear Table Lookup (see below) 100 ms, 0.000001 avg error, 0.00003 max error
Taylor Series (7th power, ref): 300 ms, 0.01 avg error, 0.16 max error

These results are on a desktop so how relevant they would be for an embedded system is a good question. If in doubt, profiling/benchmarking on the relevant system would be advised. Most solutions tested don't have very good accuracy over the range (0-1) and all but one are actually slower than the built-in asin() function.

The linear table lookup code is posted below and is my usual method for any expensive mathematical function when speed is desired over accuracy. It simply uses a 1024 element table with linear interpolation. It seems to be both the fastest and most accurate of all methods tested, although the built-in asin() is not much slower really (test it!). It can easily be adjusted for more or less accuracy by changing the size of the table.

// Please test this code before using in anything important!
const size_t ASIN_TABLE_SIZE = 1024;
double asin_table[ASIN_TABLE_SIZE];

int init_asin_table (void)
{
    for (size_t i = 0; i < ASIN_TABLE_SIZE; ++i)
    {
        float f = (float) i / ASIN_TABLE_SIZE;
        asin_table[i] = asin(f);
    }    

    return 0;
}

double asin_table (double a)
{
    static int s_Init = init_asin_table(); // Call automatically the first time or call it manually
    double sign = 1.0;

    if (a < 0) 
    {
        a = -a;
        sign = -1.0;
    }

    if (a > 1) return 0;

    double fi = a * ASIN_TABLE_SIZE;
    double decimal = fi - (int)fi;

    size_t i = fi;
    if (i >= ASIN_TABLE_SIZE-1) return Sign * 3.14159265359/2;

    return Sign * ((1.0 - decimal)*asin_table[i] + decimal*asin_table[i+1]);
}

thanks for the answer and the tests! As for speed, it has been tested on the specific hardware and the CORDIC implementation posted above is faster by about a factor of 4-5 w.r.t. the built-in `asinf()`. A LUT approach would not be ideal for the memory requirements, but will be used if all else fails. — Federico, Sep 22 '14 at 20:30
You can adjust the LUT size to balance your memory requirements and desired accuracy (unless you are very tight on memory). For example, a 100 entry `float` LUT still has better average/max error than the unmodified CORDIC. While I've never tried it, you could also try a higher order approximation using the LUT. — uesp, Sep 22 '14 at 21:17
I accepted this answer because in the end the accepted solution (Rational Approximation) has been proposed by it. If you're interested, with CORDIC `atan2()` and fast inverse sqrt(), on the target hardware this approximation runs in 66microsecods vs the 100 of the built-in version. — Federico, Oct 23 '14 at 11:49
"The alternate CORDIC implemention has a comment `/* |a| < 0.98 */`". It suffer from the same issue as the question code. Discrepancies start already at `a=0.62` (https://godbolt.org/z/zr9mBR). — Friedrich -- Слава Україні, Mar 22 '20 at 03:36

score 4 · Answer 2 · 2014-09-23T23:12:14.003

The "single rotate" arcsine goes badly wrong when the argument is just greater than the initial value of 'x', where that is the magical scaling factor -- 1/An ~= 0.607252935 ~= 0x26DD3B6A.

This is because, for all arguments > 0, the first step always has y = 0 < arg, so d = +1, which sets y = 1/An, and leaves x = 1/An. Looking at the second step:

if arg <= 1/An, then d = -1, and the steps which follow converge to a good answer
if arg > 1/An, then d = +1, and this step moves further away from the right answer, and for a range of values a little bigger than 1/An, the subsequent steps all have d = -1, but are unable to correct the result :-(

I found:

 arg = 0.607 (ie 0x26D91687), relative error 7.139E-09 -- OK    
 arg = 0.608 (ie 0x26E978D5), relative error 1.550E-01 -- APALLING !!
 arg = 0.685 (ie 0x2BD70A3D), relative error 2.667E-04 -- BAD !!
 arg = 0.686 (ie 0x2BE76C8B), relative error 1.232E-09 -- OK, again

The descriptions of the method warn about abs(arg) >= 0.98 (or so), and I found that somewhere after 0.986 the process fails to converge and the relative error jumps to ~5E-02 and hits 1E-01 (!!) at arg=1 :-(

As you did, I also found that for 0.303 < arg < 0.313 the relative error jumps to ~3E-02, and reduces slowly until things return to normal. (In this case step 2 overshoots so far that the remaining steps cannot correct it.)

So... the single rotate CORDIC for arcsine looks rubbish to me :-(

Added later... when I looked even closer at the single rotate CORDIC, I found many more small regions where the relative error is BAD...

...so I would not touch this as a method at all... it's not just rubbish, it's useless.

BTW: I thoroughly recommend "Software Manual for the Elementary Functions", William Cody and William Waite, Prentice-Hall, 1980. The methods for calculating the functions are not so interesting any more (but there is a thorough, practical discussion of the relevant range-reductions required). However, for each function they give a good test procedure.

While the first part of this answer helps understanding that it was not an implementation failure, the fact that it is "useless" is the reason for which I asked the question in the first place, so is not adding anything to the discussion. — Federico, Sep 24 '14 at 06:28
OK... well, everything I have read about CORDIC suggests that it is powerful magic. I am frankly disappointed to discover that this application of CORDIC (arcsine with single rotation) is as bad as it is... and was moved to note that in the hope of saying other people effort discovering that for themselves. — , Sep 24 '14 at 09:40

Federico · Answer 3 · 2014-09-23T11:55:10.277

The additional source I linked at the end of the question apparently contains the solution.

The proposed code can be reduced to the following:

#define M_PI_2_32    1.57079632F
#define SQRT2_2      7.071067811865476e-001F /* sin(45°) = cos(45°) = sqrt(2)/2 */

FLOAT32 angles[] = {
    7.8539816339744830962E-01F, 4.6364760900080611621E-01F, 2.4497866312686415417E-01F, 1.2435499454676143503E-01F,
    6.2418809995957348474E-02F, 3.1239833430268276254E-02F, 1.5623728620476830803E-02F, 7.8123410601011112965E-03F,
    3.9062301319669718276E-03F, 1.9531225164788186851E-03F, 9.7656218955931943040E-04F, 4.8828121119489827547E-04F,
    2.4414062014936176402E-04F, 1.2207031189367020424E-04F, 6.1035156174208775022E-05F, 3.0517578115526096862E-05F,
    1.5258789061315762107E-05F, 7.6293945311019702634E-06F, 3.8146972656064962829E-06F, 1.9073486328101870354E-06F,
    9.5367431640596087942E-07F, 4.7683715820308885993E-07F, 2.3841857910155798249E-07F, 1.1920928955078068531E-07F,
    5.9604644775390554414E-08F, 2.9802322387695303677E-08F, 1.4901161193847655147E-08F, 7.4505805969238279871E-09F,
    3.7252902984619140453E-09F, 1.8626451492309570291E-09F, 9.3132257461547851536E-10F, 4.6566128730773925778E-10F};

FLOAT32 arcsin_cordic(FLOAT32 t)
{            
    INT32 i;
    INT32 j;
    INT32 flip;
    FLOAT32 poweroftwo;
    FLOAT32 sigma;
    FLOAT32 sign_or;
    FLOAT32 theta;
    FLOAT32 x1;
    FLOAT32 x2;
    FLOAT32 y1;
    FLOAT32 y2;

    flip       = 0; 
    theta      = 0.0F;
    x1         = 1.0F;
    y1         = 0.0F;
    poweroftwo = 1.0F;

    /* If the angle is small, use the small angle approximation */
    if ((t >= -0.002F) && (t <= 0.002F))
    {
        return t;
    }

    if (t >= 0.0F) 
    {
        sign_or = 1.0F;
    }
    else
    {
        sign_or = -1.0F;
    }

    /* The inv_sqrt() is the famous Fast Inverse Square Root from the Quake 3 engine
       here used with 3 (!!) Newton iterations */
    if ((t >= SQRT2_2) || (t <= -SQRT2_2))
    {
        t =  1.0F/inv_sqrt(1-t*t);
        flip = 1;
    }

    if (t>=0.0F) 
    {
        sign_or = 1.0F;
    }
    else
    {
        sign_or = -1.0F;
    }

    for ( j = 0; j < 32; j++ ) 
    {
        if (y1 > t)
        {
            sigma = -1.0F;
        }
        else
        {
            sigma = 1.0F;
        }

        /* Here a double iteration is done */
        x2 =                       x1  - (sigma * poweroftwo * y1);
        y2 = (sigma * poweroftwo * x1) +                       y1;

        x1 =                       x2  - (sigma * poweroftwo * y2);
        y1 = (sigma * poweroftwo * x2) +                       y2;

        theta  += 2.0F * sigma * angles[j];

        t *= (1.0F + poweroftwo * poweroftwo);

        poweroftwo *= 0.5F;
    }

    /* Remove bias */
    theta -= sign_or*4.85E-8F;

    if (flip)
    {
        theta = sign_or*(M_PI_2_32-theta);
    }

    return theta;
}

The following is to be noted:

It is a "Double-Iteration" CORDIC implementation.
The angles table thus differs in construction from the old table.
And the computation is done in floating point notation, this will cause a major increase in computation time on the target hardware.
A small bias is present in the output, removed via the theta -= sign_or*4.85E-8F; passage.

The following picture shows the absolute (left) and relative errors (right) of the old implementation (top) vs the implementation contained in this answer (bottom).

The relative error is obtained only by dividing the CORDIC output with the output of the built-in math.h implementation. It is plotted around 1 and not 0 for this reason.

The peak relative error (when not dividing by zero) is 1.0728836e-006.

The average relative error is 2.0253509e-007 (almost in accordance to 32 bit accuracy).

enter image description here

I find (so far) that the double rotate does a good job, using a 32-bit fixed point, but with 29 fraction bits (not 30, as before). Dealing with small arguments ensures that asin(0) returns 0 ! Running the code on abs(arg) and putting the sign back at the end ensures that the result is symetrical. The "bias" voodoo, I don't understand, and I don't think helps. The problem is that as the argument approaches 1, the accuracy drops off... if calculating sqrt((1-arg)/2) for values > 0.75 (say), then that can be improved. — , Sep 23 '14 at 23:51
@gmch you managed to get it in fixed point? care to share? The "voodoo" is measurable. Remove the correction and you will have a measurable bias near `0` (not 1). There is no more problem of accuracy at `1`, read the answer, I mirror the solution around `45°` -> the worst accuracy is around `45°`. — Federico, Sep 24 '14 at 06:27
Having looked more closely at the absolute errors, I do see a small bias, averaging -2 ULPs for arguments < 0.5. So, I agree it helps (given the right value !) to balance the error for the smaller arguments. In relative error terms, yes the problem is for small arguments -- particularly < 0.05. For fixed-point I think the absolute error is key, and in absolute error terms it's the large arguments where the error is worst, particularly > 0.99. So, it depends on what matters to you, really. — , Sep 24 '14 at 11:44
@gmch the largest argument possible after the inclusion of the "mirroring" is `sqrt(2)/2` that is about `0.707`, not even close to `0.99`. Also 32bit have an accuracy of about `1E-7`, the absolute error for larger values will always be greater, irrispective of algorithm used. — Federico, Sep 24 '14 at 11:49
My Cody&Waite suggests doing `pi/2 - 2*asin(sqrt((1-x)/2))`... but doing the sqrt() looks like a pain :-( I assume you are wanting to do CORDIC because in your application you wish to avoid Floating Point. I guess there's some level of accuracy which is good enough... without knowing what the requirement is, there's not much more I can add. — , Sep 24 '14 at 19:15

score 0 · Answer 4 · edited May 16 '21 at 19:31

For convergence of iterative process it is necessary that any "wrong" i-th iteration could be "corrected" in the subsequent (i+1)-th, (i+2)-th, (i+3)-th, etc. etc. iterations. Or, in other words, at least a half of the "wrong" i-th iteration could be corrected in the next (i+1)-th iteration. For atan(1/2^i) this condition is satisfied, i.e.:

atan(1/2^(i+1)) > 1/2*atan(1/2^i)

Read more at http://cordic-bibliography.blogspot.com/p/double-iterations-in-cordic.html and: http://baykov.de/CORDIC1972.htm

(note I'm the author of those pages)

CORDIC Arcsine implementation fails

4 Answers4