
I'm trying to implement a fast atan2(float) with 11 bits of accuracy in the mantissa. The atan2 implementation will be used for image processing, so it would preferably be implemented with SIMD instructions (targeting x86 with SSE2 and ARM with VFPv4/NEON).

For now, I use a Chebyshev polynomial approximation (https://jp.mathworks.com/help/fixedpoint/examples/calculate-fixed-point-arctangent.html).

I'm willing to implement the vectorized code manually. I will use an SSE2 (or later) & NEON wrapper library. I have planned or tried these implementation options:

  • vectorized polynomial approximation
    • Chebyshev polynomial (now implemented)
  • scalar look-up table (+ linear interpolation)
  • vectorized look-up table (is this possible? I'm not familiar with NEON, so I don't know whether it has an instruction like VGATHER)
  • vectorized CORDIC method (this may be slow, since it requires 12+ rotation iterations to reach 11 bits of accuracy)

Any other good ideas?

Kuroda
    Normally you'd vectorize a math function by making a version that takes vector(s) as input. That's usually more efficient than trying to use SIMD to speed up evaluation for a single input. – Peter Cordes Sep 14 '17 at 05:25
  • @PeterCordes Usually you could get the autovectorizer of gcc or clang to give you the vectorization for a well written approximation. It's only ARM NEON that screws this up due to missing ieee754 conformance (missing denormal numbers). – EOF Sep 14 '17 at 19:52
  • @EOF Does that really work, though? Often a scalar implementation would be slightly branchy, or use a table lookup. A LUT might still be a win, but gcc won't auto-vectorize unpack / lookup / repack. For example see this AVX2 `__m256d` log2 function https://stackoverflow.com/a/45898937/224132, which uses a gather instruction and only a small polynomial. – Peter Cordes Sep 14 '17 at 20:00
  • If you're basically going to manually vectorize, it may be easier to just write vectorized code directly, instead of scalar code for the auto-vectorizer to transform. Although you might then have to write it twice for NEON and SSE2 intrinsics, or use a generic wrapper library or GNU C "native" vector syntax, but you might find that the optimal strategy is different for NEON vs. SSE2. – Peter Cordes Sep 14 '17 at 20:03
  • @PeterCordes I will use an SSE2 (or later) & NEON wrapper library. I have planned or tried these implementation options: * polynomial approximation * Chebyshev polynomial (now implemented) * scalar look-up table (+ linear interpolation) * vectorized look-up table (is this possible? I'm not familiar with NEON, so I don't know whether it has an instruction like VGATHER in AVX2) * CORDIC method (this may be slow since it requires 12+ rotation iterations to reach 11 bits of accuracy) – Kuroda Sep 15 '17 at 00:14
  • Ok good, you do want to write a function that takes 2 vectors of `float` inputs, and produces a vector of `atan2` results. See https://stackoverflow.com/questions/18187492/calculate-atan2-using-simd-instructions-and-neon-arm-assemlby for some links and ideas about vectorizing atan2. I get lots of google hits for `simd atan2`, e.g. https://handmade.network/forums/wip/t/1439-sse2_implementations_of_tan,_cot,_atan,_atan2 (IDK about the code quality of that one). Some of them for fast approximations, others for vectorized full-accuracy versions. – Peter Cordes Sep 15 '17 at 00:55
  • Maybe also useful: https://www.dsprelated.com/showarticle/1052.php. – Peter Cordes Sep 15 '17 at 01:23
  • @PeterCordes Thanks for pointing me to those good posts. – Kuroda Sep 15 '17 at 01:32
  • [`vgatherdps`](http://felixcloutier.com/x86/VGATHERDPS:VGATHERQPS.html) is an AVX2 instruction for x86. NEON doesn't have gathers. For NEON, you'd just have to move data from vector registers to integer for use in 4 separate load instructions, and then shuffle those results back together (a sketch of that lane-by-lane approach is shown after these comments). – Peter Cordes Sep 15 '17 at 04:11
  • There's a supposedly-working AVX vectorized `atan2` on codereview: https://codereview.stackexchange.com/questions/174617/avx-assembly-for-fast-atan2-approximation – Peter Cordes Sep 15 '17 at 10:04
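
A minimal sketch of the lane-by-lane gather emulation mentioned in the comment above; the helper name, table pointer and 4-entry index vector are illustrative, not taken from the thread:

#include <arm_neon.h>

/* Sketch: emulate a gather on NEON by extracting each lane index,
   doing four scalar table loads, and packing the results back into
   a vector. "gather_f32" and "table" are illustrative names. */
static inline float32x4_t gather_f32 (const float *table, int32x4_t idx)
{
    float32x4_t r = vdupq_n_f32 (0.0f);
    r = vsetq_lane_f32 (table[vgetq_lane_s32 (idx, 0)], r, 0);
    r = vsetq_lane_f32 (table[vgetq_lane_s32 (idx, 1)], r, 1);
    r = vsetq_lane_f32 (table[vgetq_lane_s32 (idx, 2)], r, 2);
    r = vsetq_lane_f32 (table[vgetq_lane_s32 (idx, 3)], r, 3);
    return r;
}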

1 Answer


The first thing you would want to check is whether your compiler is able to vectorize atan2f (y,x) when applied to an array of floats. This usually requires at least a high optimization level such as -O3 and possibly specifying a relaxed "fast math" mode, in which errno handling, denormals and special inputs such as infinities and NaNs are largely ignored. With this approach, accuracy may be well in excess of what is required, but it may be hard to beat a carefully tuned library implementation with respect to performance.
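
For example (a minimal sketch; the function name and flags are illustrative), a plain elementwise loop over arrays is the shape the vectorizer needs to see:

#include <math.h>

/* Sketch: a simple elementwise loop the auto-vectorizer can work on.
   Whether the atan2f call is actually replaced by a vectorized variant
   depends on the flags (e.g. -O3 plus a fast-math option) and on the
   target math library providing vector versions (e.g. libmvec on glibc). */
void atan2f_array (float *res, const float *y, const float *x, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        res[i] = atan2f (y[i], x[i]);
    }
}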

The next thing to try is to write a simple scalar implementation with sufficient accuracy, and have the compiler vectorize it. Typically this means avoiding anything but very simple branches which can be converted to branchless code through if-conversion. An example of such code is the fast_atan2f() shown below. With the Intel compiler, invoked as icl /O3 /fp:precise /Qvec_report=2 my_atan2f.c, this is vectorized successfully: my_atan2f.c(67): (col. 9) remark: LOOP WAS VECTORIZED. Double checking the generated code through disassembly shows that fast_atan2f() has been inlined and vectorized using SSE instructions of the *ps flavor.

If all else fails, you could translate the code of fast_atan2f() into platform-specific SIMD intrinsics by hand, which should not be all that hard to do; I do not have sufficient experience with them to do it quickly, though.
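
A rough sketch of what such a hand translation to SSE2 intrinsics might look like (untested here; it mirrors the scalar fast_atan2f() below, using mask-based selects since SSE2 has no blend instruction):

#include <emmintrin.h>

/* Sketch only: SSE2 translation of the branchless scalar algorithm below.
   Coefficients are the same as in fast_atan2f(); special cases (0/0,
   infinities, NaNs) are not handled. */
static inline __m128 sse2_atan2f (__m128 y, __m128 x)
{
    const __m128 absmask  = _mm_castsi128_ps (_mm_set1_epi32 (0x7fffffff));
    const __m128 signmask = _mm_castsi128_ps (_mm_set1_epi32 ((int)0x80000000));
    __m128 ax = _mm_and_ps (x, absmask);
    __m128 ay = _mm_and_ps (y, absmask);
    __m128 mx = _mm_max_ps (ay, ax);
    __m128 mn = _mm_min_ps (ay, ax);
    __m128 a  = _mm_div_ps (mn, mx);
    /* minimax polynomial approximation to atan(a) on [0,1] */
    __m128 s = _mm_mul_ps (a, a);
    __m128 c = _mm_mul_ps (s, a);
    __m128 q = _mm_mul_ps (s, s);
    __m128 r = _mm_add_ps (_mm_mul_ps (_mm_set1_ps ( 0.024840285f), q),
                           _mm_set1_ps ( 0.18681418f));
    __m128 t = _mm_add_ps (_mm_mul_ps (_mm_set1_ps (-0.094097948f), q),
                           _mm_set1_ps (-0.33213072f));
    r = _mm_add_ps (_mm_mul_ps (r, s), t);
    r = _mm_add_ps (_mm_mul_ps (r, c), a);
    /* map to full circle; select via and/andnot/or since SSE2 lacks blendv */
    __m128 m = _mm_cmpgt_ps (ay, ax);
    r = _mm_or_ps (_mm_and_ps    (m, _mm_sub_ps (_mm_set1_ps (1.57079637f), r)),
                   _mm_andnot_ps (m, r));
    m = _mm_cmplt_ps (x, _mm_setzero_ps ());
    r = _mm_or_ps (_mm_and_ps    (m, _mm_sub_ps (_mm_set1_ps (3.14159274f), r)),
                   _mm_andnot_ps (m, r));
    m = _mm_cmplt_ps (y, _mm_setzero_ps ());
    r = _mm_xor_ps (r, _mm_and_ps (m, signmask)); /* if (y < 0) r = -r */
    return r;
}

A NEON version would follow the same structure with the corresponding intrinsics (vabsq_f32, vmaxq_f32, vbslq_f32, and so on).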

The scalar code below uses a very simple algorithm: an argument reduction that produces a reduced argument in [0,1], followed by a minimax polynomial approximation, and a final step that maps the result back to the full circle. The core approximation was generated with the Remez algorithm and is evaluated using a second-order Horner scheme. Fused multiply-add (FMA) can be used where available. Special cases involving infinities, NaNs, or 0/0 are not handled, for the sake of performance.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* maximum relative error about 3.6e-5 */
float fast_atan2f (float y, float x)
{
    float a, r, s, t, c, q, ax, ay, mx, mn;
    ax = fabsf (x);
    ay = fabsf (y);
    mx = fmaxf (ay, ax);
    mn = fminf (ay, ax);
    a = mn / mx;
    /* Minimax polynomial approximation to atan(a) on [0,1] */
    s = a * a;
    c = s * a;
    q = s * s;
    r =  0.024840285f * q + 0.18681418f;
    t = -0.094097948f * q - 0.33213072f;
    r = r * s + t;
    r = r * c + a;
    /* Map to full circle */
    if (ay > ax) r = 1.57079637f - r;
    if (x <   0) r = 3.14159274f - r;
    if (y <   0) r = -r;
    return r;
}

/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

float rand_float(void)
{
    volatile union {
        float f;
        unsigned int i;
    } cvt;
    do {
        cvt.i = KISS;
    } while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126)));
    return cvt.f;
}

int main (void)
{
    const int N = 10000;
    const int M = 100000;
    float ref, relerr, maxrelerr = 0.0f;
    float arga[N], argb[N], res[N];
    int i, j;

    printf ("testing atan2() with %d test vectors\n", N*M);

    for (j = 0; j < M; j++) {
        for (i = 0; i < N; i++) {
            arga[i] = rand_float();
            argb[i] = rand_float();
        }

        // This loop should be vectorized
        for (i = 0; i < N; i++) {
            res[i] = fast_atan2f (argb[i], arga[i]);
        }

        for (i = 0; i < N; i++) {
            ref = (float) atan2 ((double)argb[i], (double)arga[i]);
            relerr = (res[i] - ref) / ref;
            if ((fabsf (relerr) > maxrelerr) && 
                (fabsf (ref) >= powf (2.0f, -126))) { // result not denormal
                maxrelerr = fabsf (relerr);
            }
        }
    }

    printf ("max rel err = % 15.8e\n\n", maxrelerr);

    printf ("atan2(1,0)  = % 15.8e\n", fast_atan2f(1,0));
    printf ("atan2(-1,0) = % 15.8e\n", fast_atan2f(-1,0));
    printf ("atan2(0,1)  = % 15.8e\n", fast_atan2f(0,1));
    printf ("atan2(0,-1) = % 15.8e\n", fast_atan2f(0,-1));
    return EXIT_SUCCESS;
}

The output of the program above should look similar to the following:

testing atan2() with 1000000000 test vectors
max rel err =  3.53486939e-005

atan2(1,0)  =  1.57079637e+000
atan2(-1,0) = -1.57079637e+000
atan2(0,1)  =  0.00000000e+000
atan2(0,-1) =  3.14159274e+000
njuffa
  • Godbolt links showing successful vectorization of `fast_atan2f()` inside loop: [ICC 17](https://godbolt.org/g/P7ztE3), [gcc 7.2](https://godbolt.org/g/bt8Xrs), [clang 5.0](https://godbolt.org/g/1WiEa8). I don't know how to set up ARM compilers for auto-vectorization with Neon. – njuffa Sep 16 '17 at 09:13