
I have implemented some approximations of trigonometric functions (sin, cos, arctan), computed in single precision (32-bit floating point) in C. They are accurate to within about ±2 ulp.

My target device does not support any <cmath> or <math.h> functions. It does not provide an FMA, but it has a MAC ALU. The ALU and LU compute in 32-bit format.

My arctan approximation is actually a modified version of njuffa's approximation, which approximates arctan over the full range. The sine and cosine functions are accurate to within 2 ulp on the range [-pi, pi].

I am now aiming to support a larger input range (as large as possible, ideally [FLT_MIN, FLT_MAX]) for sine and cosine, which leads me to argument reduction.

I'm currently reading different papers, such as "Argument Reduction for Huge Arguments: Good to the Last Bit" by K. C. Ng, or the paper about this new argument reduction algorithm, but I wasn't able to derive an implementation from them.

I also want to mention two Stack Overflow questions that refer to related problems: there is an approach with MATLAB and C++ which is based on the first paper I linked. It actually uses MATLAB and cmath functions, and it limits the input to [0, 20000]. The other one was already mentioned in the comments; it is an approach to implementing sin and cos in C, using various C libraries which are not available to me. Since both posts are already several years old, there might be some newer findings.

It seems like the algorithm mostly used in this case is to store 2/pi accurate to the needed number of bits, in order to compute the modulo calculation accurately while simultaneously avoiding cancellation. My device does not provide a large DMEM, which means large look-up tables with hundreds of bits are not possible. This procedure is actually described on page 70 of this reference, which by the way provides a lot of useful information about floating-point math.
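
For illustration, this is the naive single-constant reduction that such a scheme replaces (a sketch only; it uses rintf(), which my target lacks). For large x, the 24-bit rounding error in pi/2 is scaled by j, and the subtraction cancels all leading bits:

/* naive reduction of x towards [-pi/4, pi/4]: fails for large x */
float naive_red (float x)
{
    float j = rintf (x * 6.36619772e-1f); /* x / (pi/2), rounded to nearest */
    return x - j * 1.57079633e+0f;        /* pi/2 is correct to only 24 bits;
                                             its error, multiplied by j, plus
                                             cancellation destroys the result */
}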

So my question is: is there another efficient way to reduce the arguments for sine and cosine that maintains single precision while avoiding large LUTs? The papers mentioned above focus on double precision and use up to 1000 digits, which is not suitable for my use case.

I actually haven't found any implementation in C, nor one aimed at single-precision calculation. I would be grateful for any sort of hints/links/examples...

Dexter S
  • I admit this is beyond my expertise and I am NOT saying this is a duplicate. But the question seemed familiar to me (wrongly or rightly...) and I found this one with similar keywords. Maybe there is something helpful for you: https://stackoverflow.com/questions/50082227/trigonometric-argument-reduction-reduction-modulo-2π Also, it mentions `2*pi` while you have `2/pi` here; you should check for typos, as I lack the technical context. – Yunnosch Sep 25 '20 at 06:17
  • @Yunnosch thank you for the reference. I've seen this before and forgot to mention it; I have edited my question. To address pi/2 versus 2*pi: this depends on the desired output. There are two commonly used intervals for sine/cosine approximations: [-pi/4, pi/4] --> x modulo pi/2, and [-pi, pi] or [0, 2*pi] --> x modulo 2*pi. Both variants still suffer from the same problem of loss of accuracy due to missing bits or cancellation. – Dexter S Sep 25 '20 at 06:35
  • Even though it is technically beyond me, have my upvote for thoroughness of research and style of handling (my) feedback. – Yunnosch Sep 25 '20 at 06:38
  • @DexterS This [answer](https://stackoverflow.com/a/30465751/780717) likely has everything you need. – njuffa Sep 25 '20 at 07:05
  • @njuffa thanks for the hint. That might actually be a good approach. As I mentioned above, my ALU does not provide double precision, neither for floating-point nor for integer calculations. I will check whether these restrictions decrease the accuracy of this algorithm significantly... – Dexter S Sep 25 '20 at 07:16
  • @DexterS Yes, expect to make adjustments, as it is unlikely that any answer on Stackoverflow will provide code that is custom tailored to your (unspecified) platform and its capabilities and restrictions. Is there a publicly available specification for your processor somewhere? – njuffa Sep 25 '20 at 07:24
  • @DexterS You might also want to have a look at [this question](https://stackoverflow.com/questions/42455143/sine-cosine-modular-extended-precision-arithmetic). I would strongly suggest specifying in the question exactly what range of inputs you plan to handle. – njuffa Sep 25 '20 at 07:35
  • @njuffa I expected that I would have to make adjustments; I was just explaining my next steps :). The processor is self-developed and therefore not publicly available. The linked post is another interesting approach, thank you! I plan to handle as wide an input range as I can, if possible up to [FLT_MIN, FLT_MAX]. I'll edit my question. – Dexter S Sep 25 '20 at 09:18
  • Re “large look-up tables with hundreds of bits are not possible”: It is not possible to perform a correct reduction without well over a hundred bits of π available in some form. The largest number that must be reduced is near 2^128, so you need to, in effect, subtract more than 2^126·π to get the remainder, and then you need the remainder to some number of bits. – Eric Postpischil Sep 25 '20 at 14:00
  • @DexterS: Do you have to use radians? The simplest case is using "fractions of a circle" (where the value 1.0 represents 360 degrees) with a "0.32" fixed-point unsigned integer format, because "wrap on overflow" means argument reduction happens for free (with guaranteed no precision loss), and you get 32 bits of precision (significantly better than a single-precision float). Of course, after argument reduction you can convert to/from radians (if radians are actually helpful) without worrying about the error (from the "difference between irrational number and representable number") being multiplied. – Brendan Sep 28 '20 at 05:01
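
A minimal sketch of the "fractions of a circle" idea from the last comment (the helper name and layout are illustrative, not from any posted code): the angle is kept as an unsigned 0.32 fixed-point fraction of a full turn, so "modulo 2*pi" is the natural uint32_t wrap-around and costs nothing.

#include <stdint.h>

/* reduce a 0.32 fixed-point "turns" angle to [-pi/4, +pi/4] radians
   plus a quadrant index; the wrap-around arithmetic is exact */
float turns_red (uint32_t t, int *quadrant)
{
    uint32_t q = (t + 0x20000000u) >> 30;    /* nearest quadrant, wraps safely */
    int32_t rem = (int32_t)(t - (q << 30));  /* signed remainder, |rem| <= 2^29 */
    *quadrant = (int)(q & 3);
    return (float)rem * 1.46291808e-9f;      /* pi/2 * 0x1.0p-30 */
}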

2 Answers


The following code is based on a previous answer in which I demonstrated how to perform a fairly accurate argument reduction for trigonometric functions by using the Cody-Waite method of split constants for arguments small in magnitude, and the Payne-Hanek method for arguments large in magnitude. For details on the Payne-Hanek algorithm see that answer; for details on the Cody-Waite algorithm see this previous answer of mine.

Here I have made the adjustments necessary to accommodate the restrictions of the asker's platform: no 64-bit types are supported, fused multiply-add is not supported, and helper functions from math.h are not available. I am assuming that float maps to IEEE-754 binary32 format, and that there is a way to re-interpret such a 32-bit float as a 32-bit unsigned integer and vice versa. I have implemented this re-interpretation via the standard portable idiom, that is, by using memcpy(), but other methods may be chosen as appropriate for the unspecified target platform, such as inline assembly, machine-specific intrinsics, or volatile unions.
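
For example, on a platform where type punning through a union is known to be well-defined, the helpers might be written as follows; this is a sketch under that platform assumption, not part of the tested code below:

#include <stdint.h>

typedef union {
    float f;
    uint32_t u;
} float_bits_t;

/* union-based alternative to the memcpy() idiom used below */
uint32_t float_as_uint32_union (float a)
{
    float_bits_t b;
    b.f = a;
    return b.u;
}

float uint32_as_float_union (uint32_t a)
{
    float_bits_t b;
    b.u = a;
    return b.f;
}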

Since this code is basically a port of my previous code to a more restrictive environment, it perhaps lacks the elegance of a de novo design specifically targeted at that environment. I have basically replaced the frexp() helper function from math.h with some bit twiddling, emulated 64-bit integer computation with pairs of 32-bit integers, replaced the double-precision computation with 32-bit fixed-point computation (which worked much better than I had anticipated), and replaced all FMAs with the unfused equivalent.
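
The pair-of-uint32 emulation follows the standard carry-propagation idiom; the code below inlines the same logic that this hypothetical helper spells out (the helper itself is for illustration only):

#include <stdint.h>

/* add the 64-bit quantity ahi:alo to the 64-bit accumulator hi:lo */
void add64 (uint32_t *hi, uint32_t *lo, uint32_t ahi, uint32_t alo)
{
    uint32_t t = *lo + alo;       /* low-word add, modulo 2^32 */
    *hi = *hi + ahi + (t < alo);  /* t < alo iff the low-word add wrapped */
    *lo = t;
}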

Re-working the Cody-Waite portion of the argument reduction took quite a bit of work. Without FMA available, we clearly need to ensure a sufficient number of trailing zero bits in the constituent parts of the constant π/2 (except the least significant one) to make sure the products are exact. I spent several hours experimentally puzzling out a particular split that delivers accurate results but also pushes the switchover point to the Payne-Hanek method as high as possible.
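
To illustrate the principle only (the constants in the code below were tuned by hand, not generated this way), a mechanical split with forced trailing zeros could be produced on a host machine like this:

#include <stdint.h>
#include <string.h>

/* clear the 'drop' least significant bits of a float's significand */
static float trunc_sig (float x, int drop)
{
    uint32_t u;
    memcpy (&u, &x, sizeof u);
    u &= ~((1u << drop) - 1);
    memcpy (&x, &u, sizeof x);
    return x;
}

/* mechanical three-stage split of pi/2; runs on the host only, the
   target never sees a double */
void split_pio2 (float *hi, float *mid, float *lo)
{
    double c = 1.5707963267948966;   /* pi/2 */
    *hi  = trunc_sig ((float)c, 12); /* 12 trailing zero bits */
    c   -= (double)*hi;
    *mid = trunc_sig ((float)c, 12);
    c   -= (double)*mid;
    *lo  = (float)c;                 /* last stage absorbs the remainder */
}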

When USE_FMA = 1 is specified, the output of the test app, when compiled with a high-quality math library, should look similar to this:

Testing sinf ...  PASSED. max ulp err = 1.493253  diffsum = 337633490
Testing cosf ...  PASSED. max ulp err = 1.495098  diffsum = 342020968

With USE_FMA = 0 the accuracy changes slightly for the worse:

Testing sinf ...  PASSED. max ulp err = 1.498012  diffsum = 359702532
Testing cosf ...  PASSED. max ulp err = 1.504061  diffsum = 364682650

The diffsum output is a rough indicator of overall accuracy: 337633490 / 2^32 ≈ 0.08, so here roughly 90% of all inputs result in a correctly rounded single-precision response, and the remainder differ from it by 1 ulp.

Note that it is important to compile the code with the strictest floating-point settings and highest degree of adherence to IEEE-754 the compiler offers. For the Intel compiler that I used to develop and test this code, that can be achieved by compiling with /fp:strict. Also, the quality of the math library used for reference is crucial for accurate assessment of the ulp error of this single-precision code. The Intel compiler comes with a math library that provides double-precision elementary math functions with just slightly over 0.5 ulp error in the HA (high accuracy) variant. Use of a multi-precision reference library may be preferable but would have slowed me down too much here.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>   // for memcpy()
#include <math.h>     // for test purposes, and when PORTABLE=1 or USE_FMA=1

#define USE_FMA   (0) // use fmaf() calls for arithmetic
#define PORTABLE  (0) // allow helper functions from math.h
#define HAVE_U64  (0) // 64-bit integer type available
#define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0

#if USE_FMA
#define SIN_RED_SWITCHOVER  (117435.992f)
#define COS_RED_SWITCHOVER  (71476.0625f)
#define MAX_DIFF            (1)
#else // USE_FMA
#if CW_STAGES == 2
#define SIN_RED_SWITCHOVER  (3.921875f)
#define COS_RED_SWITCHOVER  (3.921875f)
#elif CW_STAGES == 3
#define SIN_RED_SWITCHOVER  (201.15625f)
#define COS_RED_SWITCHOVER  (142.90625f)
#endif // CW_STAGES
#define MAX_DIFF            (2)
#endif // USE_FMA

/* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* Compute the upper 32 bits of the product of two unsigned 32-bit integers */
#if HAVE_U64
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
#else // HAVE_U64
/* Henry S. Warren, "Hacker's Delight, 2nd ed.", Addison-Wesley 2012. Fig. 8-2 */
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a;
    uint16_t a_hi = a >> 16;
    uint16_t b_lo = (uint16_t)b;
    uint16_t b_hi = b >> 16;
    uint32_t p0 = (uint32_t)a_lo * b_lo;
    uint32_t p1 = (uint32_t)a_lo * b_hi;
    uint32_t p2 = (uint32_t)a_hi * b_lo;
    uint32_t p3 = (uint32_t)a_hi * b_hi;
    uint32_t t = (p0 >> 16) + p1;
    return (t >> 16) + (((uint32_t)(uint16_t)t + p2) >> 16) + p3;
}
#endif // HAVE_U64

/* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
const uint32_t two_over_pi_f [] = 
{
    0x28be60db,
    0x9391054a,
    0x7f09d5f4,
    0x7d4d3770,
    0x36d8a566,
    0x4f10e410
};

/* Reduce a trig function argument using the slow Payne-Hanek method */
float trig_red_slowpath_f (float a, int *quadrant)
{
    uint32_t ia, hi, mid, lo, tmp, i, l, h, plo, phi;
    int32_t e, q;
    float r;

#if PORTABLE
    ia = (uint32_t)(fabsf (frexpf (a, &e)) * 4.29496730e+9f); // 0x1.0p32
#else // PORTABLE
    ia = ((float_as_uint32 (a) & 0x007fffff) << 8) | 0x80000000; // significand incl. hidden bit, left-aligned
    e = ((float_as_uint32 (a) >> 23) & 0xff) - 126; // exponent e such that |a| = 0.ia * 2^e
#endif // PORTABLE
    
    /* compute product x * 2/pi in 2.62 fixed-point format */
    i = (uint32_t)e >> 5;
    e = (uint32_t)e & 31;

    hi  = i ? two_over_pi_f [i-1] : 0;
    mid = two_over_pi_f [i+0];
    lo  = two_over_pi_f [i+1];
    tmp = two_over_pi_f [i+2];
 
    if (e) {
        hi  = (hi  << e) | (mid >> (32 - e));
        mid = (mid << e) | (lo  >> (32 - e));
        lo  = (lo  << e) | (tmp >> (32 - e));
    }

    /* compute 64-bit product phi:plo */
    phi = 0;
    l = ia * lo;
    h = umul32_hi (ia, lo);
    plo = phi + l;
    phi = h + (plo < l);
    l = ia * mid;
    h = umul32_hi (ia, mid);
    plo = phi + l;
    phi = h + (plo < l);
    l = ia * hi;
    phi = phi + l;

    /* split fixed-point result into integer and fraction portions */
    q = phi >> 30;               // integral portion = quadrant<1:0>
    phi = phi & 0x3fffffff;      // fraction
    if (phi & 0x20000000) {      // fraction >= 0.5
        phi = phi - 0x40000000;  // fraction - 1.0
        q = q + 1;
    }

    /* compute remainder of x / (pi/2) */
#if USE_FMA
    float phif, plof, chif, clof, thif, tlof;
    phif = 1.34217728e+8f * (float)(int32_t)(phi & 0xffffffe0); // 0x1.0p27
    plof = (float)((plo >> 5) | (phi << (32-5)));
    thif = phif + plof;
    plof = (phif - thif) + plof;
    phif = thif;
    chif =  1.08995894e-17f; //  0x1.921fb6p-57 // (1.5707963267948966 * 0x1.0p-57)_hi 
    clof = -3.03308686e-25f; // -0x1.777a5cp-82 // (1.5707963267948966 * 0x1.0p-57)_lo
    thif = phif * chif;
    tlof = fmaf (phif, chif, -thif);
    tlof = fmaf (phif, clof, tlof);
    tlof = fmaf (plof, chif, tlof);
    r = thif + tlof;
#else // USE_FMA
    /* record sign of fraction */
    uint32_t s = phi & 0x80000000;
    
    /* take absolute value of fraction */
    if ((int32_t)phi < 0) {
        phi = ~phi;
        plo = 0 - plo;
        phi += (plo == 0);
    }
    
    /* normalize fraction */
    e = 0;
    while ((int32_t)phi > 0) {
        phi = (phi << 1) | (plo >> 31);
        plo = plo << 1;
        e--;
    }
    
    /* multiply 32 high-order bits of fraction with pi/2 */
    phi = umul32_hi (phi, 0xc90fdaa2); // (uint32_t)rint(PI/2 * 2**31)
    
    /* normalize product */
    if ((int32_t)phi > 0) {
        phi = phi << 1;
        e--;
    }

    /* round and convert to floating point */
    uint32_t ri = s + ((e + 128) << 23) + (phi >> 8) + ((phi & 0xff) > 0x7e);
    r = uint32_as_float (ri);
#endif // USE_FMA
    if (a < 0.0f) {
        r = -r;
        q = -q;
    }

    *quadrant = q;
    return r;
}

/* Argument reduction for trigonometric functions that reduces the argument
   to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns 
   -0.0f for an input of -0.0f 
*/
float trig_red_f (float a, float switch_over, int *q)
{    
    float j, r;

    if (fabsf (a) > switch_over) {
        /* Payne-Hanek style reduction. M. Payne and R. Hanek, "Radian reduction
           for trigonometric functions". SIGNUM Newsletter, 18:19-24, 1983
        */
        r = trig_red_slowpath_f (a, q);
    } else {
        /* Cody-Waite style reduction. W. J. Cody and W. Waite, "Software Manual
           for the Elementary Functions", Prentice-Hall 1980
        */
#if USE_FMA
        j = fmaf (a, 6.36619747e-1f, 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
        j = j - 1.25829120e+7f; // 0x1.8p+23
        r = fmaf (j, -1.57079601e+00f, a); // -0x1.921fb0p+00 // pio2_high
        r = fmaf (j, -3.13916473e-07f, r); // -0x1.5110b4p-22 // pio2_mid
        r = fmaf (j, -5.39030253e-15f, r); // -0x1.846988p-48 // pio2_low
#else // USE_FMA
        j = (a * 6.36619747e-1f + 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
        j = j - 1.25829120e+7f; // 0x1.8p+23
#if CW_STAGES == 2
        r = a - j * 1.57079625e+00f; // 0x1.921fb4p+0  // pio2_high 
        r = r - j * 7.54979013e-08f; // 0x1.4442d2p-24 // pio2_low
#elif CW_STAGES == 3
        r = a - j * 1.57078552e+00f; // 0x1.921f00p+00 // pio2_high
        r = r - j * 1.08043314e-05f; // 0x1.6a8880p-17 // pio2_mid
        r = r - j * 2.56334407e-12f; // 0x1.68c234p-39 // pio2_low
#endif // CW_STAGES
#endif // USE_FMA
        *q = (int)j;
    }
    return r;
}

/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.64196
   Returns -0.0f for an argument of -0.0f
   Polynomial approximation based on T. Myklebust, "Computing accurate 
   Horner form approximations to special functions in finite precision
   arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
*/
float sinf_poly (float a, float s)
{
    float r, t;
#if USE_FMA
    r =              2.86567956e-6f;  //  0x1.80a000p-19 
    r = fmaf (r, s, -1.98559923e-4f); // -0x1.a0690cp-13
    r = fmaf (r, s,  8.33338592e-3f); //  0x1.111182p-07
    r = fmaf (r, s, -1.66666672e-1f); // -0x1.555556p-03
    t = fmaf (a, s, 0.0f); // ensure -0 is passed through
    r = fmaf (r, t, a);
#else // USE_FMA
    r =         2.86567956e-6f; //  0x1.80a000p-19
    r = r * s - 1.98559923e-4f; // -0x1.a0690cp-13
    r = r * s + 8.33338592e-3f; //  0x1.111182p-07
    r = r * s - 1.66666672e-1f; // -0x1.555556p-03
    t = a * s + 0.0f; // ensure -0 is passed through
    r = r * t + a;
#endif // USE_FMA
    return r;
}

/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.87444 */
float cosf_poly (float s)
{
    float r;
#if USE_FMA
    r =              2.44677067e-5f;  //  0x1.9a8000p-16
    r = fmaf (r, s, -1.38877297e-3f); // -0x1.6c0efap-10
    r = fmaf (r, s,  4.16666567e-2f); //  0x1.555550p-05
    r = fmaf (r, s, -5.00000000e-1f); // -0x1.000000p-01
    r = fmaf (r, s,  1.00000000e+0f); //  0x1.000000p+00
#else // USE_FMA
    r =         2.44677067e-5f; //  0x1.9a8000p-16
    r = r * s - 1.38877297e-3f; // -0x1.6c0efap-10
    r = r * s + 4.16666567e-2f; //  0x1.555550p-05
    r = r * s - 5.00000000e-1f; // -0x1.000000p-01
    r = r * s + 1.00000000e+0f; //  0x1.000000p+00
#endif // USE_FMA
    return r;
}

/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a, int i)
{
    float r, s;

    s = a * a;
    r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
    if (i & 2) {
        r = 0.0f - r; // don't change "sign" of NaNs
    }
    return r;
}

/* maximum ulp error with USE_FMA = 1: 1.493253 */
float my_sinf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, SIN_RED_SWITCHOVER, &i);
    r = sinf_cosf_core (r, i);
    return r;
}

/* maximum ulp error with USE_FMA = 1: 1.495098 */
float my_cosf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, COS_RED_SWITCHOVER, &i);
    r = sinf_cosf_core (r, i + 1);
    return r;
}

/* re-interpret bit pattern of an IEEE-754 double as a uint64 */
uint64_t double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err, refi;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
       is < 2^-126, we need to denormalize because of the float type's limited
       exponent range.
    */
    refi = double_as_uint64(ref);
    expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = 0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
    } else {
        j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
    }
    j = j | (refi & 0x8000000000000000ULL);
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void) 
{
    float arg, res, reff;
    uint32_t argi, resi, refi;
    int64_t diff, diffsum;
    double ref, ulp, maxulp;

    printf ("Testing sinf ...  ");
    diffsum = 0;
    maxulp = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        res = my_sinf (arg);
        ref = sin ((double)arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulp = floatUlpErr (res, ref);
        if (ulp > maxulp) {
            maxulp = ulp;
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > MAX_DIFF) {
            printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
            return EXIT_FAILURE;
        }
        diffsum = diffsum + diff;
        argi++;
    } while (argi);
    printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);

    printf ("Testing cosf ...  ");
    diffsum = 0;
    maxulp = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        res = my_cosf (arg);
        ref = cos ((double)arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulp = floatUlpErr (res, ref);
        if (ulp > maxulp) {
            maxulp = ulp;
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > MAX_DIFF) {
            printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
            return EXIT_FAILURE;
        }
        diffsum = diffsum + diff;
        argi++;
    } while (argi);
    printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);
    return EXIT_SUCCESS;
}
njuffa
  • That's actually incredible. Thanks a lot for your time!! I was working on this implementation at the same time as well, but you were obviously a lot faster, even if there are still some things to do... Unfortunately the exact same code did not pass for me; it exits with ``Testing sinf ... error @ 4a2562ae ( 2.70967550e+06): res=b2a411de (-1.91002378e-08) ref=b2a411e0 (-1.91002414e-08)`` I'll review the code in detail next week... – Dexter S Sep 25 '20 at 13:14
  • Edit: the error occurs when running with USE_FMA = 1, PORTABLE = 0 against sin. Is there a reason for testing against sin instead of sinf? – Dexter S Sep 25 '20 at 13:21
  • I see you edited an hour ago. I am already very happy with the approach. I just want to mention that there is still an error: ``Testing sinf ... error @ 30c90fdc ( 1.46291823e-09): res=b0c90fd9 (-1.46291790e-09) ref=30c90fdc ( 1.46291823e-09)`` I'll keep on testing and modifying... – Dexter S Sep 28 '20 at 04:48
  • And another question: I assume that there is no shorter way of performing an accurate argument reduction, since I am very limited in terms of instruction memory...? – Dexter S Sep 28 '20 at 04:54
  • I'm running the sinf test part of this program on an older PC which does not support FMA (it emulates fma) and can't get the procedure to pass unless the diff > MAX_DIFF condition test block is commented out (when commented out, the test passes but the ULP is around 2.4). When the MAX_DIFF condition test is enabled, the process stops @ 7f800000, probably because of -nan and a couple of Infs: ![ulp-test-error.png](https://postimg.cc/HcWqKV0S) – Juha P Mar 22 '22 at 06:58

There's a thread on the Mathematics forum where user J. M. ain't a mathematician introduced an improved Taylor/Padé idea to approximate the cos and sin functions in the range [-pi, pi]; the sine version, translated to C, is sketched below. This approximation is not as fast as the library std::sin() function, but it might be worth checking whether an SSE/AVX/FMA implementation helps enough with the speed.
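
The following is a transcription of the sine version into C, reconstructed from the Julia routine given further below (the function name is mine): the angle is halved until it is small, t = tan(z) is approximated by a [5/4] Padé approximant, the halvings are undone with the tangent double-angle formula, and finally sin(x) = 2t/(1+t^2) with t = tan(x/2).

#include <math.h>  /* fabsf() */

float pade_sine (float x)
{
    float z = 0.5f * x;
    int k = 0;
    while (fabsf (z) > 0.5f) {   /* halve the angle until |z| <= 0.5 */
        z *= 0.5f;
        k++;
    }
    float z2 = z * z;
    float w  = (z / 3.0f) * (z / 3.0f);
    float r  = z * (1.0f + (z2 / 105.0f - 1.0f) * w)
                 / (1.0f + (z2 / 7.0f - 4.0f) * w);  /* r ~= tan(z) */
    while (k > 0) {              /* tan(2a) = 2*tan(a) / (1 - tan(a)^2) */
        r = (2.0f * r) / (1.0f - r * r);
        k--;
    }
    return (2.0f * r) / (1.0f + r * r);  /* sin(x) = 2t/(1+t^2), t = tan(x/2) */
}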

I have not tested the ULP error against the library sin() or cos() functions, but according to the Julia Function Accuracy Test tool it looks like an excellent approximation method (add the code below to the runtest.jl module which belongs to the Julia test suite):

function test_sine(x::AbstractFloat)
    f = 0.5
    z = x * 0.5
    k = 0
    while abs(z) > f
        z *= 0.5
        k += 1
    end
    z2 = z^2
    r = z * (1 + (z2 / 105 - 1) * ((z / 3)^2)) /
            (1 + (z2 / 7 - 4) * ((z / 3)^2))
    while k > 0
        r = (2 * r) / (1 - r * r)
        k -= 1
    end
    return (2 * r) / (1 + r * r)
end

function test_cosine(x::AbstractFloat)
    f = 0.5
    z = x * 0.5
    k = 0
    while abs(z) > f
        z *= 0.5
        k += 1
    end
    z2 = z^2
    r = z * (1 + (z2 / 105 - 1) * ((z / 3)^2)) /
            (1 + (z2 / 7 - 4) * ((z / 3)^2))
    while k > 0
        r = (2 * r) / (1 - r * r)
        k -= 1
    end
    return (1 - r * r) / (1 + r * r)
end

  
pii = 3.141592653589793238462643383279502884

MAX_SIN(n::Val{pii}, ::Type{Float16}) = 3.1415926535897932f0
MAX_SIN(n::Val{pii}, ::Type{Float32}) = 3.1415926535897932f0
#MAX_SIN(n::Val{pii}, ::Type{Float64}) = 3.141592653589793238462643383279502884
MIN_SIN(n::Val{pii}, ::Type{Float16}) = -3.1415926535897932f0
MIN_SIN(n::Val{pii}, ::Type{Float32}) = -3.1415926535897932f0
#MIN_SIN(n::Val{pii}, ::Type{Float64}) = -3.141592653589793238462643383279502884

for (func, base) in (sin=>Val(pii), test_sine=>Val(pii), cos=>Val(pii), test_cosine=>Val(pii))    
    for T in (Float16, Float32)
        xx = range(MIN_SIN(base,T),  MAX_SIN(base,T), length = 10^6);
        test_acc(func, xx)
    end
end

Results for the approximation and for the library sin() and cos() in the range [-pi, pi]:

Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass  Total
Float32 sin   |    1      1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass  Total
Float32 sin   |    1      1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary:     | Pass  Total
Float32 test_sine |    1      1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary:     | Pass  Total
Float32 test_sine |    1      1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass  Total
Float32 cos   |    1      1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass  Total
Float32 cos   |    1      1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary:       | Pass  Total
Float32 test_cosine |    1      1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary:       | Pass  Total
Float32 test_cosine |    1      1
Juha P