Best machine-optimized polynomial minimax approximation to arctangent on [-1,1]?

Question

For the simple and efficient implementation of fast math functions with reasonable accuracy, polynomial minimax approximations are often the method of choice. Minimax approximations are typically generated with a variant of the Remez algorithm. Various widely available tools such as Maple and Mathematica have built-in functionality for this. The generated coefficients are typically computed using high-precision arithmetic. It is well-known that simply rounding those coefficients to machine precision leads to suboptimal accuracy in the resulting implementation.

Instead, one searches for closely related sets of coefficients that are exactly representable as machine numbers to generate a machine-optimized approximation. Two relevant papers are:

Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand, "Computing Machine-Efficient Polynomial Approximations", ACM Transactions on Mathematical Software, Vol. 32, No. 2, June 2006, pp. 236–256.

Nicolas Brisebarre and Sylvain Chevillard, "Efficient polynomial L∞-approximations", 18th IEEE Symposium on Computer Arithmetic (ARITH-18), Montpellier (France), June 2007, pp. 169-176.

An implementation of the LLL-algorithm from the latter paper is available as the fpminimax() command of the Sollya tool. It is my understanding that all algorithms proposed for the generation of machine-optimized approximations are based on heuristics, and that it is therefore generally unknown what accuracy can be achieved by an optimal approximation. It is not clear to me whether the availability of FMA (fused multiply-add) for the evaluation of the approximation has an influence on the answer to that question. It seems to me naively that it should.

I am currently looking at a simple polynomial approximation for arctangent on [-1,1] that is evaluated in IEEE-754 single-precision arithmetic, using the Horner scheme and FMA. See function atan_poly() in the C99 code below. For lack of access to a Linux machine at the moment, I did not use Sollya to generate these coefficients, but used my own heuristic that could be loosely described as a mixture of steepest decent and simulated annealing (to avoid getting stuck on local minima). The maximum error of my machine-optimized polynomial is very close to 1 ulp, but ideally I would like the maximum ulp error to be below 1 ulp.

I am aware that I could change my computation to increase the accuracy, for example by using a leading coefficient represented to more than single-precision precision, but I would like to keep the code exactly as is (that is, as simple as possible) adjusting only the coefficients to deliver the most accurate result possible.

A "proven" optimal set of coefficients would be ideal, pointers to relevant literature are welcome. I did a literature search but could not find any paper that advances the state of the art meaningfully beyond Sollya's fpminimax(), and none that examine the role of FMA (if any) in this issue.

// max ulp err = 1.03143
float atan_poly (float a)
{
    float r, s;
    s = a * a;
    r =              0x1.7ed1ccp-9f;
    r = fmaf (r, s, -0x1.0c2c08p-6f);
    r = fmaf (r, s,  0x1.61fdd0p-5f);
    r = fmaf (r, s, -0x1.3556b2p-4f);
    r = fmaf (r, s,  0x1.b4e128p-4f);
    r = fmaf (r, s, -0x1.230ad2p-3f);
    r = fmaf (r, s,  0x1.9978ecp-3f);
    r = fmaf (r, s, -0x1.5554dcp-2f);
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}

// max ulp err = 1.52637
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atan_poly (r);
    if (t > 1.0f) {
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
    }
    r = copysignf (r, a);
    return r;
}

I like your computation of π/2 - r. How did you pick the two factors? Are these values classically known to be good factors of a close approximation of π, or did you find them yourself with some sort of exhaustive search? (PS: sorry I can't help with the actual question) — Pascal Cuoq, Nov 01 '14 at 21:40
@Pascal Cuoq Too bad, I had high hopes you would have some additional or superior insights. This issue has bugged me for years, both in general and for the specific case of arctangent. As for factoring pi/2, I simply did a brute-force search starting at the square root of pi/2 and incrementing one factor in 1-ulp steps. For `float` it only took seconds to arrive at the "optimal" factoring. — njuffa, Nov 01 '14 at 21:51
Are you using the standard rounding mode (to nearest integer)? I believe that given uniformly random inputs, the rounding error (in each `fma()`) is uniformly distributed between -1/2..+1/2 [ULP](http://en.wikipedia.org/wiki/Unit_in_the_last_place), and doesn't allow any "error allocation" tricks, which the other rounding modes (towards zero, in particular) might. — Nominal Animal, Nov 02 '14 at 16:37
Yes, `fmaf()` operates in IEEE-754 round-to-nearest-or-even mode in the code I show above. On some platforms I use (CUDA in particular) it would be trivial to control the rounding mode separately for each `fmaf()` by using intrinsics, but I don't see how that helps in any kind of predictable way. — njuffa, Nov 02 '14 at 16:43
As to the maximum error, you could calculate an upper limit by calculating atan() at high precision -- say, [1:31:480](http://en.wikipedia.org/wiki/Fixed_point_arithmetic#Notation) -- at regular intervals, and comparing your implementation to linearly interpolated value (exploiting the fact that atan(x) > (atan(x-dx)+atan(x+dx))/2 for dx≤x≤1-dx). The maximum error in your error estimate then depends on the number of interval (size of dx), and the precision of your high-precision known samples. — Nominal Animal, Nov 02 '14 at 16:43
The maximum ulp error stated in the comments in the code above was determined by comparing the results for all possible `float` inputs to a high-precision reference. So this is measured, not estimated error. What is not clear is how much lower the maximum ulp error could be with an optimal machine-optimized approximation, nor how to derive that approximation other than by exhaustive search (which I convinced myself is not practical as there are too many possible sets of coefficients to test). — njuffa, Nov 02 '14 at 16:48
Is there a useful polyhedral/integer programming approach here? Each `float` gives you a system of constraints on your 8 coefficients. A statement like `r3 = fmaf(r2, s, c3)` means that `r3`, when suitably scaled, is an integer satisfying `r2*s+c3 - eps <= r3 <= r2*s*c3 + eps` for some `eps` that depends on `r3`'s exponent. This gives you something like a 16-dimensional polyhedron for each test point. Maybe it's small/nice enough that you can project out the `r` variables to get a bunch of inequalities in the `c` variables. — tmyklebu, Nov 02 '14 at 17:07
I'd also note that the universe of possible cubic coefficients is relatively small---a few thousand at most. — tmyklebu, Nov 02 '14 at 17:08
Note that the two articles you cite do not need to mention FMA at all because they do not work at that level. They produce real polynomials the coefficients of which happen to be floating-point values (and that, **as real polynomials**, are good approximations of the target function). In other words, they take into account the fact that the coefficients will have to be represented as floating-point constants, but not the fact that the operations will be floating-point operations. Nothing changes when you add FMA to the equation because operation errors were being ignored in the first place. — Pascal Cuoq, Nov 02 '14 at 17:13
If I understand it correctly, the first paper I referenced is based on a polyhedral approach, and the LLL-algorithm in the second paper recasts the problem as an integer optimization problem. I admit that I do not grasp the mathematics of either paper sufficiently to come up with my own implementations. The use of FMA not only reduces the overall rounding error, but also protects against subtractive cancellation in approximations comprised of terms with alternating signs. It seems this would have some influence on the optimal choice for each coefficient, but I have not actually shown that. — njuffa, Nov 02 '14 at 17:21
@njuffa: Yes, but it's doing something simpler than what I proposed. They're trying to find the best approximating polynomial with integer coefficients without taking into account roundoff error. You *can* account for roundoff by tossing in variables corresponding to the intermediate values of `r` and demanding integrality. (Whether this gives you something computationally prohibitive I'm not too sure.) — tmyklebu, Nov 02 '14 at 17:24
I guess that people working on static analysis of floating point programs (like my ex-colleague Sylvie Putot, now at Ecole Polytechnique in France) are probably knowledgeable about these matters. But I am not! — Basile Starynkevitch, Nov 02 '14 at 17:33
@tmyklebu That is a good observation and something I need to ponder. I think Nominal Animal's suggestion to consider rounding mode merits investigation as well, I should be able to incorporate that into a heuristic approach. The drawback is that not many platforms offer static per-operation rounding modes. — njuffa, Nov 02 '14 at 17:35
@Basile Starynkevitch Pascal Cuoq who participates here is active in that area as well, in fact that what he does for a living from what I understand. — njuffa, Nov 02 '14 at 17:37
Pascal Cuoq is also an ex-colleague. But Syvie is even more numerical processing oriented than Pascal (and of course me). — Basile Starynkevitch, Nov 02 '14 at 17:40
If your goal is simply to compute a < 1ulp approximation quickly, have you considered using Newton's method? The derivative of atan(x) is 1/(1+x^2), which is inexpensive to calculate, and each iteration of Newton's method doubles the number of accurate digits, so you could probably use a Remez polynomial with many fewer terms, plus 1 or 2 N M steps at the end, for fewer flops overall. — j_random_hacker, Nov 02 '14 at 18:29
Actually, my suggestion of Newton's method seems not be practical after all: it seems we need the derivative of tan(x), not atan(x), and the simplest update rule I could get out of that requires computing both cos(x) and sin(x)! — j_random_hacker, Nov 02 '14 at 19:23
@j_random_hacker How about an “accurate table” approach? Can you express `atan(x0 + h) - atan(x0)` for one of several fixed `x0` over [0…1] as a function of `h` that does not require too many constants to evaluate (overall for, say, a hundred well-chosen values of `x0`)? — Pascal Cuoq, Nov 02 '14 at 19:38
@PascalCuoq: I don't follow, sorry. As I understand it, N M finds a zero in a given function, so in your example function, it would always conclude that h = 0. Suppose we want to find atan(a). The setups I tried were f(x) = x - atan(a) (which fails because we need to know atan(a) to begin with!) and f(x) = a - tan(x) (which works, probably, but necessitates computing 2 transcendental functions -- i.e. we make things worse!) — j_random_hacker, Nov 02 '14 at 19:44
@PascalCuoq: I just realised that you probably weren't referring to Newton's method at all! In that case, please disregard my last message. Your approach might be quite useful, but I expect the values of x would have to be evenly spaced in order to find the right table entry in constant time. — j_random_hacker, Nov 02 '14 at 19:47
@j_random_hacker Basic table-based methods use evenly spaced entries so that the right entry can be selected in constant time. Accurate table methods store x0 as well as y0, and pick x0 so that both are closely approximated by floats, but still nearly evenly spaced so that the right table entry can still be picked in constant time. See http://stackoverflow.com/a/23726810/139746 for an example of accurate table. — Pascal Cuoq, Nov 02 '14 at 20:07
I am purposefully not looking at table-based methods such as [Gal's accurate table method](http://en.wikipedia.org/wiki/Gal's_accurate_tables). FLOPS are increasing faster than memory throughput, table access is difficult when applying SIMD vectorization, and a cache access requires more energy than an FMA (by about a factor 10 from what I understand). — njuffa, Nov 02 '14 at 20:14
@njuffa: Isn't `atan(x) ≃ x` with error less than 1 ulp for at least -0.00025 ≤ x ≤ 0.00025? While that range covers just 1/4000th of the -1..1 range, it does cover all subnormals, and almost 45% of (normal) single-precision floating-point values within -1..1 -- considering your outer function, 45% of all unique arguments. Perhaps a worthwhile simplification? Also, you only need 36 bits to represent exactly the rest (0.00025 to 1) of the single-precision values (including your current coefficients); that might help with the integer approaches. — Nominal Animal, Nov 02 '14 at 23:54
I think you are referring to the feasibility of a brute-force search? My back-of-the-envelope calculation for that was: I need to look at seven binades for each coefficient set, or roughly 2^26 results. The search space is a "hyper cone" requiring about 20 bits across all coefficients. So a total of 2^46 results to check. A quick speed test showed that I could process about 2^40 results per day on my machine. I concluded that brute-force search isn't practical. — njuffa, Nov 03 '14 at 01:23
How did you get the set of possible polynomials down to 2^20? It sounds like you're doing something more intelligent than you've let on. — tmyklebu, Nov 03 '14 at 02:00
@tmyklebu: I used a rule of thumb based on observing how wide the hypercone of close-to-optimal candidates typically is. E.g., have the largest coefficient wiggle +/-1 ulp, the second-largest coefficient wiggle +/-2 ulp, etc. The actual "widening factor" is normally larger than 2 and depends on the sequence of coefficients. Solutions outside the hypercone are practically impossible. One therefore searches exhaustively only inside of it. For some functions, the exhaustive search inside the cone seems feasible, e.g. for single-precision `sin()` and `cos()` which require very few coefficients. — njuffa, Nov 03 '14 at 02:37
@tmyklebu: As I recall one of the two referenced papers has a much more stringent methodology for enumerating all possible candidates for an exhaustive search than my hand-wavy approach but I never quite groked it. — njuffa, Nov 03 '14 at 02:40
Hmm, OK. I can prove rigorous bounds on what the coefficients have to look like for a 1-ulp real-arithmetic approximation. But they start with `-0x1.55569ef3f5c3bp-2 <= c3 <= -0x1.5553aef86204bp-2` and go on to stuff like `0x1.a1dd3401c343cp-4 <= c9 <= 0x1.ca54fb24c305fp-4` and `0x1.cba097eee06bdp-11 <= c17 <= 0x1.4107a9c1278a4p-8` that don't really tell you all that much. (This is through the polyhedral machinery.) — tmyklebu, Nov 03 '14 at 03:58
Not just for brute-force search (didn't think of that myself, actually). There just might be a coefficient set whose max error is largest in the excluded 0 ≤ x ≤ 0.00025 range, and the max error in the 0.00025 ≤ x ≤ 1 range is smaller than in any coefficient set covering the full 0 ≤ x ≤ 1 range. Did you record the x's that incur > 1 ulp error? @tmyklebu: Do you have those in single precision? Or would they just boil down to -0x1.5556a0p-2f ≤ c3 ≤ -0x1.5553aep-2f et cetera? — Nominal Animal, Nov 03 '14 at 05:57
@Nominal Animal: For the posted code errors larger than 1 ulp occur for 0.85 < |a| < 0.99. Just 27 occurences in total. `ulp = 1.02852 @ -9.31304693e-001 -0x1.dcd3f8p-1` is the largest positive error, `ulp = -1.03143 @ -9.84267354e-001 -0x1.f7f1e4p-1` the largest negative one. — njuffa, Nov 03 '14 at 06:49
@tmyklebu: Could you list all the bounds, please? I know the bounds define an unfeasibly large solution space (probably close to 2^128) for a brute force search, but it would give a nice starting point. It is particularly interesting there are only 377 possible values for the third degree coefficient (for single-precision), especially comparing to the almost 21 million for the highest coefficient. The math itself is clearly out of my reach, but I'd like to try and see if I can approach the problem from some other direction. — Nominal Animal, Nov 04 '14 at 21:28
@NominalAnimal: I've got something else on my plate for the next couple of days. Bad code to generate the numbers (depends on PPL, GMP, and MPFR; currently configured, I believe, for a 1/3ish-ulp error bound) is at http://www.csclub.uwaterloo.ca/~tmyklebu/foo191.cc. — tmyklebu, Nov 04 '14 at 21:42
@njuffa: Not sure if edits show up as notifications, but I've changed my answer to a faithfully-rounded arctan on [1e-8, 1] that uses only one `fma` more than your scheme does. — tmyklebu, Nov 10 '14 at 03:44
@tmyklebu: I plugged your latest version into my test framework, and for [1e-8f, 1.0f] the maximum error reported is 3.73206 ulps. I changed the coefficients into single-precision ones by adding the `f` suffix, as the entire code sequence should be pure single-precision code. In your own testing, are you using double-precision operations to evaluate the polynomial by any chance, with only the final result getting rounded to single precision? — njuffa, Nov 10 '14 at 06:56
@njuffa: The code I'm testing is the code I posted, verbatim. Perhaps `fmaf` is not working on my platform? Does the code at http://pastebin.com/SxFr7zZz abort for you? — tmyklebu, Nov 10 '14 at 13:28
@tmyklebu: The code at pastebin prints "success". After copying the polynomial into my test framework, it reports a maximum error of 0.99560 ulps for the polynomial, and 1.50322 ulps for `my_atanf()`. If you could massage your answer based on the cleaned-up pastebin code (no global variables, please :-) and refreshing/tightening your description, I would be happy to accept it. Is there anything that can be stated about the optimality of your solution vs the best polynomial of the same degree? — njuffa, Nov 10 '14 at 14:17
@tmyklebu: I notice belatedly that I made a mistake when I first plugged your solution from last night (which is identical to the pastebin code) into my test framework, thus causing the larger error I reported. My apologies. — njuffa, Nov 10 '14 at 14:24
I have no reason to expect that my polynomial is optimal in any sense other than being the first one I found. What would you like me to tighten in the description? — tmyklebu, Nov 10 '14 at 14:56
@tmyklebu: The final version of your answer looks good to me, I have accepted it and awarded the well-deserved bonus. — njuffa, Nov 10 '14 at 15:04
@njuffa: You mention in a comment that a cache access is about 10 times slower or more power-hungry than an FMA. Are you able to show that a scheme like this provides either a speed or a power savings on any modern computer? — tmyklebu, Jan 08 '15 at 03:16
@tmyklebu: What I stated was that an FMA operation requires only about 1/10 the energy of an L1 cache access. I do not have a reference handy, but am fairly certain that I read this within the past year. I am afraid I do not have the equipment to measure this myself as the numbers are in the pJ range, but I will try to find the (or at least a) reference. — njuffa, Jan 08 '15 at 04:02
@tmyklebu: [This slide deck](http://eecs.oregonstate.edu/research/vlsi/teaching/ECE471_WIN14/mark_horowitz_ISSCC_2014.pdf), slide 32 shows single-precision FADD at 0.9 pJ, single-precision FMUL at 4 pJ, and L1 cache access at 20 pJ for a 32 KB cache. That is more like a factor of 5x between FMA and cache access, not 10x, but this is definitely not the source I saw before. Different authors may well have derived different figures depending on specifics of the process they assumed. — njuffa, Jan 08 '15 at 04:22
@njuffa: Thanks. I guess the availability of really huge vector registers in the near future might still make this approach worth refining, even looking at the power numbers on that slide deck. — tmyklebu, Jan 08 '15 at 04:46
@tmyklebu: [This extensive paper](https://smartech.gatech.edu/bitstream/handle/1853/45737/GT-CSE-2012-01.pdf) discusses energy use in the context of GPUs. It contains so many different numbers that I have not been able to quickly track down a nice comparison of energy use for FMA and cache access, but it may be good background reading. — njuffa, Jan 08 '15 at 04:49
@tmyklebu: "It's hard to make predictions, especially about the future". I am not sure which approach you refer to when you say "this approach is worth refining". Wide SIMD operations make table lookup difficult, as is also pointed out by [this work](http://www.yeppp.info/resources/ppam-presentation.pdf). There is a [proper paper](http://link.springer.com/chapter/10.1007/978-3-642-55224-3_9#page-1) (paywalled) by the same authors. My own work over the past 10 years plus published papers suggest that computation-intensive approaches based on minimax polynomials and FMA are the way to go. — njuffa, Jan 08 '15 at 05:05
@njuffa: I think I can get this to be faithfully-rounded everywhere, and I'm debating with myself over whether it's worth investing a few weeks in the near future trying. These pointers to literature, particularly the Dukhan and Vaduc paper, are extremely helpful. — tmyklebu, Jan 08 '15 at 13:25

tmyklebu · Accepted Answer · 2015-01-19T02:07:06.143

The following function is a faithfully-rounded implementation of arctan on [0, 1]:

float atan_poly (float a) {
  float s = a * a, u = fmaf(a, -a, 0x1.fde90cp-1f);
  float r1 =               0x1.74dfb6p-9f;
  float r2 = fmaf (r1, u,  0x1.3a1c7cp-8f);
  float r3 = fmaf (r2, s, -0x1.7f24b6p-7f);
  float r4 = fmaf (r3, u, -0x1.eb3900p-7f);
  float r5 = fmaf (r4, s,  0x1.1ab95ap-5f);
  float r6 = fmaf (r5, u,  0x1.80e87cp-5f);
  float r7 = fmaf (r6, s, -0x1.e71aa4p-4f);
  float r8 = fmaf (r7, u, -0x1.b81b44p-3f);
  float r9 = r8 * s;
  float r10 = fmaf (r9, a, a);
  return r10;
}

The following test harness will abort if the function atan_poly fails to be faithfully-rounded on [1e-16, 1] and print "success" otherwise:

int checkit(float f) {
  double d = atan(f);
  float d1 = d, d2 = d;
  if (d1 < d) d2 = nextafterf(d1, 1.0/0.0);
  else d1 = nextafterf(d1, -1.0/0.0);
  float p = atan_poly(f);
  if (p != d1 && p != d2) return 0;
  return 1;
}

int main() {
  for (float f = 1; f > 1e-16; f = nextafterf(f, -1.0/0.0)) {
    if (!checkit(f)) abort();
  }
  printf("success\n");
  exit(0);
}

The problem with using s in every multiplication is that the polynomial's coefficients do not decay rapidly. Inputs close to 1 result in lots and lots of cancellation of nearly equal numbers, meaning you're trying to find a set of coefficients so that the accumulated roundoff at the end of the computation closely approximates the residual of arctan.

The constant 0x1.fde90cp-1f is a number close to 1 for which (arctan(sqrt(x)) - x) / x^3 is very close to the nearest float. That is, it's a constant that goes into the computation of u so that the cubic coefficient is almost completely determined. (For this program, the cubic coefficient must be either -0x1.b81b44p-3f or -0x1.b81b42p-3f.)

Alternating multiplications by s and u has the effect of reducing the effect of roundoff error in ri upon r{i+2} by a factor of at most 1/4, since s*u < 1/4 whatever a is. This gives considerable leeway in choosing the coefficients of fifth order and beyond.

I found the coefficients with the aid of two programs:

One program plugs in a bunch of test points, writes down a system of linear inequalities, and computes bounds on the coefficients from that system of inequalities. Notice that, given a, one can compute the range of r8 that lead to a faithfully-rounded result. To get linear inequalities, I pretended r8 would be computed as a polynomial in the floats s and u in real-number arithmetic; the linear inequalities constrained this real-number r8 to lie in some interval. I used the Parma Polyhedra Library to handle these constraint systems.
Another program randomly tested sets of coefficients in certain ranges, plugging in first a set of test points and then all floats from 1 to 1e-8 in descending order and checking that atan_poly produces a faithful rounding of atan((double)x). If some x failed, it printed out that x and why it failed.

To get coefficients, I hacked this first program to fix c3, work out bounds on r7 for each test point, then get bounds on the higher-order coefficients. Then I hacked it to fix c3 and c5 and get bounds on the higher-order coefficients. I did this until I had all but the three highest-order coefficients, c13, c15, and c17.

I grew the set of test points in the second program until it either stopped printing anything out or printed out "success". I needed surprisingly few test points to reject almost all wrong polynomials---I count 85 test points in the program.

Here I show some of my work selecting the coefficients. In order to get a faithfully-rounded arctan for my initial set of test points assuming r1 through r8 are evaluated in real arithmetic (and rounded somehow unpleasantly but in a way I can't remember) but r9 and r10 are evaluated in float arithmetic, I need:

-0x1.b81b456625f15p-3 <= c3 <= -0x1.b81b416e22329p-3
-0x1.e71d48d9c2ca4p-4 <= c5 <= -0x1.e71783472f5d1p-4
0x1.80e063cb210f9p-5 <= c7 <= 0x1.80ed6efa0a369p-5
0x1.1a3925ea0c5a9p-5 <= c9 <= 0x1.1b3783f148ed8p-5
-0x1.ec6032f293143p-7 <= c11 <= -0x1.e928025d508p-7
-0x1.8c06e851e2255p-7 <= c13 <= -0x1.732b2d4677028p-7
0x1.2aff33d629371p-8 <= c15 <= 0x1.41e9bc01ae472p-8
0x1.1e22f3192fd1dp-9 <= c17 <= 0x1.d851520a087c2p-9

Taking c3 = -0x1.b81b44p-3, assuming r8 is also evaluated in float arithmetic:

-0x1.e71df05b5ad56p-4 <= c5 <= -0x1.e7175823ce2a4p-4
0x1.80df529dd8b18p-5 <= c7 <= 0x1.80f00e8da7f58p-5
0x1.1a283503e1a97p-5 <= c9 <= 0x1.1b5ca5beeeefep-5
-0x1.ed2c7cd87f889p-7 <= c11 <= -0x1.e8c17789776cdp-7
-0x1.90759e6defc62p-7 <= c13 <= -0x1.7045e66924732p-7
0x1.27eb51edf324p-8 <= c15 <= 0x1.47cda0bb1f365p-8
0x1.f6c6b51c50b54p-10 <= c17 <= 0x1.003a00ace9a79p-8

Taking c5 = -0x1.e71aa4p-4, assuming r7 is done in float arithmetic:

0x1.80e3dcc972cb3p-5 <= c7 <= 0x1.80ed1cf56977fp-5
0x1.1aa005ff6a6f4p-5 <= c9 <= 0x1.1afce9904742p-5
-0x1.ec7cf2464a893p-7 <= c11 <= -0x1.e9d6f7039db61p-7
-0x1.8a2304daefa26p-7 <= c13 <= -0x1.7a2456ddec8b2p-7
0x1.2e7b48f595544p-8 <= c15 <= 0x1.44437896b7049p-8
0x1.396f76c06de2ep-9 <= c17 <= 0x1.e3bedf4ed606dp-9

Taking c7 = 0x1.80e87cp-5, assuming r6 is done in float arithmetic:

0x1.1aa86d25bb64fp-5 <= c9 <= 0x1.1aca48cd5caabp-5
-0x1.eb6311f6c29dcp-7 <= c11 <= -0x1.eaedb032dfc0cp-7
-0x1.81438f115cbbp-7 <= c13 <= -0x1.7c9a106629f06p-7
0x1.36d433f81a012p-8 <= c15 <= 0x1.3babb57bb55bap-8
0x1.5cb14e1d4247dp-9 <= c17 <= 0x1.84f1151303aedp-9

Taking c9 = 0x1.1ab95ap-5, assuming r5 is done in float arithmetic:

-0x1.eb51a3b03781dp-7 <= c11 <= -0x1.eb21431536e0dp-7
-0x1.7fcd84700f7cfp-7 <= c13 <= -0x1.7ee38ee4beb65p-7
0x1.390fa00abaaabp-8 <= c15 <= 0x1.3b100a7f5d3cep-8
0x1.6ff147e1fdeb4p-9 <= c17 <= 0x1.7ebfed3ab5f9bp-9

I picked a point close to the middle of the range for c11 and randomly chose c13, c15, and c17.

EDIT: I've now automated this procedure. The following function is also a faithfully-rounded implementation of arctan on [0, 1]:

float c5 = 0x1.997a72p-3;
float c7 = -0x1.23176cp-3;
float c9 = 0x1.b523c8p-4;
float c11 = -0x1.358ff8p-4;
float c13 = 0x1.61c5c2p-5;
float c15 = -0x1.0b16e2p-6;
float c17 = 0x1.7b422p-9;

float juffa_poly (float a) {
  float s = a * a;
  float r1 =              c17;
  float r2 = fmaf (r1, s, c15);
  float r3 = fmaf (r2, s, c13);
  float r4 = fmaf (r3, s, c11);
  float r5 = fmaf (r4, s, c9);
  float r6 = fmaf (r5, s, c7);
  float r7 = fmaf (r6, s, c5);
  float r8 = fmaf (r7, s, -0x1.5554dap-2f);
  float r9 = r8 * s;
  float r10 = fmaf (r9, a, a);
  return r10;
}

I find it surprising that this code even exists. For coefficients near these, you can get a bound on the distance between r10 and the value of the polynomial evaluated in real arithmetic on the order of a few ulps thanks to the slow convergence of this polynomial when s is near 1. I had expected roundoff error to behave in a way that was fundamentally "untamable" simply by means of tweaking coefficients.

Thanks for providing some guidance for narrowing down the range for each coefficient. Unfortunately your results present a very large search space. I have made some progress after refining my original heuristic approach and plan to post an answer of sorts within the next 24 hours. — njuffa, Nov 08 '14 at 23:54
@njuffa: Looking forward to your answer! I was hoping this change would give a search space with more feasible solutions; wider constraints on the higher-order terms would seem to mean that the low-order bits of the higher-order terms can be used to combat whatever rounding anomalies are left over. — tmyklebu, Nov 09 '14 at 01:05
Very interesting. I obtain a max ulp of 1.3 comparing the ulp against mpfr using the edited coefficients. Do you think you could run your program and optimize the following Float64 coeffecints: http://pastebin.com/Zdyk4wA6 these coefficients give a ulp of 1.3 for Float64, but I think we could do better. — mu7z, Oct 04 '16 at 18:27
@musm: 64-bit floats are a different ballgame. Exhaustive enumeration of the space of 32-bit floating-point numbers is a step in this algorithm that's done repeatedly. — tmyklebu, Oct 06 '16 at 01:06

njuffa · Answer 2 · 2017-10-18T19:14:58.220

5

I pondered the various ideas I received in comments and also ran a few experiments based on that feedback. In the end I decided that a refined heuristic search was the best way forward. I have now managed to reduce the maximum error for atanf_poly() to 1.01036 ulps, with just three arguments exceeding my stated goal of a 1 ulp error bound:

ulp = -1.00829 @ |a| =  9.80738342e-001 0x1.f62356p-1 (3f7b11ab)
ulp = -1.01036 @ |a| =  9.87551928e-001 0x1.f9a068p-1 (3f7cd034)
ulp =  1.00050 @ |a| =  9.99375939e-001 0x1.ffae34p-1 (3f7fd71a)

Based on the manner of generating the improved approximation there is no guarantee that this is a best approximation; no scientific breakthrough here. As the ulp error of the current solution is not yet perfectly balanced, and since continuing the search continues to deliver better approximations (albeit at exponentially increasing time intervals) my guess is that a 1 ulp error bound is achievable, but at the same time we seem to be very close to the best machine-optimized approximation already.

The better quality of the new approximation is the result of a refined search process. I observed that all of the largest ulp errors in the polynomial occur close to unity, say in [0.75,1.0] to be conservative. This allows for a fast scan for interesting coefficient sets whose maximum error is smaller than some bound, say 1.08 ulps. I can then test in detail and exhaustively all coefficient sets within a heuristically chosen hyper-cone anchored at that point. This second step searches for minimum ulp error as the primary goal, and maximum percentage of correctly rounded results as a secondary objective. By using this two-step process across all four cores of my CPU I was able to significantly speed up the search process: I have been able to check about 2²¹ coefficient sets so far.

Based on the range of each coefficient across all "close" solutions I now estimate that the total useful search space for this approximation problem is >= 2²⁴ coefficient sets rather than the more optimistic number of 2²⁰ I threw out before. This seems like a feasible problem to solve for someone who is either very patient or has lots of computational horse-power at their disposal.

My updated code is as follows:

// max ulp err = 1.01036
float atanf_poly (float a)
{
    float r, s;
    s = a * a;
    r =              0x1.7ed22cp-9f;
    r = fmaf (r, s, -0x1.0c2c2ep-6f);
    r = fmaf (r, s,  0x1.61fdf6p-5f);
    r = fmaf (r, s, -0x1.3556b4p-4f);
    r = fmaf (r, s,  0x1.b4e12ep-4f);
    r = fmaf (r, s, -0x1.230ae0p-3f);
    r = fmaf (r, s,  0x1.9978eep-3f);
    r = fmaf (r, s, -0x1.5554dap-2f);
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}

// max ulp err = 1.51871
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atanf_poly (r);
    if (t > 1.0f) {
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
    }
    r = copysignf (r, a);
    return r;
}

Update (after revisiting the issue two-and-a-half years later)

Using T. Myklebust's draft publication as a starting point, I found the arctangent approximation on [-1,1] that has the smallest error to have a maximum error of 0.94528 ulp.

/* Based on: Tor Myklebust, "Computing accurate Horner form approximations 
   to special functions in finite precision arithmetic", arXiv:1508.03211,
   August 2015. maximum ulp err = 0.94528
*/
float atanf_poly (float a)
{
    float r, s;
    s = a * a;                        
    r =              0x1.6d2086p-9f;  //  2.78569828e-3
    r = fmaf (r, s, -0x1.03f2ecp-6f); // -1.58660226e-2
    r = fmaf (r, s,  0x1.5beebap-5f); //  4.24722321e-2
    r = fmaf (r, s, -0x1.33194ep-4f); // -7.49753043e-2
    r = fmaf (r, s,  0x1.b403a8p-4f); //  1.06448799e-1
    r = fmaf (r, s, -0x1.22f5c2p-3f); // -1.42070308e-1
    r = fmaf (r, s,  0x1.997748p-3f); //  1.99934542e-1
    r = fmaf (r, s, -0x1.5554d8p-2f); // -3.33331466e-1
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}

edited Oct 18 '17 at 19:14

answered Nov 09 '14 at 04:33

njuffa

23,970
4
78
130

1

Nice! I wrote some C using AVX (via ``) that finds max/min error in ulps, and the number of correctly rounded float results, at a rate of 7.5 to 8 cycles per argument (measured on i5-4200U). Within the [0.75f,1.0f] range, this should mean about 30 sets of coefficients per second per core per GHz, using about 96 megabytes of precalculated tables (under 1s to precalc using `atan()`). My code also states -1.00829..1.01257 ulps error for these coefficients, with 3534268 (84.263%) correctly rounded. If you think it useful/relevant, I'd be happy to clean it up and post that code here. – Nominal Animal Nov 09 '14 at 11:03
1

@NominalAnimal I am not sure I understand your comment: most sets of arguments can be rejected early, and it is never necessary to compute the maximum error on the full range (just stop as soon as it is larger than what you already have). But anyway, I for one would be very interested in your code. Don't worry about cleaning it up, as long as it compiles, it is useful! – Pascal Cuoq Nov 09 '14 at 11:14
@PascalCuoq: If I understood correctly, njuffa uses `f(C17,C15,C13,..,C5,C3) = maximum_error_in_ULPs` to define a surface (or actually two, one for maximum error below, and one for maximum error above). The optimum solution is then the lowest point in that surface, and you can use e.g. conjugate gradient or other methods to find it. Using AVX2, my function can calculate each point on that surface (if considering only x=[0.75,1]) in about 30 million CPU cycles, using only a single core. I was hoping it would speed up njuffa's approach. I'll post my code. :) – Nominal Animal Nov 10 '14 at 00:44
@Nominal Anmimal This will give people with recent hardware the chance to join the search. I am currently on a platform without FMA-capable hardware, which makes for a rather slow search. Plus my reference `atan()` is probably not the fastest either. – njuffa Nov 10 '14 at 01:26
@njuffa: Are you using a double-precision polynomial reference `arctan()` function? I know library trig functions aren't reliable, but since testing showed I can duplicate your results, I didn't bother to look for anything better. I wonder if, with a polynomial reference arctangent, the difference in partial derivatives wrt. each coefficient at the maximum error x yields the error surface gradient -- or a good enough approximation, at least? – Nominal Animal Nov 10 '14 at 02:22
I am using the double-precision `atan()` function from the highly-accurate Intel math library implementation as a reference and I am building my code with `/fp:strict`. I have access to alternative double-precision implementations of `atan()` that I could use to double check, and there is also the x87 instruction `FPATAN` as a fall-back. So far I have no reason to believe that there are any issues with the Intel math library as a reference. – njuffa Nov 10 '14 at 02:29
@njuffa: Right -- I do trust Intel MKL and AMD MCL, and it seems GLIBC trig functions are pretty precise, too. I just meant that in the past, some C library `atan()` implementations have been known to sometimes lack in precision. – Nominal Animal Nov 10 '14 at 02:34
@njuffa: I have another faithfully-rounded `arctan` for you :) Got the solution in my latest edit using [QSopt_ex](http://www.math.uwaterloo.ca/~bico/qsopt/ex/) to solve the LPs. The solution looked a lot like how I found the first "polynomial," but I automated the diving process and strengthened some inequalities at each level of the dive. I needed to dive in the backtracking process here. – tmyklebu Jan 19 '15 at 01:51
@tmyklebu: Thanks for the pointer to QSopt_ex, I wasn't aware of it. Did you update your existing answer with your latest result ? Never mind, I have spotted your latest update above. I have been busy with optimized polynomial approximations for `atan()` as well as other functions, also with [this question on Math.SE](http://math.stackexchange.com/questions/1084122/diophantine-approximation-with-additional-constraints) – njuffa Jan 19 '15 at 02:06
Yeah, it's at the very end. (I butchered the coefficients the first time and fixed it about 30 seconds ago.) – tmyklebu Jan 19 '15 at 02:07
By "dive in the backtracking process," I meant "backtrack in the diving process." – tmyklebu Jan 19 '15 at 02:14
I verified your latest solution as having a maximum error of 0.99994 ulps. Awesome! I am blown away by your persistence and the fact that you managed to find the solution that I suspected existed all along, but was unable to find with my primitive tools. Have you considered publishing your approach? It should be widely useable and useful. – njuffa Jan 19 '15 at 02:21
Why did you suspect this thing existed? My intuition kept screaming to me that the jaggedness of roundoff error would ruin everything. (And yes, I'm in the process of cleaning up the code and trying to solve a couple other problems. But I'm also in the process of a number of other things. I'll write this up...at some point.) – tmyklebu Jan 19 '15 at 02:25
The longer I was running my heuristic searches, the closer I was getting, with very few outliers outside the 1 ulp bound. In all those solutions the error wasn't completely balance yet, giving me hope there was a "true minimax equiripple" solution still out there. I understand that rounding "jaggedness" as you call it may prevent solutions with perfectly balanced error. If you want to compare solutions, I have results for single-precision sin,cos,tan on [0,pi/4] and a few others. Shoot me an email if interested, this is not the right place for further discussion. – njuffa Jan 19 '15 at 02:33

Pascal Cuoq · Answer 3 · 2014-11-03T21:49:57.077

This is not an answer to the question, but is too long to fit in a comment:

your question is about the optimal choice of coefficients C3, C5, …, C17 in a polynomial approximation to arctangent where you have pinned C1 to 1 and C2, C4, …, C16 to 0.

The title of your question says you are looking for approximations on [-1, 1], and a good reason to pin the even coefficients to 0 is that it is sufficient and necessary for the approximation to be exactly an odd function. The code in your question “contradicts” the title by applying the polynomial approximation only on [0, 1].

If you use the Remez algorithm to look for coefficients C2, C3, …, C8 to a polynomial approximation of arctangent on [0, 1] instead, you may end up with something like the values below:

#include <stdio.h>
#include <math.h>

float atan_poly (float a)
{
  float r, s;
  s = a;
  //  s = a * a;

  r =             -3.3507930064626076153585890630056286726807491543578e-2;
  r = fmaf (r, s, 1.3859776280052980081098065189344699108643282883702e-1);
  r = fmaf (r, s, -1.8186361916440430105127602496688553126414578766147e-1);
  r = fmaf (r, s, -1.4583047494913656326643327729704639191810926020847e-2);
  r = fmaf (r, s, 2.1335202878219865228365738728594741358740655881373e-1);
  r = fmaf (r, s, -3.6801711826027841250774413728610805847547996647342e-3);
  r = fmaf (r, s, -3.3289852243978319173749528028057608377028846413080e-1);
  r = fmaf (r, s, -1.8631479933914856903459844359251948006605218562283e-5);
  r = fmaf (r, s, 1.2917291732886065585264586294461539492689296797761e-7);

  r = fmaf (r, a, a);
  return r;
}

int main() {
  for (float x = 0.0f; x < 1.0f; x+=0.1f)
    printf("x: %f\n%a\n%a\n\n", x, atan_poly(x), atan(x));
}

This has roughly the same complexity as the code in your question—the number of multiplications is similar. Looking at this polynomial, there is no reason in particular to want to pin any coefficient to 0. If we wanted to approximate an odd function over [-1, 1] without pinning the even coefficients, they would automatically come up very small and subject to absorption, and then we would want to pin them to 0, but for this approximation over [0, 1], they don't, so we don't have to pin them.

It could have been better or worse than the odd polynomial in your question. It turns out that it is worse (see below). This quick-and-dirty application of LolRemez 0.2 (code included at the bottom of the question) seems to be, however, good enough to raise the question of the choice of coefficients. I would in particular be curious what happens if you subject the coefficients in this answer to the same “mixture of steepest decent and simulated annealing” optimization step that you applied to get the coefficients in your question.

So, to summarize this remark-posted-as-an-answer, are you sure that you are looking for optimal coefficients C3, C5, …, C17? It seems to me that you are looking for the best sequence of single-precision floating-point operations that produce a faithful approximation to arctangent, and that this approximation does not have to be the Horner form of a degree 17 odd polynomial.

x: 0.000000
0x0p+0
0x0p+0

x: 0.100000
0x1.983e2cp-4
0x1.983e28938f9ecp-4

x: 0.200000
0x1.94442p-3
0x1.94441ff1e8882p-3

x: 0.300000
0x1.2a73a6p-2
0x1.2a73a71dcec16p-2

x: 0.400000
0x1.85a37ap-2
0x1.85a3770ebe7aep-2

x: 0.500000
0x1.dac67p-2
0x1.dac670561bb5p-2

x: 0.600000
0x1.14b1dcp-1
0x1.14b1ddf627649p-1

x: 0.700000
0x1.38b116p-1
0x1.38b113eaa384ep-1

x: 0.800000
0x1.5977a8p-1
0x1.5977a686e0ffbp-1

x: 0.900000
0x1.773388p-1
0x1.77338c44f8faep-1

This is the code that I linked to LolRemez 0.2 in order to optimize the relative accuracy of a degree-9 polynomial approximation of arctangent on [0, 1]:

#include "lol/math/real.h"
#include "lol/math/remez.h"

using lol::real;
using lol::RemezSolver;

real f(real const &y)
{
  return (atan(y) - y) / y;
}

real g(real const &y)
{
  return re (atan(y) / y);
}

int main(int argc, char **argv)
{
  RemezSolver<8, real> solver;
  solver.Run("1e-1000", 1.0, f, g, 50);
  return 0;
}

Heuristically, using a polynomial in `s = a*a` instead of in `a` should ensure more rapid convergence on more inputs and hence a smaller chance of adverse roundoff errors. This might, or might not, matter. — tmyklebu, Nov 03 '14 at 15:53
@tmyklebu One reason I ended up looking at other sets of non-null coefficients is precisely that, in the case of arctangent, just the odd coefficients did not seem to do a very good job of converging to something within 1ULP of the real function. I had never understood why people preferred, say, Padé approximants even when the definition interval did not include a singularity (as it does not here), but now I see that arctangent does not want to be approximated by a polynomial. — Pascal Cuoq, Nov 03 '14 at 16:59
The fact that my code uses a primary approximation interval of [0,1] is an artifact of the implementation of `my_atanf()`. Without the initial `fabsf()` nothing would change numerically, since in `atan_poly()` variable `s` is always positive, however it would complicate the sign-bit handling slightly. In terms of a machine-efficient polynomial, pinning the even coefficients to zero makes intuitive sense, as tmyklebu points out, and this approach is also supported by the literature. If not pinning leads to a more accurate result in the same number of operations, that is perfectly acceptable. — njuffa, Nov 03 '14 at 17:12

score 2 · Answer 4 · answered Nov 10 '14 at 02:14

This is not an answer, but an extended comment too.

Recent Intel CPUs and some future AMD CPUs have AVX2. In Linux, look for avx2 flag in /proc/cpuinfo to see if your CPU supports these.

AVX2 is an extension that allows us to construct and compute using 256-bit vectors -- for example, eight single-precision numbers, or four double-precision numbers -- instead of just scalars. It includes FMA3 support, meaning fused multiply-add for such vectors. Simply put, AVX2 allows us to evaluate eight polynoms in parallel, in pretty much the same time as we evaluate a single one using scalar operations.

The function error8() analyses one set of coefficients, using predefined values of x, comparing against precalculated values of atan(x), and returns the error in ULPs (below and above the desired result separately), as well as the number of results that match the desired floating-point value exactly. These are not needed for simply testing whether a set of coefficients is better than the currently best known set, but allow different strategies on which coefficients to test. (Basically, the maximum error in ULPs forms a surface, and we're trying to find the lowest point on that surface; knowing the "height" of the surface at each point allows us to make educated guesses as to which direction to go -- how to change the coefficients.)

There are four precalculated tables used: known_x for the arguments, known_f for the correctly-rounded single-precision results, known_a for the double-precision "accurate" value (I'm just hoping the library atan() is precise enough for this -- but one should not rely on it without checking!), and known_m to scale the double-precision difference to ULPs. Given a desired range in arguments, the precalculate() function will precalculate these using the library atan() function. (It also relies on IEEE-754 floating-point formats and float and integer byte order being the same, but this is true on the CPUs this code runs on.)

Note that the known_x, known_f, and known_a arrays could be stored in binary files; the known_m contents are trivially derived from known_a. Using the library atan() without verifying it is not a good idea -- but because mine match njuffa's results, I didn't bother to look for a better reference atan().

For simplicity, here is the code in the form of an example program:

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#include <errno.h>

/** poly8() - Compute eight polynomials in parallel.
 * @x - the arguments
 * @c - the coefficients.
 *
 * The first coefficients are for degree 17, the second
 * for degree 15, and so on, down to degree 3.
 *
 * The compiler should vectorize the expression using vfmaddXXXps
 * given an AVX2-capable CPU; for example, Intel Haswell,
 * Broadwell, Haswell E, Broadwell E, Skylake, or Cannonlake;
 * or AMD Excavator CPUs. Tested on Intel Core i5-4200U.
 *
 * Using GCC-4.8.2 and
 *     gcc -O2 -march=core-avx2 -mtune=generic
 * this code produces assembly (AT&T syntax)
 *     vmulps       %ymm0, %ymm0, %ymm2
 *     vmovaps      (%rdi), %ymm1
 *     vmovaps      %ymm0, %ymm3
 *     vfmadd213ps  32(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  64(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  96(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  128(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  160(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  192(%rdi), %ymm2, %ymm1
 *     vfmadd213ps  224(%rdi), %ymm2, %ymm1
 *     vmulps       %ymm2, %ymm1, %ymm0
 *     vfmadd132ps  %ymm3, %ymm3, %ymm0
 *     ret
 * if you omit the 'static inline'.
*/
static inline __v8sf poly8(const __v8sf x, const __v8sf *const c)
{
    const __v8sf xx = x * x;
    return (((((((c[0]*xx + c[1])*xx + c[2])*xx + c[3])*xx + c[4])*xx + c[5])*xx + c[6])*xx + c[7])*xx*x + x;
}

/** error8() - Calculate maximum error in ULPs
 * @x  - the arguments
 * @co - { C17, C15, C13, C11, C9, C7, C5, C3 }
 * @f  - the correctly rounded results in single precision
 * @a  - the expected results in double precision
 * @m  - 16777216.0 raised to the same power of two as @a normalized
 * @n  - number of vectors to test
 * @max_under - pointer to store the maximum underflow (negative, in ULPs) to
 * @max_over  - pointer to store the maximum overflow (positive, in ULPs) to
 * Returns the number of correctly rounded float results, 0..8*n.
*/
size_t error8(const __v8sf *const x, const float *const co,
              const __v8sf *const f, const __v4df *const a, const __v4df *const m,
              const size_t n,
              float *const max_under, float *const max_over)
{
    const __v8sf c[8] = { { co[0], co[0], co[0], co[0], co[0], co[0], co[0], co[0] },
                          { co[1], co[1], co[1], co[1], co[1], co[1], co[1], co[1] },
                          { co[2], co[2], co[2], co[2], co[2], co[2], co[2], co[2] },
                          { co[3], co[3], co[3], co[3], co[3], co[3], co[3], co[3] },
                          { co[4], co[4], co[4], co[4], co[4], co[4], co[4], co[4] },
                          { co[5], co[5], co[5], co[5], co[5], co[5], co[5], co[5] },
                          { co[6], co[6], co[6], co[6], co[6], co[6], co[6], co[6] },
                          { co[7], co[7], co[7], co[7], co[7], co[7], co[7], co[7] } };
    __v4df min = { 0.0, 0.0, 0.0, 0.0 };
    __v4df max = { 0.0, 0.0, 0.0, 0.0 };
    __v8si eqs = { 0, 0, 0, 0, 0, 0, 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        const __v8sf v = poly8(x[i], c);
        const __v4df d0 = { v[0], v[1], v[2], v[3] };
        const __v4df d1 = { v[4], v[5], v[6], v[7] };
        const __v4df err0 = (d0 - a[2*i+0]) * m[2*i+0];
        const __v4df err1 = (d1 - a[2*i+1]) * m[2*i+1];
        eqs -= (__v8si)_mm256_cmp_ps(v, f[i], _CMP_EQ_OQ);
        min = _mm256_min_pd(min, err0);
        max = _mm256_max_pd(max, err1);
        min = _mm256_min_pd(min, err1);
        max = _mm256_max_pd(max, err0);
    }

    if (max_under) {
        if (min[0] > min[1]) min[0] = min[1];
        if (min[0] > min[2]) min[0] = min[2];
        if (min[0] > min[3]) min[0] = min[3];
        *max_under = min[0];
    }

    if (max_over) {
        if (max[0] < max[1]) max[0] = max[1];
        if (max[0] < max[2]) max[0] = max[2];
        if (max[0] < max[3]) max[0] = max[3];
        *max_over = max[0];
    }

    return (size_t)((unsigned int)eqs[0])
         + (size_t)((unsigned int)eqs[1])
         + (size_t)((unsigned int)eqs[2])
         + (size_t)((unsigned int)eqs[3])
         + (size_t)((unsigned int)eqs[4])
         + (size_t)((unsigned int)eqs[5])
         + (size_t)((unsigned int)eqs[6])
         + (size_t)((unsigned int)eqs[7]);
}

/** precalculate() - Allocate and precalculate tables for error8().
 * @x0   - First argument to precalculate
 * @x1   - Last argument to precalculate
 * @xptr - Pointer to a __v8sf pointer for the arguments
 * @fptr - Pointer to a __v8sf pointer for the correctly rounded results
 * @aptr - Pointer to a __v4df pointer for the comparison results
 * @mptr - Pointer to a __v4df pointer for the difference multipliers
 * Returns the vector count if successful,
 * 0 with errno set otherwise.
*/
size_t precalculate(const float x0, const float x1,
                 __v8sf **const xptr, __v8sf **const fptr,
                 __v4df **const aptr, __v4df **const mptr)
{
    const size_t align = 64;
    unsigned int i0, i1;
    size_t       n, i, sbytes, dbytes;
    __v8sf      *x = NULL;
    __v8sf      *f = NULL;
    __v4df      *a = NULL;
    __v4df      *m = NULL;

    if (!xptr || !fptr || !aptr || !mptr) {
        errno = EINVAL;
        return (size_t)0;
    }

    memcpy(&i0, &x0, sizeof i0);
    memcpy(&i1, &x1, sizeof i1);

    i0 ^= (i0 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
    i1 ^= (i1 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;

    if (i1 > i0)
        n = (((size_t)i1 - (size_t)i0) | (size_t)7) + (size_t)1;
    else
    if (i0 > i1)
        n = (((size_t)i0 - (size_t)i1) | (size_t)7) + (size_t)1;
    else {
        errno = EINVAL;
        return (size_t)0;
    }

    sbytes = n * sizeof (float);
    if (sbytes % align)
        sbytes += align - (sbytes % align);

    dbytes = n * sizeof (double);
    if (dbytes % align)
        dbytes += align - (dbytes % align);

    if (posix_memalign((void **)&x, align, sbytes)) {
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&f, align, sbytes)) {
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&a, align, dbytes)) {
        free(f);
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&m, align, dbytes)) {
        free(a);
        free(f);
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }

    if (x1 > x0) {
        float *const xp = (float *)x;
        float        curr = x0;

        for (i = 0; i < n; i++) {
            xp[i] = curr;
            curr = nextafterf(curr, HUGE_VALF);
        }

        i = n;
        while (i-->0 && xp[i] > x1)
            xp[i] = x1;
    } else {
        float *const xp = (float *)x;
        float        curr = x0;

        for (i = 0; i < n; i++) {
            xp[i] = curr;
            curr = nextafterf(curr, -HUGE_VALF);
        }

        i = n;
        while (i-->0 && xp[i] < x1)
            xp[i] = x1;
    }

    {
        const float *const xp = (const float *)x;
        float *const       fp = (float *)f;
        double *const      ap = (double *)a;
        double *const      mp = (double *)m;

        for (i = 0; i < n; i++) {
            const float curr = xp[i];
            int         temp;

            fp[i] = atanf(curr);
            ap[i] = atan((double)curr);

            (void)frexp(ap[i], &temp);
            mp[i] = ldexp(16777216.0, temp);
        }
    }

    *xptr = x;
    *fptr = f;
    *aptr = a;
    *mptr = m;

    errno = 0;
    return n/8;
}

static int parse_range(const char *const str, float *const range)
{
    float fmin, fmax;
    char  dummy;

    if (sscanf(str, " %f %f %c",   &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f:%f %c",   &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f,%f %c",   &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f/%f %c",   &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff %ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff:%ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff,%ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff/%ff %c", &fmin, &fmax, &dummy) == 2) {
        if (range) {
            range[0] = fmin;
            range[1] = fmax;
        }
        return 0;
    }

    if (sscanf(str, " %f %c",  &fmin, &dummy) == 1 ||
        sscanf(str, " %ff %c", &fmin, &dummy) == 1) {
        if (range) {
            range[0] = fmin;
            range[1] = fmin;
        }
        return 0;
    }

    return errno = ENOENT;
}

static int fix_range(float *const f)
{
    if (f && f[0] > f[1]) {
        const float tmp = f[0];
        f[0] = f[1];
        f[1] = tmp;
    }
    return f && isfinite(f[0]) && isfinite(f[1]) && (f[1] >= f[0]);
}

static const char *f2s(char *const buffer, const size_t size, const float value, const char *const invalid)
{
    char  format[32];
    float parsed;
    int   decimals, length;

    for (decimals = 0; decimals <= 16; decimals++) {
        length = snprintf(format, sizeof format, "%%.%df", decimals);
        if (length < 1 || length >= (int)sizeof format)
            break;

        length = snprintf(buffer, size, format, value);
        if (length < 1 || length >= (int)size)
            break;

        if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
            return buffer;

        decimals++;
    }

    for (decimals = 0; decimals <= 16; decimals++) {
        length = snprintf(format, sizeof format, "%%.%dg", decimals);
        if (length < 1 || length >= (int)sizeof format)
            break;

        length = snprintf(buffer, size, format, value);
        if (length < 1 || length >= (int)size)
            break;

        if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
            return buffer;

        decimals++;
    }

    length = snprintf(buffer, size, "%a", value);
    if (length < 1 || length >= (int)size)
        return invalid;

    if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
        return buffer;

    return invalid;
}

int main(int argc, char *argv[])
{
    float xrange[2] = { 0.75f, 1.00f };
    float c17range[2], c15range[2], c13range[2], c11range[2];
    float c9range[2], c7range[2], c5range[2], c3range[2];
    float c[8];

    __v8sf *known_x;
    __v8sf *known_f;
    __v4df *known_a;
    __v4df *known_m;
    size_t  known_n;

    if (argc != 10 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s C17 C15 C13 C11 C9 C7 C5 C3 x\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "Each of the coefficients can be a constant or a range,\n");
        fprintf(stderr, "for example 0.25 or 0.75:1. x must be a non-empty range.\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (parse_range(argv[1], c17range) || !fix_range(c17range)) {
        fprintf(stderr, "%s: Invalid C17 range or constant.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[2], c15range) || !fix_range(c15range)) {
        fprintf(stderr, "%s: Invalid C15 range or constant.\n", argv[2]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[3], c13range) || !fix_range(c13range)) {
        fprintf(stderr, "%s: Invalid C13 range or constant.\n", argv[3]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[4], c11range) || !fix_range(c11range)) {
        fprintf(stderr, "%s: Invalid C11 range or constant.\n", argv[4]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[5], c9range) || !fix_range(c9range)) {
        fprintf(stderr, "%s: Invalid C9 range or constant.\n", argv[5]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[6], c7range) || !fix_range(c7range)) {
        fprintf(stderr, "%s: Invalid C7 range or constant.\n", argv[6]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[7], c5range) || !fix_range(c5range)) {
        fprintf(stderr, "%s: Invalid C5 range or constant.\n", argv[7]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[8], c3range) || !fix_range(c3range)) {
        fprintf(stderr, "%s: Invalid C3 range or constant.\n", argv[8]);
        return EXIT_FAILURE;
    }

    if (parse_range(argv[9], xrange) || xrange[0] == xrange[1] ||
        !isfinite(xrange[0]) || !isfinite(xrange[1])) {
        fprintf(stderr, "%s: Invalid x range.\n", argv[9]);
        return EXIT_FAILURE;
    }

    known_n = precalculate(xrange[0], xrange[1], &known_x, &known_f, &known_a, &known_m);
    if (!known_n) {
        if (errno == ENOMEM)
            fprintf(stderr, "Not enough memory for precalculated tables.\n");
        else
            fprintf(stderr, "Invalid (empty) x range.\n");
        return EXIT_FAILURE;
    }

    fprintf(stderr, "Precalculated %lu arctangents to compare to.\n", 8UL * (unsigned long)known_n);
    fprintf(stderr, "\nC17 C15 C13 C11 C9 C7 C5 C3 max-ulps-under max-ulps-above correctly-rounded percentage cycles\n");
    fflush(stderr);

    {
        const double  percent = 12.5 / (double)known_n;
        size_t        rounded;
        char          c17buffer[64], c15buffer[64], c13buffer[64], c11buffer[64];
        char          c9buffer[64], c7buffer[64], c5buffer[64], c3buffer[64];
        char          minbuffer[64], maxbuffer[64];
        float         minulps, maxulps;
        unsigned long tsc_start, tsc_stop;

        for (c[0] = c17range[0]; c[0] <= c17range[1]; c[0] = nextafterf(c[0], HUGE_VALF))
        for (c[1] = c15range[0]; c[1] <= c15range[1]; c[1] = nextafterf(c[1], HUGE_VALF))
        for (c[2] = c13range[0]; c[2] <= c13range[1]; c[2] = nextafterf(c[2], HUGE_VALF))
        for (c[3] = c11range[0]; c[3] <= c11range[1]; c[3] = nextafterf(c[3], HUGE_VALF))
        for (c[4] = c9range[0]; c[4] <= c9range[1]; c[4] = nextafterf(c[4], HUGE_VALF))
        for (c[5] = c7range[0]; c[5] <= c7range[1]; c[5] = nextafterf(c[5], HUGE_VALF))
        for (c[6] = c5range[0]; c[6] <= c5range[1]; c[6] = nextafterf(c[6], HUGE_VALF))
        for (c[7] = c3range[0]; c[7] <= c3range[1]; c[7] = nextafterf(c[7], HUGE_VALF)) {
            tsc_start = __builtin_ia32_rdtsc();
            rounded = error8(known_x, c, known_f, known_a, known_m, known_n, &minulps, &maxulps);
            tsc_stop = __builtin_ia32_rdtsc();
            printf("%-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %lu %.3f %lu\n",
                   f2s(c17buffer, sizeof c17buffer, c[0], "?"),
                   f2s(c15buffer, sizeof c15buffer, c[1], "?"),
                   f2s(c13buffer, sizeof c13buffer, c[2], "?"),
                   f2s(c11buffer, sizeof c11buffer, c[3], "?"),
                   f2s(c9buffer, sizeof c9buffer, c[4], "?"),
                   f2s(c7buffer, sizeof c7buffer, c[5], "?"),
                   f2s(c5buffer, sizeof c5buffer, c[6], "?"),
                   f2s(c3buffer, sizeof c3buffer, c[7], "?"),
                   f2s(minbuffer, sizeof minbuffer, minulps, "?"),
                   f2s(maxbuffer, sizeof maxbuffer, maxulps, "?"),
                   rounded, (double)rounded * percent,
                   (unsigned long)(tsc_stop - tsc_start));
            fflush(stdout);

        }
    }

    return EXIT_SUCCESS;
}

The code does compile using GCC-4.8.2 on Linux, but might have to be modified for other compilers and/or OSes. (I'd be happy to include/accept edits fixing those, though. I just don't have Windows or ICC myself so I could check.)

To compile this, I recommend

gcc -Wall -O3 -fomit-frame-pointer -march=native -mtune=native example.c -lm -o example

Run without arguments to see usage; or

./example 0x1.7ed24ap-9f -0x1.0c2c12p-6f  0x1.61fdd2p-5f -0x1.3556b0p-4f 0x1.b4e138p-4f -0x1.230ae2p-3f  0x1.9978eep-3f -0x1.5554dap-2f 0.75:1

to check what it reports for njuffa's coefficient set, compared against standard C library atan() function, with all possible x in [0.75, 1] considered.

Instead of a fixed coefficient, you can also use min:max to define a range to scan (scanning all unique single-precision floating-point values). Each possible combination of the coefficients is tested.

Because I prefer decimal notation, but need to keep the values exact, I use the f2s() function to display the floating-point values. It is a simple brute-force helper function, that uses the shortest formatting that yields the same value when parsed back to float.

For example,

./example 0x1.7ed248p-9f:0x1.7ed24cp-9f -0x1.0c2c10p-6f:-0x1.0c2c14p-6f 0x1.61fdd0p-5f:0x1.61fdd4p-5f -0x1.3556aep-4f:-0x1.3556b2p-4f 0x1.b4e136p-4f:0x1.b4e13ap-4f -0x1.230ae0p-3f:-0x1.230ae4p-3f 0x1.9978ecp-3f:0x1.9978f0p-3f -0x1.5554d8p-2f:-0x1.5554dcp-2f 0.75:1

computes all the 6561 (3⁸) coefficient combinations ±1 ULP around njuffa's set for x in [0.75, 1]. (Indeed, it shows that decreasing C₁₇ by 1 ULP to 0x1.7ed248p-9f yields the exact same results.)

(That run took 90 seconds on Core i5-4200U at 2.6 GHz -- pretty much in line in my estimate of 30 coefficient sets per second per GHz per core. While this code is not threaded, the key functions are thread-safe, so threading should not be too difficult. This Core i5-4200U is a laptop, and gets pretty hot even when stressing just one core, so I didn't bother.)

(I consider the above code to be in public domain, or CC0-licensed where public domain dedication is not possible. In fact, I'm not sure if it is creative enough to be copyrightable at all. Anyway, feel free to use it anywhere in any way you wish, as long as you don't blame me if it breaks.)

Questions? Enhancements? Edits to fix Linux/GCC'isms are welcome!

Best machine-optimized polynomial minimax approximation to arctangent on [-1,1]?

4 Answers4

Linked