This is not an answer, but an extended comment too.
Recent Intel CPUs and some future AMD CPUs have AVX2. In Linux, look for avx2
flag in /proc/cpuinfo
to see if your CPU supports these.
AVX2 is an extension that allows us to construct and compute using 256-bit vectors -- for example, eight single-precision numbers, or four double-precision numbers -- instead of just scalars. It includes FMA3 support, meaning fused multiply-add for such vectors. Simply put, AVX2 allows us to evaluate eight polynoms in parallel, in pretty much the same time as we evaluate a single one using scalar operations.
The function error8()
analyses one set of coefficients, using predefined values of x
, comparing against precalculated values of atan(x)
, and returns the error in ULPs (below and above the desired result separately), as well as the number of results that match the desired floating-point value exactly. These are not needed for simply testing whether a set of coefficients is better than the currently best known set, but allow different strategies on which coefficients to test. (Basically, the maximum error in ULPs forms a surface, and we're trying to find the lowest point on that surface; knowing the "height" of the surface at each point allows us to make educated guesses as to which direction to go -- how to change the coefficients.)
There are four precalculated tables used: known_x
for the arguments, known_f
for the correctly-rounded single-precision results, known_a
for the double-precision "accurate" value (I'm just hoping the library atan()
is precise enough for this -- but one should not rely on it without checking!), and known_m
to scale the double-precision difference to ULPs. Given a desired range in arguments, the precalculate()
function will precalculate these using the library atan()
function. (It also relies on IEEE-754 floating-point formats and float and integer byte order being the same, but this is true on the CPUs this code runs on.)
Note that the known_x
, known_f
, and known_a
arrays could be stored in binary files; the known_m
contents are trivially derived from known_a
. Using the library atan()
without verifying it is not a good idea -- but because mine match njuffa's results, I didn't bother to look for a better reference atan()
.
For simplicity, here is the code in the form of an example program:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#include <errno.h>
/** poly8() - Compute eight polynomials in parallel.
* @x - the arguments
* @c - the coefficients.
*
* The first coefficients are for degree 17, the second
* for degree 15, and so on, down to degree 3.
*
* The compiler should vectorize the expression using vfmaddXXXps
* given an AVX2-capable CPU; for example, Intel Haswell,
* Broadwell, Haswell E, Broadwell E, Skylake, or Cannonlake;
* or AMD Excavator CPUs. Tested on Intel Core i5-4200U.
*
* Using GCC-4.8.2 and
* gcc -O2 -march=core-avx2 -mtune=generic
* this code produces assembly (AT&T syntax)
* vmulps %ymm0, %ymm0, %ymm2
* vmovaps (%rdi), %ymm1
* vmovaps %ymm0, %ymm3
* vfmadd213ps 32(%rdi), %ymm2, %ymm1
* vfmadd213ps 64(%rdi), %ymm2, %ymm1
* vfmadd213ps 96(%rdi), %ymm2, %ymm1
* vfmadd213ps 128(%rdi), %ymm2, %ymm1
* vfmadd213ps 160(%rdi), %ymm2, %ymm1
* vfmadd213ps 192(%rdi), %ymm2, %ymm1
* vfmadd213ps 224(%rdi), %ymm2, %ymm1
* vmulps %ymm2, %ymm1, %ymm0
* vfmadd132ps %ymm3, %ymm3, %ymm0
* ret
* if you omit the 'static inline'.
*/
static inline __v8sf poly8(const __v8sf x, const __v8sf *const c)
{
const __v8sf xx = x * x;
return (((((((c[0]*xx + c[1])*xx + c[2])*xx + c[3])*xx + c[4])*xx + c[5])*xx + c[6])*xx + c[7])*xx*x + x;
}
/** error8() - Calculate maximum error in ULPs
* @x - the arguments
* @co - { C17, C15, C13, C11, C9, C7, C5, C3 }
* @f - the correctly rounded results in single precision
* @a - the expected results in double precision
* @m - 16777216.0 raised to the same power of two as @a normalized
* @n - number of vectors to test
* @max_under - pointer to store the maximum underflow (negative, in ULPs) to
* @max_over - pointer to store the maximum overflow (positive, in ULPs) to
* Returns the number of correctly rounded float results, 0..8*n.
*/
size_t error8(const __v8sf *const x, const float *const co,
const __v8sf *const f, const __v4df *const a, const __v4df *const m,
const size_t n,
float *const max_under, float *const max_over)
{
const __v8sf c[8] = { { co[0], co[0], co[0], co[0], co[0], co[0], co[0], co[0] },
{ co[1], co[1], co[1], co[1], co[1], co[1], co[1], co[1] },
{ co[2], co[2], co[2], co[2], co[2], co[2], co[2], co[2] },
{ co[3], co[3], co[3], co[3], co[3], co[3], co[3], co[3] },
{ co[4], co[4], co[4], co[4], co[4], co[4], co[4], co[4] },
{ co[5], co[5], co[5], co[5], co[5], co[5], co[5], co[5] },
{ co[6], co[6], co[6], co[6], co[6], co[6], co[6], co[6] },
{ co[7], co[7], co[7], co[7], co[7], co[7], co[7], co[7] } };
__v4df min = { 0.0, 0.0, 0.0, 0.0 };
__v4df max = { 0.0, 0.0, 0.0, 0.0 };
__v8si eqs = { 0, 0, 0, 0, 0, 0, 0, 0 };
size_t i;
for (i = 0; i < n; i++) {
const __v8sf v = poly8(x[i], c);
const __v4df d0 = { v[0], v[1], v[2], v[3] };
const __v4df d1 = { v[4], v[5], v[6], v[7] };
const __v4df err0 = (d0 - a[2*i+0]) * m[2*i+0];
const __v4df err1 = (d1 - a[2*i+1]) * m[2*i+1];
eqs -= (__v8si)_mm256_cmp_ps(v, f[i], _CMP_EQ_OQ);
min = _mm256_min_pd(min, err0);
max = _mm256_max_pd(max, err1);
min = _mm256_min_pd(min, err1);
max = _mm256_max_pd(max, err0);
}
if (max_under) {
if (min[0] > min[1]) min[0] = min[1];
if (min[0] > min[2]) min[0] = min[2];
if (min[0] > min[3]) min[0] = min[3];
*max_under = min[0];
}
if (max_over) {
if (max[0] < max[1]) max[0] = max[1];
if (max[0] < max[2]) max[0] = max[2];
if (max[0] < max[3]) max[0] = max[3];
*max_over = max[0];
}
return (size_t)((unsigned int)eqs[0])
+ (size_t)((unsigned int)eqs[1])
+ (size_t)((unsigned int)eqs[2])
+ (size_t)((unsigned int)eqs[3])
+ (size_t)((unsigned int)eqs[4])
+ (size_t)((unsigned int)eqs[5])
+ (size_t)((unsigned int)eqs[6])
+ (size_t)((unsigned int)eqs[7]);
}
/** precalculate() - Allocate and precalculate tables for error8().
* @x0 - First argument to precalculate
* @x1 - Last argument to precalculate
* @xptr - Pointer to a __v8sf pointer for the arguments
* @fptr - Pointer to a __v8sf pointer for the correctly rounded results
* @aptr - Pointer to a __v4df pointer for the comparison results
* @mptr - Pointer to a __v4df pointer for the difference multipliers
* Returns the vector count if successful,
* 0 with errno set otherwise.
*/
size_t precalculate(const float x0, const float x1,
__v8sf **const xptr, __v8sf **const fptr,
__v4df **const aptr, __v4df **const mptr)
{
const size_t align = 64;
unsigned int i0, i1;
size_t n, i, sbytes, dbytes;
__v8sf *x = NULL;
__v8sf *f = NULL;
__v4df *a = NULL;
__v4df *m = NULL;
if (!xptr || !fptr || !aptr || !mptr) {
errno = EINVAL;
return (size_t)0;
}
memcpy(&i0, &x0, sizeof i0);
memcpy(&i1, &x1, sizeof i1);
i0 ^= (i0 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
i1 ^= (i1 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
if (i1 > i0)
n = (((size_t)i1 - (size_t)i0) | (size_t)7) + (size_t)1;
else
if (i0 > i1)
n = (((size_t)i0 - (size_t)i1) | (size_t)7) + (size_t)1;
else {
errno = EINVAL;
return (size_t)0;
}
sbytes = n * sizeof (float);
if (sbytes % align)
sbytes += align - (sbytes % align);
dbytes = n * sizeof (double);
if (dbytes % align)
dbytes += align - (dbytes % align);
if (posix_memalign((void **)&x, align, sbytes)) {
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&f, align, sbytes)) {
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&a, align, dbytes)) {
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&m, align, dbytes)) {
free(a);
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (x1 > x0) {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] > x1)
xp[i] = x1;
} else {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, -HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] < x1)
xp[i] = x1;
}
{
const float *const xp = (const float *)x;
float *const fp = (float *)f;
double *const ap = (double *)a;
double *const mp = (double *)m;
for (i = 0; i < n; i++) {
const float curr = xp[i];
int temp;
fp[i] = atanf(curr);
ap[i] = atan((double)curr);
(void)frexp(ap[i], &temp);
mp[i] = ldexp(16777216.0, temp);
}
}
*xptr = x;
*fptr = f;
*aptr = a;
*mptr = m;
errno = 0;
return n/8;
}
static int parse_range(const char *const str, float *const range)
{
float fmin, fmax;
char dummy;
if (sscanf(str, " %f %f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f:%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f,%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f/%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff %ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff:%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff,%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff/%ff %c", &fmin, &fmax, &dummy) == 2) {
if (range) {
range[0] = fmin;
range[1] = fmax;
}
return 0;
}
if (sscanf(str, " %f %c", &fmin, &dummy) == 1 ||
sscanf(str, " %ff %c", &fmin, &dummy) == 1) {
if (range) {
range[0] = fmin;
range[1] = fmin;
}
return 0;
}
return errno = ENOENT;
}
static int fix_range(float *const f)
{
if (f && f[0] > f[1]) {
const float tmp = f[0];
f[0] = f[1];
f[1] = tmp;
}
return f && isfinite(f[0]) && isfinite(f[1]) && (f[1] >= f[0]);
}
static const char *f2s(char *const buffer, const size_t size, const float value, const char *const invalid)
{
char format[32];
float parsed;
int decimals, length;
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%df", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%dg", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
length = snprintf(buffer, size, "%a", value);
if (length < 1 || length >= (int)size)
return invalid;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
return invalid;
}
int main(int argc, char *argv[])
{
float xrange[2] = { 0.75f, 1.00f };
float c17range[2], c15range[2], c13range[2], c11range[2];
float c9range[2], c7range[2], c5range[2], c3range[2];
float c[8];
__v8sf *known_x;
__v8sf *known_f;
__v4df *known_a;
__v4df *known_m;
size_t known_n;
if (argc != 10 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s C17 C15 C13 C11 C9 C7 C5 C3 x\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "Each of the coefficients can be a constant or a range,\n");
fprintf(stderr, "for example 0.25 or 0.75:1. x must be a non-empty range.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_range(argv[1], c17range) || !fix_range(c17range)) {
fprintf(stderr, "%s: Invalid C17 range or constant.\n", argv[1]);
return EXIT_FAILURE;
}
if (parse_range(argv[2], c15range) || !fix_range(c15range)) {
fprintf(stderr, "%s: Invalid C15 range or constant.\n", argv[2]);
return EXIT_FAILURE;
}
if (parse_range(argv[3], c13range) || !fix_range(c13range)) {
fprintf(stderr, "%s: Invalid C13 range or constant.\n", argv[3]);
return EXIT_FAILURE;
}
if (parse_range(argv[4], c11range) || !fix_range(c11range)) {
fprintf(stderr, "%s: Invalid C11 range or constant.\n", argv[4]);
return EXIT_FAILURE;
}
if (parse_range(argv[5], c9range) || !fix_range(c9range)) {
fprintf(stderr, "%s: Invalid C9 range or constant.\n", argv[5]);
return EXIT_FAILURE;
}
if (parse_range(argv[6], c7range) || !fix_range(c7range)) {
fprintf(stderr, "%s: Invalid C7 range or constant.\n", argv[6]);
return EXIT_FAILURE;
}
if (parse_range(argv[7], c5range) || !fix_range(c5range)) {
fprintf(stderr, "%s: Invalid C5 range or constant.\n", argv[7]);
return EXIT_FAILURE;
}
if (parse_range(argv[8], c3range) || !fix_range(c3range)) {
fprintf(stderr, "%s: Invalid C3 range or constant.\n", argv[8]);
return EXIT_FAILURE;
}
if (parse_range(argv[9], xrange) || xrange[0] == xrange[1] ||
!isfinite(xrange[0]) || !isfinite(xrange[1])) {
fprintf(stderr, "%s: Invalid x range.\n", argv[9]);
return EXIT_FAILURE;
}
known_n = precalculate(xrange[0], xrange[1], &known_x, &known_f, &known_a, &known_m);
if (!known_n) {
if (errno == ENOMEM)
fprintf(stderr, "Not enough memory for precalculated tables.\n");
else
fprintf(stderr, "Invalid (empty) x range.\n");
return EXIT_FAILURE;
}
fprintf(stderr, "Precalculated %lu arctangents to compare to.\n", 8UL * (unsigned long)known_n);
fprintf(stderr, "\nC17 C15 C13 C11 C9 C7 C5 C3 max-ulps-under max-ulps-above correctly-rounded percentage cycles\n");
fflush(stderr);
{
const double percent = 12.5 / (double)known_n;
size_t rounded;
char c17buffer[64], c15buffer[64], c13buffer[64], c11buffer[64];
char c9buffer[64], c7buffer[64], c5buffer[64], c3buffer[64];
char minbuffer[64], maxbuffer[64];
float minulps, maxulps;
unsigned long tsc_start, tsc_stop;
for (c[0] = c17range[0]; c[0] <= c17range[1]; c[0] = nextafterf(c[0], HUGE_VALF))
for (c[1] = c15range[0]; c[1] <= c15range[1]; c[1] = nextafterf(c[1], HUGE_VALF))
for (c[2] = c13range[0]; c[2] <= c13range[1]; c[2] = nextafterf(c[2], HUGE_VALF))
for (c[3] = c11range[0]; c[3] <= c11range[1]; c[3] = nextafterf(c[3], HUGE_VALF))
for (c[4] = c9range[0]; c[4] <= c9range[1]; c[4] = nextafterf(c[4], HUGE_VALF))
for (c[5] = c7range[0]; c[5] <= c7range[1]; c[5] = nextafterf(c[5], HUGE_VALF))
for (c[6] = c5range[0]; c[6] <= c5range[1]; c[6] = nextafterf(c[6], HUGE_VALF))
for (c[7] = c3range[0]; c[7] <= c3range[1]; c[7] = nextafterf(c[7], HUGE_VALF)) {
tsc_start = __builtin_ia32_rdtsc();
rounded = error8(known_x, c, known_f, known_a, known_m, known_n, &minulps, &maxulps);
tsc_stop = __builtin_ia32_rdtsc();
printf("%-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %lu %.3f %lu\n",
f2s(c17buffer, sizeof c17buffer, c[0], "?"),
f2s(c15buffer, sizeof c15buffer, c[1], "?"),
f2s(c13buffer, sizeof c13buffer, c[2], "?"),
f2s(c11buffer, sizeof c11buffer, c[3], "?"),
f2s(c9buffer, sizeof c9buffer, c[4], "?"),
f2s(c7buffer, sizeof c7buffer, c[5], "?"),
f2s(c5buffer, sizeof c5buffer, c[6], "?"),
f2s(c3buffer, sizeof c3buffer, c[7], "?"),
f2s(minbuffer, sizeof minbuffer, minulps, "?"),
f2s(maxbuffer, sizeof maxbuffer, maxulps, "?"),
rounded, (double)rounded * percent,
(unsigned long)(tsc_stop - tsc_start));
fflush(stdout);
}
}
return EXIT_SUCCESS;
}
The code does compile using GCC-4.8.2 on Linux, but might have to be modified for other compilers and/or OSes. (I'd be happy to include/accept edits fixing those, though. I just don't have Windows or ICC myself so I could check.)
To compile this, I recommend
gcc -Wall -O3 -fomit-frame-pointer -march=native -mtune=native example.c -lm -o example
Run without arguments to see usage; or
./example 0x1.7ed24ap-9f -0x1.0c2c12p-6f 0x1.61fdd2p-5f -0x1.3556b0p-4f 0x1.b4e138p-4f -0x1.230ae2p-3f 0x1.9978eep-3f -0x1.5554dap-2f 0.75:1
to check what it reports for njuffa's coefficient set, compared against standard C library atan()
function, with all possible x in [0.75, 1] considered.
Instead of a fixed coefficient, you can also use min:max
to define a range to scan (scanning all unique single-precision floating-point values). Each possible combination of the coefficients is tested.
Because I prefer decimal notation, but need to keep the values exact, I use the f2s()
function to display the floating-point values. It is a simple brute-force helper function, that uses the shortest formatting that yields the same value when parsed back to float.
For example,
./example 0x1.7ed248p-9f:0x1.7ed24cp-9f -0x1.0c2c10p-6f:-0x1.0c2c14p-6f 0x1.61fdd0p-5f:0x1.61fdd4p-5f -0x1.3556aep-4f:-0x1.3556b2p-4f 0x1.b4e136p-4f:0x1.b4e13ap-4f -0x1.230ae0p-3f:-0x1.230ae4p-3f 0x1.9978ecp-3f:0x1.9978f0p-3f -0x1.5554d8p-2f:-0x1.5554dcp-2f 0.75:1
computes all the 6561 (38) coefficient combinations ±1 ULP around njuffa's set for x in [0.75, 1]. (Indeed, it shows that decreasing C17 by 1 ULP to 0x1.7ed248p-9f yields the exact same results.)
(That run took 90 seconds on Core i5-4200U at 2.6 GHz -- pretty much in line in my estimate of 30 coefficient sets per second per GHz per core. While this code is not threaded, the key functions are thread-safe, so threading should not be too difficult. This Core i5-4200U is a laptop, and gets pretty hot even when stressing just one core, so I didn't bother.)
(I consider the above code to be in public domain, or CC0-licensed where public domain dedication is not possible. In fact, I'm not sure if it is creative enough to be copyrightable at all. Anyway, feel free to use it anywhere in any way you wish, as long as you don't blame me if it breaks.)
Questions? Enhancements? Edits to fix Linux/GCC'isms are welcome!