vectorizing branched table lookup in SSE fast approximate cosine

Question

I'm making a small game engine for personal use. The target architecture is x86_64 preferably with SSE2.

The sine/cosine function is one of the core parts, and it's implemented as a precomputed table of 1024 cosine values for the input range [0, π / 2].

The scalar implementation is quite straightforward.

typedef unsigned uns;
typedef float flt;

enum {COS_TABLE_SIZE = 1 << 10};
extern flt COS_TABLE[COS_TABLE_SIZE];

flt f(uns i) {
    flt *t = COS_TABLE;
    uns z = COS_TABLE_SIZE;
    switch (i / z) {
    case 0:
        return +t[+(i - z * 0) + 0];
    case 1:
        return -t[-(i - z * 1) + z];
    case 2:
        return -t[+(i - z * 2) + 0];
    case 3:
        return +t[-(i - z * 3) + z];
    default:
        __builtin_unreachable();
    }
}

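The code isn't tested yet, so there could be an error in the math. One likely candidate: cases 1 and 3 read t[z] when i is an exact multiple of z, which is one past the end of a z-entry table.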

Modern compilers aren't sophisticated enough to generate good code for the most naive approach to vectorization.

typedef uns u32;
typedef u32 vec_u32 __attribute__((vector_size(16)));
typedef flt vec_flt __attribute__((vector_size(16)));

vec_flt fv(vec_u32 i) {
    vec_flt r;
    for (uns j = 0; j < 4; ++j) {
        r[j] = f(i[j]);
    }
    return r;
}

Both GCC and Clang produce horrible code for fv. So I decided to do the vectorization manually.

In the code below, the part above /*---*/ is not the main subject of this question: it converts the branching in the scalar version into a branchless vectorized form. Please do comment if there is room for improvement in that part.

This question is about the lines below /*---*/. The problem is to build a vector from a vector of indices by doing a table look-up with each index. In C the upper part looks more complex, but at the machine level the lower part is more expensive: extracting each index from a vector separately and then reassembling a vector from the results doesn't seem to be a simple task.

What is an efficient way to deal with such a problem? It is a personal project, so any kind of restructuring is possible.

I prefer SSE2 for portability, but if there is a better solution available in later extensions, it would be good to know.

static uns uns_log2(uns x) {
    if (__builtin_constant_p(x)) {
        return 31 - __builtin_clz(x);
    }
    uns r = 0;
    /* {AT&T|Intel} operand order, so this assembles correctly under either -masm setting. */
    __asm__ ("bsr\t{%1, %0|%0, %1}" : "+r"(r) : "r"(x));
    return r;
}

static u32 flt_reint_u32(flt x) {
    u32 r;
    memcpy(&r, &x, sizeof(x));
    return r;
}

static flt u32_reint_flt(u32 x) {
    flt r;
    memcpy(&r, &x, sizeof(x));
    return r;
}

static vec_u32 vec_u32_fill(u32 x) {
    return (vec_u32){x, x, x, x};
}

vec_flt fv2(vec_u32 i) {
    flt *t = COS_TABLE;
    uns z = COS_TABLE_SIZE;
    vec_u32 q = i >> uns_log2(z);
    i -= q << uns_log2(z);
    vec_u32 c = q == 1 | q == 3;
    i = i & ~c | z - i & c;
    vec_u32 s = vec_u32_fill(0x80000000);
    s &= ~(q == 0 | q == 3);
    
    /*---*/
    
    vec_flt r;
    for (uns j = 0; j < 4; ++j) {
        r[j] = u32_reint_flt(flt_reint_u32(t[i[j]]) ^ s[j]);
    }
    return r;
}

https://godbolt.org/z/aejc69q9Y

Have you looked at existing vectorized sin/cos library functions, like glibc's libmvec? (https://sourceware.org/glibc/wiki/libmvec) Looks like their [SSE4 version](https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/svml_s_cosf4_core_sse4.S.html) uses a different strategy, though, with a polynomial approximation. And only falling back to calling scalar cosf for some inputs. Yours is I guess aiming to be faster and lower-precision? — Peter Cordes, Apr 04 '22 at 21:37
Without AVX2 for gather loads, you're definitely going to have to extract to scalar for table lookups. When you need all 4 elements, probably best to just store to tmp array and scalar reload it, instead of doing movd + 3x pextrd. Especially with SSE2! You might still do the first with movd, and the rest as reloads. The manual gather should be 4x movss + 3 shuffles, or possibly 2x `movss + movhps` pairs to save front-end bandwidth (but still costs a p5 uop in the back end micro-fused with it) and one `shufps`, if cache-line splits don't make 8-byte loads worse. (Page-align your table) — Peter Cordes, Apr 04 '22 at 21:40
XOR is slow, IF-ELSE is slow, get rid of these. Try to use `register` keyword for the hot variables. Declare the hot variables as high as possible, best in function scope. Jump tables (`goto`) are faster than function calls. Try to write the hot stuff all within one singular function. — paladin, Apr 05 '22 at 02:30
As @PeterCordes said, doing a LUT without AVX2 is not possible in parallel. And you are likely much faster with a polynomial approximation. If you just need 10 bits of precision, 2 to 4 terms should be sufficient (you can also split the interval in two halves and calculate sine and cosine at the same time). — chtz, Apr 05 '22 at 08:16
@paladin The only thing I can partly agree with of your comment is that branching is (potentially) slow. But on what (modern) architecture is XOR slow, or which compiler needs the `register` keyword to decide what variables are hot? Also, compilers are pretty good at inlining, so no need to write big monolithic functions. — chtz, Apr 05 '22 at 08:20
@paladin: if/else can be slow if done on runtime variables, but `__builtin_constant_p()` is evaluated at compile time. Integer `^` is one of the cheapest integer operations, and x86 can PXOR on 128 bits at once. The `register` keyword has literally no effect on performance when optimization is enabled in modern compilers. Declaring variables earlier also has no effect. These helper functions will all inline; the only slow thing is the `switch` in the first version. Have you [looked at optimized asm output from a compiler](//stackoverflow.com/q/38552116/224132) in the past 2 decades? — Peter Cordes, Apr 05 '22 at 08:23
Related question: https://stackoverflow.com/questions/18662261/fastest-implementation-of-sine-cosine-and-square-root-in-c-doesnt-need-to-b — chtz, Apr 05 '22 at 08:37
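
The store-and-reload gather suggested in the comments above can be sketched with SSE2 intrinsics as follows. This is only an illustration under assumptions: the helper name gather4_xor_sign is hypothetical, the code is not taken from the question or from any commenter, and whether it beats what the compiler generates for the loop in fv2 would have to be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */
#include <stdint.h>

/* Hypothetical helper: load table[idx[j]] for each of the four 32-bit
   indices in idx, then XOR the per-lane sign masks in sign into the result. */
static __m128 gather4_xor_sign(const float *table, __m128i idx, __m128i sign) {
    /* Spill the indices to a small array and reload them as scalars. */
    uint32_t k[4];
    _mm_storeu_si128((__m128i *)k, idx);

    /* Four scalar loads (movss); each one zeroes the upper three lanes. */
    __m128 a = _mm_load_ss(table + k[0]);
    __m128 b = _mm_load_ss(table + k[1]);
    __m128 c = _mm_load_ss(table + k[2]);
    __m128 d = _mm_load_ss(table + k[3]);

    /* Three shuffles: {a, b, 0, 0}, {c, d, 0, 0}, then {a, b, c, d}. */
    __m128 ab = _mm_unpacklo_ps(a, b);
    __m128 cd = _mm_unpacklo_ps(c, d);
    __m128 r = _mm_movelh_ps(ab, cd);

    /* Flip the signs of all four lanes with a single XOR. */
    return _mm_xor_ps(r, _mm_castsi128_ps(sign));
}
```

In fv2 this would replace the loop below /*---*/, with i passed as idx and s as sign. The indices must already be valid table offsets, since nothing here checks bounds.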

xiver77 · Answer 1 · 2022-04-08T12:39:26.393

Thanks for all the comments. This is my current solution.

Table look-ups with vector indices cannot be done efficiently without AVX2. Fortunately, for sine and cosine, a Taylor-style polynomial apparently converges quite fast. With only 4 terms (an even polynomial of degree 6), the maximum absolute error is 0.0000068 and the average absolute error is 0.0000042.

For a fixed number of terms, the coefficients can be optimized to minimize the error instead of using the plain Taylor coefficients. The precomputed coefficients are from this website (https://publik-void.github.io/sin-cos-approximations/, Cos, abs. error minimized, degree 6).

This is the whole assembly output for cosine. The input should always be in the range [-π, π]; inputs with a magnitude above π / 2 are folded back using cos(π - x) = -cos(x).

vec_flt_cos:
    andps   xmm0, XMMWORD PTR .LC0[rip]
    movaps  xmm1, XMMWORD PTR .LC2[rip]
    movaps  xmm3, XMMWORD PTR .LC1[rip]
    movaps  xmm2, XMMWORD PTR .LC3[rip]
    subps   xmm1, xmm0
    cmpltps xmm3, xmm0
    pxor    xmm1, xmm0
    pand    xmm1, xmm3
    pxor    xmm1, xmm0
    movdqa  xmm0, XMMWORD PTR .LC7[rip]
    mulps   xmm1, xmm1
    pand    xmm0, xmm3
    mulps   xmm2, xmm1
    addps   xmm2, XMMWORD PTR .LC4[rip]
    mulps   xmm2, xmm1
    addps   xmm2, XMMWORD PTR .LC5[rip]
    mulps   xmm1, xmm2
    addps   xmm1, XMMWORD PTR .LC6[rip]
    pxor    xmm0, xmm1
    ret
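
For readers who prefer C to assembly, one lane of the computation above might look like the following scalar sketch. The function name cos_poly_scalar is hypothetical. The coefficients are copied from vec_flt_cos in the test code further down, and the comments mapping steps to instructions are a reading of the listing, not compiler-verified annotations.

```c
/* Scalar sketch of one lane of vec_flt_cos; intended for x in [-pi, pi]. */
static float cos_poly_scalar(float x) {
    const float pi = 3.14159265358979323846f;
    const float half_pi = 1.57079632679489661923f;
    /* Coefficients c[0]..c[3] as listed in vec_flt_cos. */
    const float c0 = 0.999993295282167421664399661287022669f;
    const float c1 = -0.49991243971224581435251505760757806f;
    const float c2 = 0.0414877480454292132253667471195955447f;
    const float c3 = -0.00127120948569655081466419067530634131f;

    /* cos is even, so work on |x| (the andps with a sign-clearing mask). */
    x = x < 0.0f ? -x : x;
    /* Inputs above pi/2 are mirrored (the cmpltps mask and the blend). */
    int flip = x > half_pi;
    if (flip) {
        x = pi - x;
    }
    /* Horner evaluation of a cubic in x^2 (the mulps/addps chain). */
    float x2 = x * x;
    float p = x2 * (x2 * (x2 * c3 + c2) + c1) + c0;
    /* cos(pi - x) = -cos(x), so undo the mirroring with a sign flip. */
    return flip ? -p : p;
}
```

The vector version does the same work without branches: the mask m selects between x and π - x, and the final XOR with m & 0x80000000 plays the role of the conditional negation.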

Below is the code for testing the accuracy, if you're interested.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <math.h>

#define FLT_PI 3.14159265358979323846f
#define FLT_DPI 6.28318530717958647693f
#define FLT_HPI 1.57079632679489661923f

typedef unsigned uns;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int32_t i32;
typedef float flt;
typedef double dbl;
typedef flt vec_flt __attribute__((vector_size(16)));
typedef u32 vec_u32 __attribute__((vector_size(16)));
typedef i32 vec_i32 __attribute__((vector_size(16)));

static vec_flt vec_flt_fill(flt x) {
    return (vec_flt){x, x, x, x};
}

static vec_flt vec_u32_reint_flt(vec_u32 x) {
    vec_flt r;
    memcpy(&r, &x, sizeof(x));
    return r;
}

static vec_u32 vec_flt_reint_u32(vec_flt x) {
    vec_u32 r;
    memcpy(&r, &x, sizeof(x));
    return r;
}

static vec_flt vec_flt_sq(vec_flt x) {
    return x * x;
}

static vec_flt vec_flt_abs(vec_flt x) {
    return vec_u32_reint_flt(vec_flt_reint_u32(x) & 0x7fffffff);
}

vec_flt vec_flt_cos(vec_flt x) {
    flt c[] = {
        0.999993295282167421664399661287022669f,
        -0.49991243971224581435251505760757806f,
        0.0414877480454292132253667471195955447f,
        -0.00127120948569655081466419067530634131f
    };
    x = vec_flt_abs(x);
    vec_u32 m = x > FLT_HPI;
    x = vec_flt_sq(
        vec_u32_reint_flt(
            m & vec_flt_reint_u32(FLT_PI - x) | ~m & vec_flt_reint_u32(x)
        )
    );
    return vec_u32_reint_flt(
        vec_flt_reint_u32(x * (x * (x * c[3] + c[2]) + c[1]) + c[0]) ^
        m & 0x80000000
    );
}

vec_flt vec_flt_sin(vec_flt x) {
    x -= FLT_HPI;
    return vec_flt_cos(
        x + vec_u32_reint_flt(
            x < -FLT_PI & vec_flt_reint_u32(vec_flt_fill(FLT_DPI))
        )
    );
}

enum {Z = 200000000};

static flt th(uns i) {
    return (flt)i * (FLT_DPI / (flt)(Z - 1)) - FLT_PI;
}

static void compute(vec_flt (*af)(vec_flt), vec_flt (*ef)(vec_flt), char *id) {
    static flt ap[Z], ex[Z];
    for (uns i = 0; i < Z; i += 4) {
        vec_flt x;
        for (uns j = 0; j < 4; ++j) {
            x[j] = th(i + j);
        }
        vec_flt r = af(x);
        memcpy(ap + i, &r, sizeof(r));
        r = ef(x);
        memcpy(ex + i, &r, sizeof(r));
    }
    dbl sum = 0.0;
    dbl max = 0.0;
    for (uns i = 0; i < Z; ++i) {
        dbl e = fabs((double)(ap[i] - ex[i]));
        sum += e;
        if (e > max) {
            max = e;
        }
    }
    printf("(%s) avg: %.12f max: %.12f\n", id, sum / (dbl)Z, max);
}

static vec_flt excos(vec_flt x) {
    return (vec_flt){
        (flt)cos((dbl)x[0]), (flt)cos((dbl)x[1]),
        (flt)cos((dbl)x[2]), (flt)cos((dbl)x[3])
    };
}

static vec_flt exsin(vec_flt x) {
    return (vec_flt){
        (flt)sin((dbl)x[0]), (flt)sin((dbl)x[1]),
        (flt)sin((dbl)x[2]), (flt)sin((dbl)x[3])
    };
}

int main(void) {
    /* First argument is the approximation, second is the reference. */
    compute(vec_flt_cos, excos, "vec_flt_cos");
    compute(vec_flt_sin, exsin, "vec_flt_sin");
    return 0;
}

Looks reasonable. Being able to skip range-reduction by requiring limited-range input helps a lot, I'd imagine. — Peter Cordes, Apr 08 '22 at 13:38
@PeterCordes Making an arbitrary input fit into the required range isn't *that* expensive (11 instructions, https://godbolt.org/z/EGbxdvs6f), but the main problem is that the error gets bigger as the initial input's absolute value gets bigger with the method I used. The error will remain if I subtract with a loop until it fits, but that's not efficient. — xiver77, Apr 08 '22 at 14:24
@PeterCordes BTW if you see the Godbolt link above, GCC uses `comiss` to branch while Clang chooses `cmpss` without a branch. The branching version won't make sense with the vectorized version, but in the given scalar code, which compiler is producing better code? — xiver77, Apr 08 '22 at 14:27
Yeah, range-reduction is a hard problem for accuracy. To really do it well, you'd need extended-precision or something, e.g. for the value of Pi. https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/ points out the problem for fsin as you approach sin(+Pi) ~= 0.0 — Peter Cordes, Apr 08 '22 at 14:30
branchless is great if the branch is unpredictable, otherwise branchy shortens the critical path and may run fewer total instructions. Clang's implementation is a good branchless strategy, making good use of `cmpss` / `andps` it looks like. Only briefly skimmed, though. — Peter Cordes, Apr 08 '22 at 14:33
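
As a rough illustration of the range reduction discussed in these comments, a branchless SSE2 version could subtract the nearest whole number of turns. This is a sketch under assumptions, not the 11-instruction code behind the Godbolt link: the name wrap_to_pi_ps is hypothetical, it assumes the default round-to-nearest MXCSR mode, and it only works while x / 2π fits in a 32-bit integer. As noted above, the error still grows with the magnitude of the input because 2π is not exactly representable as a float.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: wrap each lane of x into roughly [-pi, pi]. */
static __m128 wrap_to_pi_ps(__m128 x) {
    const __m128 two_pi = _mm_set1_ps(6.28318530717958647693f);
    const __m128 inv_two_pi = _mm_set1_ps(0.159154943091895335769f);

    /* cvtps2dq rounds to the nearest integer in the default rounding mode,
       giving the number of whole turns closest to x. */
    __m128i n = _mm_cvtps_epi32(_mm_mul_ps(x, inv_two_pi));
    __m128 turns = _mm_cvtepi32_ps(n);

    /* Remove the whole turns; the remainder lies near [-pi, pi]. */
    return _mm_sub_ps(x, _mm_mul_ps(turns, two_pi));
}
```

Because of rounding, the result can land slightly outside [-π, π] for inputs very close to an odd multiple of π, so a caller that relies on the exact bound would need to allow for that.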
