2

I am writing a raycaster, and I am trying to speed it up by making lookup tables for my most commonly called trig functions, namely sin, cos, and tan. This first snippet is my table lookup code. In order to avoid making a lookup table for each, I am just making one sin table, and defining cos(x) as sin(half_pi - x) and tan(x) as sin(x) / cos(x).

#include <math.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

const float two_pi = M_PI * 2, half_pi = M_PI / 2;

typedef struct {
    int fn_type, num_vals;
    double* vals, step;
} TrigTable;

static TrigTable sin_table;

TrigTable init_trig_table(const int fn_type, const int num_vals) {
    double (*trig_fn) (double), period;
    switch (fn_type) {
        case 0: trig_fn = sin, period = two_pi; break;
        case 1: trig_fn = cos, period = two_pi; break;
        case 2: trig_fn = tan, period = M_PI; break;
    }

    TrigTable table = {fn_type, num_vals,
        calloc(num_vals, sizeof(double)), period / num_vals};

    for (double x = 0; x < period; x += table.step)
        table.vals[(int) round(x / table.step)] = trig_fn(x);

    return table;
}

double _lookup(const TrigTable table, const double x) {
    return table.vals[(int) round(x / table.step)];
}

double lookup_sin(double x) {
    const double orig_x = x;
    if (x < 0) x = -x;
    if (x > two_pi) x = fmod(x, two_pi);

    const double result = _lookup(sin_table, x);
    return orig_x < 0 ? -result : result;
}

double lookup_cos(double x) {
    return lookup_sin(half_pi - x);
}

double lookup_tan(double x) {
    return lookup_sin(x) / lookup_cos(x);
}

Here is how I went about benchmarking my code: my function for the current time in milliseconds is from here. The problem arises here: when timing my lookup_sin vs math.h's sin, my variant takes around three times longer: Table time vs default: 328 ms, 108 ms.

Here is the timing for cos: Table time vs default: 332 ms, 109 ms

Here is the timing for tan: Table time vs default: 715 ms, 153 ms

What makes my code so much slower? I would think that precomputing sin values would greatly accelerate my code. Perhaps it's the fmod in the lookup_sin function? Please provide whatever insight that you have. I am compiling with clang with no optimizations enabled, so that the calls to each trig function are not removed (I am ignoring the return value).

const int64_t millis() {
    struct timespec now;
    timespec_get(&now, TIME_UTC);
    return ((int64_t) now.tv_sec) * 1000 + ((int64_t) now.tv_nsec) / 1000000;
}

const int64_t benchmark(double (*trig_fn) (double)) {
    const int64_t before = millis();

    for (double i = 0; i < 10000; i += 0.001)
        trig_fn(i);

    return millis() - before;
}

int main() {
    sin_table = init_trig_table(0, 15000);

    const int64_t table_time = benchmark(lookup_sin), default_time = benchmark(sin);
    printf("Table time vs default: %lld ms, %lld ms\n", table_time, default_time);

    free(sin_table.vals);
}
Caspian Ahlberg
  • 934
  • 10
  • 19
  • There's a saying about premature optimization... – Shawn Apr 14 '21 at 16:12
  • @Shawn Root of all evil, I know. It's just that my raycaster is taking close to 100% of my CPU time, and I want to cut that down. – Caspian Ahlberg Apr 14 '21 at 16:38
  • 2
    @CaspianAhlberg: Look for it to cast more rays per second, not to take less than 100% cpu time. – Mike Dunlavey Apr 14 '21 at 16:41
  • Given the low precision, consider `sinf, cosf` instead of `sin,cos` for initialization. – chux - Reinstate Monica Apr 14 '21 at 16:45
  • 4
    The lookup contain contains two divisions, in `fmod` and division by the step size. Division is notoriously slow. It contains several tests that may result in suboptimal test-and-branch instructions. (High performance code often reduces control flow branching in favor of data manipulation.) I do not see the step size stated in the question. If the table is large, then it will not remain in cache and memory lookup will be slow. – Eric Postpischil Apr 14 '21 at 16:59
  • 4
    Re “I am compiling with clang with no optimizations enabled”: Hopefully you mean the timing program is compiled without optimization and not the lookup routine. – Eric Postpischil Apr 14 '21 at 17:00
  • The library routines probably reduce the argument by multiplying by a prepared inverse of 2π (or some fraction thereof) and taking the fractional part. This avoids the division. Search for “Payne Hanek” for further information. Likely a fused multiply-add and extended-precision techniques are used. And this reduces modulo the period and provides quadrant (or smaller sector) information in a single operation (so you do not need both `fmod` and division by the step size). Then the sine is likely computed with a highly engineered polynomial, faster than memory lookup. – Eric Postpischil Apr 14 '21 at 17:05
  • Note that if the phase step is fixed, you can get a fast solution by directly performing complex calculations by multiplying with `dz = cos(step) + I * sin(step)`. I implemented it decades ago to simulate a Doppler channel. – Damien Apr 14 '21 at 17:07
  • (There’s actually a related function in the standard C library, `remquo`, that gives you both a remainder modulo `y` and the low bits of the quotient. The low bits provide a sector of the period, while the remainder is used to calculate the function within that sector, using coefficients tailored for that sector and selected by the integer bits. But it is sort of only for “amateur” use, as it performs a division rather than the multiplication described above and does not support extended precision, so it cannot support high-performance or high-precision routines.) – Eric Postpischil Apr 14 '21 at 17:10
  • Generally, you can replace `if (x > two_pi) x = fmod(x, two_pi);` by `if (x > two_pi) x -= two_pi;` And don't forget to optimise the compilation. – Damien Apr 14 '21 at 17:13
  • What operating system are you using? What architecture? – Eric Postpischil Apr 14 '21 at 17:14
  • @EricPostpischil Big Sur, 2017 Macbook Pro – Caspian Ahlberg Apr 14 '21 at 17:14
  • 3
    @CaspianAhlberg: Well, then the reason the system `sin` and `cos` are faster than your lookup code is because I wrote them. – Eric Postpischil Apr 14 '21 at 17:15
  • @EricPostpischil No way! That's awesome. – Caspian Ahlberg Apr 14 '21 at 17:21
  • @EricPostpischil Gotta love the [because I wrote them](https://stackoverflow.com/questions/67094804/c-trigonometry-lookup-table-slower-than-math-hs-functions/67096198?noredirect=1#comment118598316_67094804). – chux - Reinstate Monica Apr 14 '21 at 17:36
  • 1
    @CaspianAhlberg Generally speaking, the use of lookup tables for math function implementation in software is rarely the best choice these days. My own professional experience in this regard largely matches the points made in this paper: Marat Dukhan and Richard Vuduc, "Methods for high-throughput computation of elementary functions". In *Parallel Processing and Applied Mathematics*, pp. 86-95. Springer, 2014. – njuffa Apr 14 '21 at 20:08
  • Time the routines with all arguments in [0, π/2], with arguments around 4, and with arguments around 1e100. As I recall, they are dispatched to at least three cases depending on how much argument reduction needs to be done, so the cases with larger arguments should be slower. I am not sure of the bounds between the cases, but those are likely targets. – Eric Postpischil Apr 14 '21 at 23:21

1 Answers1

2

Reduce the floating point math.

OP's code is doing excessive FP math in what should be a scale and lookup.

Scale the radians per a pre-computed factor into an index.

The number of entries in the lookup table should be an unsigned power-of-2 so the mod is a simple &.

At first, let us simplify and have [0 ... 2*pi) map to indexes [0 ... number_of_entries) to demo the idea.

double lookup_sin_alt(double x) {
  long scaled_x = lround(x * scale_factor);  // This should be the _only_ line of FP code
  // All following code is integer code.
  scaled_x += number_of_entries/4 ; // If we are doing cosine
  unsigned index = scaled_x & (number_of_entries - 1);  // This & replaces fmod
  double result = table.vals[index];
  return result;
}

Later we can use a quarter size table [0 ... pi/2] and steer selection/reconstruction with integer operations.

Given OP's low precision requirements, consider using float instead of double throughout including float functions like lroundf().

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256