Fast but accurate trigonometric modulo range reduction for moderate magnitude inputs

Question

So far I'm finding this answer regarding Approximate cosine to be highly accurate, better than other solutions I've come up with using minimax algorithm on different kernel domains, etc... but I'm running into an issue: sin() using this code seems highly accurate, even on larger domains (e.g. -128pi to 128pi). cos() seems to require a better range reduction, as even on slightly higher input domain (-2pi to 2pi), I see 6 ulps error, and this increases quickly as domain increases.

So the question is, is there an accurate, but still fast, range reduction algorithm for moderate sized inputs (e.g. -10000 to 10000) that would still work for single-precision float? I can increase accuracy/domain by using double precision, but even for double inputs, such an algorithm would be useful. I'd prefer to stay away from precomputed tables if possible, as it makes SIMD solutions much harder to make performant.

[edit] To be clear, I don't expect an exact/1ulp solution for a huge range of floats (after all, radian values become silly in magnitudes where precision is super crappy), but I'd like something that would lead to <=1ulp range reduction in at least e.g. [-2Pi:Pi] or [-8Pi:8Pi].

Upon debugging, it seems that only a subset of numbers that range reduce to values pretty close to zero (order 1e-8) are causing issues, winding up about 4 ulps different depending on if I use double precision for range reduction or single precision. This winds up with values near zero (but with phase shift used differently in sin/cos, the function sensitivity near zero is different...).

Can you update your title so that it reflects the actual problem? Because right now it could be about literally anything that takes input. Also note that cosine is modular, so "larger inputs" are irrelevant: the only numbers that matter are those ranging from -pi to pi, and anything smaller or larger simply reduces to that. So if you're not first modulo-reducing your inputs, there's your accuracy boost. — Mike 'Pomax' Kamermans, Aug 18 '23 at 16:18
Modulo reduction is the range reduction I'm talking about. I'll modify the question — graphicsMan, Aug 18 '23 at 16:25

score 1 · Answer 1 · answered Aug 18 '23 at 19:55

Okay, I found a reasonable answer to this. The original modulo range reduction method uses two rounds of fma with a single-precision estimate of pi/2 and then a pi/2 residual in single precision, which is basically float(double(pi_2) - float(pi_2)).

We can extend this (experimentally... it makes a kind of sense, but I don't have formal proof) by applying 3rd round of fma. The one trick is that we need to truncate the bottom bit of each of the first two estimates in succession so the fmas have a bit of extra working room before rounding.

With this simple technique, I am now able to get 1 ulp error all the way up to [-128K*Pi:128K*Pi] input domain, and only 2 ulp error around 32K*Pi. For fun, I can extend this to a fourth round and get only 2 ulp error at 1M*Pi.

Fast but accurate trigonometric modulo range reduction for moderate magnitude inputs

1 Answers1