So far I'm finding this answer regarding Approximate cosine to be highly accurate, better than other solutions I've come up with using minimax algorithm on different kernel domains, etc... but I'm running into an issue: sin() using this code seems highly accurate, even on larger domains (e.g. -128pi to 128pi). cos() seems to require a better range reduction, as even on slightly higher input domain (-2pi to 2pi), I see 6 ulps error, and this increases quickly as domain increases.
So the question is, is there an accurate, but still fast, range reduction algorithm for moderate sized inputs (e.g. -10000 to 10000) that would still work for single-precision float? I can increase accuracy/domain by using double precision, but even for double inputs, such an algorithm would be useful. I'd prefer to stay away from precomputed tables if possible, as it makes SIMD solutions much harder to make performant.
[edit] To be clear, I don't expect an exact/1ulp solution for a huge range of floats (after all, radian values become silly in magnitudes where precision is super crappy), but I'd like something that would lead to <=1ulp range reduction in at least e.g. [-2Pi:Pi] or [-8Pi:8Pi].
Upon debugging, it seems that only a subset of numbers that range reduce to values pretty close to zero (order 1e-8) are causing issues, winding up about 4 ulps different depending on if I use double precision for range reduction or single precision. This winds up with values near zero (but with phase shift used differently in sin/cos, the function sensitivity near zero is different...).