I'm trying to wrap my head around how people who code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.

There is a high need for fast trigonometric functions in those fields. At times you can optimize a sin, a cos, or another function by rewriting it in a different form that is valid only for a given interval; often this means that your approximation of f(x) holds only for the first quadrant, i.e. 0 <= x <= pi/2 .

Now the input for your f(x) still spans all 4 quadrants, but the real formula only covers 1/4 of that interval. The straightforward solution is to detect the quadrant by analyzing the input and seeing which range it belongs to, then adjust the result of the formula accordingly when the input comes from a quadrant other than the first.

This is good in theory but it also presents a couple of really bad problems, especially considering that you are doing all this to steal a couple of cycles from your CPU (you also get a consistent implementation that is not platform-dependent, unlike a hardwired fsin in Intel x86 asm that only works on x86 and has a certain error range; all of this may differ on other platforms with other asm instructions), so you should keep things working at a concurrent and high-performance level.

The reason I can't wrap my head around the "switch case" with quadrants solution is:

  • it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same function that actually computes f(x); the situation can probably be improved by implementing the formula for f(x) outside said function, but this will double the number of functions to maintain in any given math library
  • it increases the probability of branching under concurrent execution
  • generally speaking it doesn't lead to better, clean, DRY code, and conditional statements are often a potential source of bugs; I don't really like switch-cases and similar constructs.

Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs, and the quadrants, to the result via the actual implementation?

user2485710
  • to me, this seems like overkill micro-optimization. First measure, then optimize – Creris Oct 18 '14 at 21:22
  • @Creris to me it's not even an optimization, it's just that in my opinion that way of writing code is messy and error-prone, and doesn't go in the right direction . – user2485710 Oct 18 '14 at 21:24
  • I just tested it, and I can call Visual Studio's implementation of `std::sin` (timed with `clock`, which adds extra overhead) in Release mode 20 million times per second. Is that not enough? Again, useless microoptimization that is not backed with actual benchmarking – Creris Oct 18 '14 at 21:32
  • @Creris that's not what I'm talking about: 1) I need a custom implementation of all the math functions that I need so I can port them to new and different archs while enjoying the same behaviour, `std::sin` it's just a placeholder for an unspecified implementation that can be anything from `fsin` in Intel x86 asm, to SSE or vector instructions, and I'm not even considering possible variants on other platforms 2) So for any platform I get the same `y` for the same values of `x` 3) My real problem is more about idioms and solutions that gracefully derail into a clean and dry code . – user2485710 Oct 18 '14 at 21:37
  • That's not at all obvious from the question you wrote. What exactly are you asking? – Alan Stokes Oct 18 '14 at 21:43
  • Does anyone actually still use the `fsin` instruction? They shouldn't, really. It is slow... and not even precise when the argument is too big. – Marc Glisse Oct 18 '14 at 21:45
  • @MarcGlisse if you use `std::sin` in your code and `-mfpmath=387` when compiling with `gcc` you will get `fsin`; it's more popular than you probably think, and it also pops up with other kinds of optimization flags . I think that it will be there with `-mfpmath=sse` too, but I'm not too sure about this, I'll probably check on that. – user2485710 Oct 18 '14 at 22:01
  • @AlanStokes my `foo()` has to manage a range `X` of values, I have a formula that only computes a specific subset of that `X`, how to map the result of computing a subset of `X` to _full_ `X` considering the appropriate change in sign and values ? How you will code a `sin` function if I'll give to you a formula that is only valid for the first quadrant ? – user2485710 Oct 18 '14 at 22:03
  • Indeed gcc will generate the `fsin` instruction with `-m32 -O -ffast-math`, which is ironic since IIRC calling `sin` is faster even on non-sse machines... (and glibc's sin does not use fsin) – Marc Glisse Oct 18 '14 at 22:24
  • Why on earth are you calculating `sin` for values outside +/-2^63? It is quite pointless, and most likely, your result will be inaccurate since you have very little precision in the input number (in other words, since sin/cos repeats at 2pi intervals, if your number is outside of +/-2^63, it will not have the precision to distinguish where in 2pi the input is), so it's pretty pointless to do this in the first place. It would be great if you actually showed some real code that demonstrates what you are doing, and why it's a problem. – Mats Petersson Oct 18 '14 at 23:08
  • Listen to @uesp. I would only add, you seem to be asking on the basis of generalities, assuming that trig functions will use a large enough percent of time to worry about. They might, and just as easily, they might not. Thinking they will is like driving off randomly in the night, expecting to go someplace worthwhile. [*DIAGNOSE FIRST*](http://stackoverflow.com/a/927773/23771), then deal with fixes. If and only if the trig functions take more than 10% of time, figure out how to call them less, such as by memoizing. – Mike Dunlavey Oct 19 '14 at 04:37
  • re-read what you said: you even tag this question with optimization, and in the first 3 or so lines talk about optimization, and then you jump to precision. These are 2 different things, so you can't blame me for saying that this is microoptimization, when in fact your question changes halfway through – Creris Oct 19 '14 at 12:50

1 Answer


Note: In the below answer I am speaking very generally about code.

Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs, and the quadrants, to the result via the actual implementation?

The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.

I'm not entirely sure I follow most of your arguments/questions, but I have a feeling you are looking for problems where none really exist. Do you truly need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).

the straightforward solution is to detect the quadrant

Yes! I love straightforward code. Code where it is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or to avoid bugs outside your control. But the first version should be the most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking, and only if you find performance or other issues do you go into the crazy stuff.

This is good in theory but this also presents a couple of really bad problems,

I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.

A few things on a few of your specific comments:

  • approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One is simply that most trigonometric functions have identities, so you can easily use these to reduce the range of input parameters. This is important, as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyway for most numerical techniques to work, or at least to work in a decent amount of time with sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs.
  • it just prevents possible optimizations: Chances are your C++ compiler these days is way better at optimization than you are. Sometimes it isn't, but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need them. These days it is very non-intuitive to tell which code is faster just by looking at it (you can read all the questions on SO about performance issues and how surprising some of the root causes are).
  • namely memoization: I've never seen memoization in place for a function of a double. Just think how many doubles there are between 0 and 1. In reduced-accuracy situations you can take advantage of it, but this is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function in a way that actually means anything and doesn't lose accuracy or performance in the process.
  • increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner, but I suppose it's entirely possible to get some performance benefits. But again, the compiler is generally better at optimization than you, so let it do its job and then benchmark/profile to see if you really need to do better.
  • doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "dry code" is for that matter. Yes, sometimes you can get into trouble with too many or too complex if/switch blocks, but I can't see a simple check for 4 quadrants being a problem here...it's a pretty basic and simple case.
  • So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of a double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to run some tests and see what you get.
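The point in the first bullet about range reduction and convergence can be checked with a short experiment. The sketch below counts how many Taylor terms of cos() are needed before the next term drops below a tolerance; the function name and tolerance are my own choices, not anything standard:

```c
#include <math.h>

/* Count how many terms of the Taylor series
     cos(x) = sum_k (-1)^k * x^(2k) / (2k)!
   are needed before the next term's magnitude drops below tol.
   Small inputs converge far faster, which is why range reduction
   matters before applying a series like this. */
int cos_terms_needed(double x, double tol)
{
    double term = 1.0;  /* the k = 0 term */
    int k = 0;
    while (fabs(term) > tol && k < 1000) {
        k++;
        /* ratio of consecutive terms: -x^2 / ((2k-1)(2k)) */
        term *= -x * x / ((2.0 * k - 1.0) * (2.0 * k));
    }
    return k;
}
```

Calling `cos_terms_needed(0.5, 1e-12)` needs only a handful of terms, while `cos_terms_needed(50.0, 1e-12)` needs dozens (and the huge intermediate terms destroy accuracy long before then).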

I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.

I would suggest trying to code a few of these functions in as many ways as you can find, and doing some accuracy and benchmark tests. This will give you some practical knowledge and some hard data on whether you actually need to reimplement these functions or not. At the very least it should give you some practical experience with implementing these types of functions, and chances are it will answer a lot of your questions in the process.

uesp
  • the main reason I'm doing this is that this way I get consistent behaviour; in many compilers 1 flag is enough to change the behaviour and the implementation of a math function. It's not just a hardware or platform problem, there is too much entropy, and it's not just about performance: there is no algorithm specified in the standard for each function, so you can't rely on consistent behaviour. I was probably wrong to post this here because the dynamics on this site do not really encourage a debate, but I was just trying to follow some logic rather than benchmarking all the – user2485710 Oct 19 '14 at 05:51
  • possible implementations of `std::sin` that my library/compiler provides, then switching to another library/compiler and repeating the same process, with all possible combinations of flags, to get some results that are about just 1 specific function. My reasoning looks logical to me, but all the users at this point are basically saying the same thing: that I should rely on standard library/compiler-generated stuff . – user2485710 Oct 19 '14 at 05:52
  • For 99% of cases the standard implementations are fine. If you are *really* sure you need it, then re-implementing for your specific needs is fine, just make sure you understand the consequences of doing so. From the information you've supplied I don't quite understand the necessity of doing so. – uesp Oct 19 '14 at 13:21
  • you are coding a 3D rendering engine where for some reason you are using a trigonometric function or any other math function `f(x)`; on x86 `f(x)` gives a certain error and a certain level of precision, while on ARM `f(x)` is computed with a different algorithm and therefore gives different results . It turns out that the result of `f(x)` feeds a formula `F`, in fraction form, that you use to compute ... a material ? some lighting ? That result is part of the denominator, so even a small deviation counts; to get a consistent result for `F`, – user2485710 Oct 19 '14 at 13:33
  • otherwise you end up producing different renders on x86 and ARM, especially in cases where your formula produces inversely proportional values, where the error or the lack of precision propagates really fast. – user2485710 Oct 19 '14 at 13:34
  • This is the nature of the various numerical methods of computing these functions and the choice taken between accuracy and performance. Have you actually measured the required accuracy for your situation? In all the 3D stuff I've done you don't really need much accuracy in the trigonometric functions...6-7 digits computed by a lookup table and linear interpolation is usually more than enough. You usually end up with more issues due to the inherent inaccuracy in the `float` data type. – uesp Oct 19 '14 at 13:48
  • I would encourage you to take an evidence based approach. Create a simple prototype that demonstrates the issue(s) and measure it quantitatively. It could well be that your use case is so far outside the normal that a custom solution like you propose is indeed necessary. Having some hard facts to show this would be good though. – uesp Oct 19 '14 at 13:52
  • Also, ask yourself if you really want to reimplement high-precision, general trig functions. Look at this implementation of `sin()` for example: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/s_sin.c;hb=HEAD#l281 – uesp Oct 19 '14 at 13:59
  • I don't care about precision as much as I care about consistent behaviour across different platforms; my example was about having consistent behaviour, not something that is necessarily highly precise. I can also afford a significant loss in precision, given that I'll deal with 3D/2D operations for imaging and realtime graphics and not scientific computations . – user2485710 Oct 19 '14 at 14:10
  • In my experience with realtime 3D rendering, you should rarely need to call sin or cos in performance critical code. Usually writing your code efficiently they fall out of other vector operations (vector dot and cross product) or are pre-computed and baked into your transformation matrices or shader constants. Visual results should not generally be very sensitive to the precision of trig functions either. Are you sure they're really the root cause of the platform differences you're seeing? – mattnewport Oct 20 '14 at 20:52