
Short version: I'd like to know whether there are implementations of the standard trigonometric functions that are faster than the ones included in math.h.

Long version: I got a program that's quite heavy on numerics (it's a physics simulation) and that needs to call trigonometric functions, mostly sin and cos, a lot. Currently I'm simply using the implementations included in math.h. Profiling shows that the calls to these functions cost more than I was expecting (hoping).

While there is most certainly plenty of room for optimization in other parts of the code, having faster sin and cos might give me some additional percent. So, do you have any suggestions?
In another post the use of self-made lookup tables is suggested. But maybe there are alternatives? Or ready-made and well-tested lookup solutions in some libraries?

janitor048
    Most fast-transcendentals are geared towards game engines, which don't care that much about accuracy. How important is accuracy to your problem? – Marcelo Cantos Apr 25 '11 at 09:46
  • Profile first. "might give some additional percent" is not worth trying to optimize. – pmr Apr 25 '11 at 09:47
  • @pmr: As stated in my question, I AM profiling, and from this my expectation would be "a couple of percent" in runtime - maybe 2% or 3%, but that's a very rough estimate, certainly. But with runtimes on the order of days, any percent I can get might indeed be worth it. – janitor048 Apr 25 '11 at 09:58
  • Lookup tables are kind of 1985. Modern CPUs are much faster at crunching numbers than reading from memory. Unless your lookup table is very small, and you do a lot of sin/cos in a batch so you're guaranteed that the LUT is in level-1 cache, it is not worth it. I've seen minimax polys in SSE that effectively run in 18-20 cycles (pipelining ftw). This is about twice as much as the best case for a LUT, and slightly faster than the average case, especially if you do something other than a synthetic benchmark (but, it does not take away cache lines from other code). – Damon Apr 25 '11 at 09:59
  • @Marcelo: Yes, this would ultimately be the question. I would eventually have to test it, my gut feeling tells me that accuracy to say 4 or 5 digits would be enough in most places.. – janitor048 Apr 25 '11 at 09:59
  • Though, like the previous commenters already hinted, you should first consider if a dozen cycles is a problem. Unless you do several millions of trig function calls per frame, it should not matter on a CPU that isn't 15 years old (and if you do that many, you're likely doing something wrong). – Damon Apr 25 '11 at 10:00
  • When your bottleneck is trig functions, one thing to consider is using trigonometric formulas to reduce the number of calls. If, for instance, you are computing sin(n*x) and cos(n*x) for a bunch of consecutive integers n, it may be worth computing cos x and sin x once and using the recurrences (cos(a+b) = cos a cos b - sin a sin b and sin(a+b) = sin a cos b + cos a sin b) – Alexandre C. Apr 25 '11 at 10:31
  • See http://stackoverflow.com/questions/523531/fast-transcendent-trigonometric-functions-for-java It's for Java but the formulas will work in C++. – lhf Apr 25 '11 at 11:39
  • `math.h` doesn't include any implementation. The implementation is in the library that will be linked to your code. To answer your question you have to tell what target CPU and compiler you are using. – Curd Apr 25 '11 at 14:16
  • I've implemented a fast sine function on the CPU side which is at least **two times faster** than math.h's sine function; however, I used a very small lookup table (20 floats). Its accuracy is also not bad at all: the **average relative error rate is 0.095%**. You can check it out at [http://www.hevi.info/tag/fast-sine-function/](http://www.hevi.info/tag/fast-sine-function/) – hevi Dec 27 '11 at 23:32
  • Did you already check if your algorithm is parallelizable? If you can get it to run on a GPU (via OpenCL for instance), then instead of 2%-3% you might be looking at 90%-95% faster (https://developer.nvidia.com/opencl) – Toad Jan 17 '13 at 16:42

9 Answers


Here are some good slides on how to do power series approximations (NOT Taylor series though) of trig functions: Faster Math Functions.

It's geared towards game programmers, which means accuracy gets sacrificed for performance, but you should be able to add another term or two to the approximations to get some of the accuracy back.

The nice thing about this is that you should also be able to extend it to SIMD easily, so that you could compute the sin or cos of four values at once (two if you're using double precision).

Hope that helps...

celion
  • The presentations provided in your link seem to be very interesting. I'll look into these approximations some more, maybe this might indeed be sufficient for some parts of my code – janitor048 Apr 25 '11 at 15:57
  • I mark this one as accepted answer since there are so many interesting suggestions in the presentation linked at above mentioned URL. But don't miss the other answers.. – janitor048 Apr 26 '11 at 20:55
  • The link is dead, here's the last seen version from the archive: http://web.archive.org/web/20160322120707/http://www.research.scea.com/gdc2003/fast-math-functions.html – Justas Jan 19 '17 at 19:41
  • Blog post from the author with the presentation and further comments: http://basesandframes.wordpress.com/2016/05/17/faster-math-functions – Řrřola Jun 15 '18 at 07:08
  • you can operate on 4 or 8 doubles at once with AVX/AVX-512 – phuclv Aug 04 '19 at 05:04

This should be pretty damn fast. If you can optimize it further, please do, and post the code somewhere like pastie.org.

Computer Specifications -> 512MB Ram , Visual Studio 2010 , Windows XP Professional SP3 Version 2002 , Intel (R) Pentium (R) 4 CPU 2.8GHZ.

This is insanely accurate and will actually provide slightly better results in some situations, e.g. at 90, 180, and 270 degrees, where the C++ functions return a small non-zero decimal.

FULL TABLE OF 0 through 359 Degrees: https://pastee.org/dhwbj

FORMAT -> DEGREE # -> MINE_X(#) , CosX(#) , MINE_Z(#) , SinZ(#).

Below is the code used to construct the table shown above. You could probably make it even more accurate with a larger data type. I used an unsigned short and computed N/64000, so whatever value cos(##) and sin(##) were closest to, I rounded to that index. I also tried to use as little extra data as possible, so this wouldn't be a cluttered table with 720 float values for cos and sin. That would probably give better results, but it would be a waste of memory; the table below is as small as I could make it. I'd like to see whether it's possible to find an equation that rounds to all these short values and use that instead. I'm not sure it would be any faster, but it would eliminate the table completely and probably not cost much, if any, speed.

So the accuracy in comparison to the C++ cos/sin operations is 99.99998% through 100%.

Below is the table used to calculate the cos/sin values.

static const unsigned __int16 DEGREE_LOOKUP_TABLE[91] =
{
    64000, 63990, 63961, 63912, 63844, 63756,
    63649, 63523, 63377, 63212, 63028, 62824,
    62601, 62360, 62099, 61819, 61521, 61204,
    60868, 60513, 60140, 59749, 59340, 58912,
    58467, 58004, 57523, 57024, 56509, 55976,
    55426, 54859, 54275, 53675, 53058, 52426,
    51777, 51113, 50433, 49737, 49027, 48301,
    47561, 46807, 46038, 45255, 44458, 43648,
    42824, 41988, 41138, 40277, 39402, 38516,
    37618, 36709, 35788, 34857, 33915, 32962,
    32000, 31028, 30046, 29055, 28056, 27048,
    26031, 25007, 23975, 22936, 21889, 20836,
    19777, 18712, 17641, 16564, 15483, 14397,
    13306, 12212, 11113, 10012,  8907,  7800,
     6690,  5578,  4464,  3350,  2234,  1117,
        0,
};

Below is the actual code that does the cos/sin calculations.

    // Assumes 0 <= degrees < 90 and quadrant in 0..3
    // (full angle = quadrant * 90 + degrees).
    // Writes the cosine into vectorX and the sine into vectorZ.
    void fast_sincos(float degrees, int quadrant, float &vectorX, float &vectorZ)
    {
        int deg1 = (int)degrees;
        int deg2 = 90 - deg1;
        float module = degrees - deg1;  // fractional part, used to interpolate
        double vX = DEGREE_LOOKUP_TABLE[deg1] * 0.000015625;  // 0.000015625 = 1/64000
        double vZ = DEGREE_LOOKUP_TABLE[deg2] * 0.000015625;
        double mX = DEGREE_LOOKUP_TABLE[deg1 + 1] * 0.000015625;
        double mZ = DEGREE_LOOKUP_TABLE[deg2 - 1] * 0.000015625;
        vectorX = vX + (mX - vX) * module;  // linear interpolation between degrees
        vectorZ = vZ + (mZ - vZ) * module;
        if (quadrant & 1)
        {
            float tmp = vectorX;
            if (quadrant == 1)
            {
                vectorX = -vectorZ;
                vectorZ = tmp;
            } else {
                vectorX = vectorZ;
                vectorZ = -tmp;
            }
        } else if (quadrant == 2) {
            vectorX = -vectorX;
            vectorZ = -vectorZ;
        }
    }

SPEEDS BELOW were measured using the computer specifications mentioned at the top. It was a debug build, but run directly from the executable, i.e. without the debugger attached.

MY METHOD

1,000 Iterations -> 0.004641 MS or 4641 NanoSeconds.
100,000 Iterations -> 4.4328 MS.
100,000,000 Iterations -> 454.079 MS.
1,000,000,000 Iterations -> 4065.19 MS.

COS/SIN METHOD

1,000 Iterations -> 0.581016 MS or 581016 NanoSeconds.
100,000 Iterations -> 25.0049 MS.
100,000,000 Iterations -> 24,731.6 MS.
1,000,000,000 Iterations -> 246,096 MS.

So to summarize the above: performing both cos(###) and sin(###) with my strategy allows roughly 220,000,000 executions per second on the machine specified above. This is fairly quick and uses very little memory, so it's a good substitute for the cos/sin functions normally found in C++. If you'd like to see the accuracy, open the link shown above for a printout of degrees 0 through 359. Note that this supports degrees 0 through 89 and quadrants 0 through 3, so you need to either supply both or perform (DEGREES % 90).

Jeremy Trifilo
  • The reason why `sin(90)` is not 0 in C++ is easy: C++ uses radians, not degrees. – MSalters Jan 18 '13 at 09:20
  • Makes sense I never really thought about it since the value was so minuscule it was basically 0. Although I guess with divide by 180 and multiplying by PI. There's probably very little guarantee you'd ever get the radian value of 90, 180, and 270. – Jeremy Trifilo Jan 18 '13 at 16:52
  • The link to the table of results is non functional. It would be good to know what is the maximum error expressed in ULP units. It may be difficult to calculate it exactly. At least experimental results (but with finer dividing of range 0 - 360) would be helpful. – truthseeker Nov 15 '14 at 16:58
  • One issue which is not sufficiently highlighted is that your method is not doing argument reduction, so the performance comparison with the standard library is not fair. – truthseeker Nov 15 '14 at 17:00
  • An obvious optimization would be to incorporate the `* 0.000015625` into the table. That would eliminate four floating point multiplications per call. – Ray Butterworth Mar 30 '22 at 03:06

If you want to use a custom implementation, look here, here, and here.

Also here (scroll to Universal SIMD-Mathlibrary) if you need to calculate sin/cos for large arrays

You can also try to use the C++ SSE intrinsics. Look here

Note that most modern compilers support SSE and SSE2 optimizations. For Visual Studio 2010, for example, you'll need to manually enable it. Once you do this, a different implementation will be used for most standard math functions.

One more option is to use DirectX HLSL. Look here. Note that there is a nice sincos function that returns both sin and cos.

Usually, I use IPP (which is not free). For details, look here

Lior Kogan
  • Interesting links. Thanks! Unfortunately IPP is not available to me, but I will read some more on the other solutions. – janitor048 Apr 25 '11 at 16:14

Quake 3's source has some code for precomputed sine/cosine aimed at speed over precision; it's not SSE-based and is thus quite portable (both across architectures and intrinsics APIs). You might also find this summary of SSE- and SSE2-based functions very interesting: http://gruntthepeon.free.fr/ssemath/

Necrolis

I've implemented a fast sine function on the CPU side which is at least two times faster than math.h's sine function; however, I used a very small lookup table (20 floats). Its accuracy is also not bad at all: the average relative error rate is 0.095%. You can check it out at http://www.hevi.info/tag/fast-sine-function/

Explanation of the method is quite simple and relies on the fact that for small angles a (in degrees), sin(a) ≈ a * pi / 180 (see the link above for the proof).


Some Trigonometry

Although it is possible to achieve relatively accurate results with the formula shown above for angles between 0 and 10, the accuracy degrades as the angle gets wider. Therefore we should use the formula only for angles less than 10. But how?!

The answer comes from the trigonometric sine addition formula;

sin(a+b) = sin(a) cos(b) + sin(b) cos(a)

If we can keep ‘b’ less than 10, then we will be able to use our formula to find the sine with a couple of arithmetic operations.

Let’s say we are asked the sine value for 71.654, then;

a = 70

b = 1.654

and,

sin(71.654) = sin(70 + 1.654) = sin(70) cos(1.654) + sin(1.654) cos (70)

In this formula we are able to use the fast calculation for the sin(1.654) part; for the rest, unfortunately, we need sine and cosine tables. The good thing is that we only need sines of multiples of ten, and cosines of natural-number angles between 0 and 10.

hevi

A) Trying to save small percents will not be very satisfying. Finishing in 97 instead of 100 hours is still a long time.

B) You say you profiled, and that the trig functions take more time than you would like. How much? and what about all the remaining time? It's quite possible you have bigger fish to fry. Most profilers based on the gprof concepts do not tell you about mid-stack calls that you could focus on to save larger amounts of time. Here's an example.

Mike Dunlavey
  • Definitely, there are bigger fish swimming around in my code. And I'm working on some changes in the structure and algorithms that will hopefully lead to more significant improvement. But while I was out fishing for the big ones, I put some minor issues on my list that might be worth looking into. This is one of them.. BTW I'm using callgrind (valgrind) and AMD CodeAnalyst – janitor048 Apr 25 '11 at 15:55
  • @janitor048: Good. The problem with these tools is too often they focus your attention on small/irrelevant stuff. Whenever I go after performance problems I rely on [this method](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024). It's not a tool. It's a technique, and it's as effective as any. – Mike Dunlavey Apr 25 '11 at 18:54
  • Yeah, I've read that post of yours.. :-) Very interesting argumentation and quite an intuitive method. I thought that the "time based profiling" scheme of AMD's CodeAnalyst (which I use) basically is an automated version of your approach. But I've certainly merely scratched the surface of this (very complex) field.. – janitor048 Apr 25 '11 at 19:12

You can look at this. It talks about optimizing sin, cos.

mAc

A long time ago, on slow machines, people used arrays of precomputed values. Another option is to calculate with the precision you need yourself, like this: (look for "Series definitions")

Yuriy Vikulov

For 2-3% gain, this is almost certainly not worth the risk of inaccuracy, error, assumptions no longer being true (e.g. results never falling outside of [-1,1]), etc., unless you are planning on running this on a huge number of machines (where 2-3% represents thousands or millions of dollars in electricity and amortized cost of the machine).

That said, if you have domain-specific knowledge about what you are trying to accomplish, you may be able to speed up your computations by a factor of two or more. For example, if you always need sin and cos of the same value, calculate them close to each other in the code and make sure that your compiler translates them into a FSINCOS assembly instruction (see this question). If you need only a small portion of the full range of the function, you can potentially use a set of low-order polynomials followed by an iteration of Newton's method to get full machine precision (or as much as you need). Again, this is much more powerful if you know that you only need some values--e.g. if you can use that sin(x) is close to x near zero, and you will only be needing values near zero, then you can dramatically decrease the number of terms you need.

But, again, my primary advice is: 2-3% is not worth it. Think harder about the algorithms used and other potential bottlenecks (e.g. is malloc eating too much time?) before you optimize this.

Rex Kerr
  • No, it won't be millions of dollars :-) But the code runs on some university computing clusters. And the faster it is, the better slots it gets.. And of course you're right. I won't focus on this issue, there are more severe bottlenecks - this sin/cos business was rather a minor issue I put on my "might be worth looking into" list and I've wanted to get some ideas whether there is potential for improvement. And there are some interesting suggestions made here.. – janitor048 Apr 25 '11 at 16:02