
I have a scientific code that uses both the sine and the cosine of the same argument (essentially, I need the complex exponential of that argument). I was wondering whether it is possible to compute this faster than calling the sine and cosine functions separately.

Also, I only need about 0.1% precision. Is there any way to take the standard trig functions and truncate their power series for speed?

One other thing I have in mind: is there any way to perform the remainder operation so that the result is always positive? In my own algorithm I used `x = fmod(x, 2*pi);`, but then I need to add 2*pi whenever x is negative (a smaller domain means I can use a shorter power series).
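
(As the comments below point out, a floor-based reduction gives an always-non-negative remainder; a minimal sketch, with the helper name being illustrative:)

double positive_mod_2pi(double x) {
    const double two_pi = 6.28318530717958647692;
    // floor rounds toward negative infinity, so the result lands in [0, 2*pi)
    return x - floor(x / two_pi) * two_pi;
}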

EDIT: A LUT turned out to be the best approach for this; still, I am glad I learned about the other approximation techniques. I would also advise sampling the table at bin midpoints. This is what I ended up doing:

const int N = 10000; // max error ~3e-4 for N = 1000, ~3e-5 for 10 000, ~3e-6 for 100 000
const double pi = 3.14159265358979323846;
double *cs = new double[N];
double *sn = new double[N];
for (int i = 0; i < N; i++) {
    double A = (i + 0.5)*2*pi/N; // sample each bin at its midpoint
    cs[i] = cos(A);
    sn[i] = sin(A);
}

The following part approximates `sincos(2*pi*(wc2+t[j]*(cotp*t[j]-wc)))` using the midpoint table:

double A = wc2 + t[j]*(cotp*t[j] - wc); // phase in turns; only the fractional part matters
int B = (int)(N*(A - floor(A)));        // fractional part of A mapped to a table index in [0, N)
re += cs[B]*f[j];
im += sn[B]*f[j];

Another approach could have been a Chebyshev decomposition: you can use the orthogonality property to find the coefficients. Optimized for the exponential, it looks like this:

double fastsin(double x){
    // Reduce to [-pi, pi). This line can be improved, both inside this
    // function and before the argument is passed in.
    x = x - floor(x/(2*pi))*2*pi - pi;

    // 7th-order Chebyshev approximation of -sin on [-pi, pi); the shift
    // by pi above flips the sign, so the result is sin(x).
    double x2 = x*x;
    return (((0.00015025063885163012*x2
            - 0.008034350857376128)*x2
            + 0.1659789684145034)*x2
            - 0.9995812174943602)*x;
}
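
Cosine can come from the same routine by shifting the argument (a minimal sketch on top of the function above; the name fastcos is mine):

// cos(x) = sin(x + pi/2); fastsin's own range reduction absorbs the shift
double fastcos(double x){ return fastsin(x + pi/2); }
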
Gappy Hilmore
  • `cos^2(x) + sin^2(x) = 1`. Pythagorean identity. – Jashaszun Aug 04 '15 at 16:06
  • I would need to call sqrt, and then I would still need to examine the argument to see if the result should be positive or negative. I am not sure that would be much faster. – Gappy Hilmore Aug 04 '15 at 16:07
  • Many libraries, such as Intel's MKL, implement a function called something like `sincos` which does what you want, but at the same kind of precision as you'd expect from a call to `sin` or `cos`, so it may not meet your requirements. I think this provides programming-language-level access to a member of the x86 instruction set. – High Performance Mark Aug 04 '15 at 16:07
  • @HighPerformanceMark Well, how do they do it? I can implement my own power series for sufficient precision. The only thing I'd need after that is to learn how to handle the domain as well as possible. Technically I only need to know the values in (0, pi/2), but rem(x, 2*pi) gives me something in (-2pi, 2pi), and perhaps having a longer power series is better than mapping (-2pi, 2pi) into (0, pi/2). Sine is such an elementary function, I'm sure people know exactly how this is done. – Gappy Hilmore Aug 04 '15 at 16:11
  • `0.1% precision` Build a lookup table. – n. m. could be an AI Aug 04 '15 at 16:11
  • @n.m. The arguments will have double precision though. How do I best map the arguments? Using linear interpolation would probably take longer than an ideal solution. – Gappy Hilmore Aug 04 '15 at 16:13
  • @n.m. Also, where can I learn how to create a lookup table? It is going to have thousands of entries, so I am guessing copy-pasting them at the beginning of the source isn't the way to go? – Gappy Hilmore Aug 04 '15 at 16:16
  • Compute the entries at the beginning of the program. Then, assuming you have 1000 entries, scale the argument to `[0,1000)` and use it as the index. No interpolation necessary. – n. m. could be an AI Aug 04 '15 at 16:19
  • @n.m. Do you know if there is any way to perform the remainder operation so that it will only give positive output? Is this merely what the programmers chose to do, or is it something inherent in how CPUs do this computation? – Gappy Hilmore Aug 04 '15 at 16:22
  • @grdgfgr You can do the remainder thing with `x -= floor(x/(2*pi)) * 2*pi` (or something close to this, math it out on a piece of paper). Not really a programming problem but rather a simple calculus exercise. – Baum mit Augen Aug 04 '15 at 16:28
  • +1 for lookup table. Might help: [compile-time lookup-table/array with constexpr](http://cplusadd.blogspot.fr/2013/02/c11-compile-time-lookup-tablearray-with.html) – Ivan Aksamentov - Drop Aug 04 '15 at 16:29
  • @BaummitAugen I can't believe I couldn't think of that. – Gappy Hilmore Aug 04 '15 at 16:33
  • There are several similar questions, though with higher precision: [What is the fastest way to compute sin and cos together?](http://stackoverflow.com/q/2683588/995714), [c++ libstd compute sin and cos simultaneously](http://stackoverflow.com/q/24328173/995714). Since you don't need much precision, a simple table lookup or approximation is enough: http://stackoverflow.com/q/18662261/995714 http://stackoverflow.com/q/2088194/995714 – phuclv Aug 04 '15 at 17:04

5 Answers


If you seek fast evaluation with good (but not high) accuracy from a power series, you should use an expansion in Chebyshev polynomials: tabulate the coefficients (you'll need VERY few for 0.1% accuracy) and evaluate the expansion with the recursion relations for these polynomials (it's really very easy).
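
A minimal sketch of that evaluation (Clenshaw's recurrence; the function name and signature are mine, not from the references):

// Evaluates sum_{k=0}^{n-1} c[k]*T_k(t) for t in [-1, 1] via the
// recursion b_k = c[k] + 2*t*b_{k+1} - b_{k+2}, never forming T_k itself.
double chebyshev_eval(const double* c, int n, double t) {
    double b1 = 0.0, b2 = 0.0;
    for (int k = n - 1; k >= 1; --k) {
        double b0 = c[k] + 2.0*t*b1 - b2;
        b2 = b1;
        b1 = b0;
    }
    return c[0] + t*b1 - b2;
}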

References:

  1. Tabulated coefficients: http://www.ams.org/mcom/1980-34-149/S0025-5718-1980-0551302-5/S0025-5718-1980-0551302-5.pdf
  2. Evaluation of Chebyshev expansions: https://en.wikipedia.org/wiki/Chebyshev_polynomials

You'll need to (a) get the "reduced" argument in the range -pi/2..+pi/2 and then (b) handle the sign in your results when the argument actually should have been in the "other" half of the full elementary interval -pi..+pi. These aspects should not pose a major problem:

  1. Determine (and "remember" as an integer 1 or -1) the sign of the original angle and proceed with its absolute value.
  2. Use a modulo function to reduce to the interval 0..2*pi.
  3. Determine (and "remember" as an integer 1 or -1) whether the result is in the "second" half and, if so, subtract pi*3/2; otherwise subtract pi/2. Note: this effectively interchanges sine and cosine (apart from signs); take this into account in the final evaluation.

This completes the reduction to an angle in -pi/2..+pi/2. After evaluating sine and cosine with the Chebyshev expansions, apply the "flags" of steps 1 and 3 above to get the right signs in the values.
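
A sketch of those three steps in code (all names are mine; std::sin and std::cos stand in for the Chebyshev expansions):

#include <cmath>

void sincos_reduced(double x, double& s, double& c) {
    const double pi = 3.14159265358979323846;
    int sign = 1;                            // step 1: sign of the original angle
    if (x < 0) { sign = -1; x = -x; }
    x = std::fmod(x, 2*pi);                  // step 2: reduce to [0, 2*pi)
    int half = 1;                            // step 3: which half of [0, 2*pi)?
    if (x >= pi) { half = -1; x -= 1.5*pi; } else { x -= 0.5*pi; }
    // Shifting by an odd multiple of pi/2 interchanges sine and cosine:
    //   x was in [0, pi):    sin = cos(y),  cos = -sin(y)
    //   x was in [pi, 2*pi): sin = -cos(y), cos =  sin(y)
    double sy = std::sin(x), cy = std::cos(x);  // <- Chebyshev expansions go here
    s = sign * half * cy;                    // sin(-x) = -sin(x)
    c = -half * sy;                          // cos(-x) =  cos(x)
}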

Bert te Velde
  • I am so glad I learned about this. I was too lazy to read the pdf, but I am guessing they used the orthogonality property to decompose into ChebyshevU's, which is what I did: https://i.imgur.com/eSNoMYO.png – Gappy Hilmore Aug 04 '15 at 20:43
  • Also, what sort of functions are approximated well by Chebyshev polynomials? They (for sincos) seem to be quite a bit better than Padé approximants. Or does the pdf talk about it? – Gappy Hilmore Aug 04 '15 at 20:50
  • Chebyshev polynomials are suitable for functions that are not "too wild". This, of course, is loose language. Think: having asymptotes, discontinuities and the like and you'll get the idea. Furthermore: the approximation should apply on a finite interval. – Bert te Velde Aug 05 '15 at 08:21

Just create a lookup table. The following will let you look up the sine and cosine of any radian value between -2PI and 2PI.

// LOOK UP TABLE
var LUT_SIN_COS = [];
var N = 14400;
var HALF_N = N >> 1;
var STEP = 4 * Math.PI / N;
var INV_STEP = 1 / STEP;
// BUILD LUT
for(var i=0, r = -2*Math.PI; i < N; i++, r += STEP) {
    LUT_SIN_COS[2*i] = Math.sin(r);
    LUT_SIN_COS[2*i + 1] = Math.cos(r);
}

You index into the lookup table by:

var index = ((r * INV_STEP) + HALF_N) << 1; // truncate to int, then double it: entries are interleaved
var sin = LUT_SIN_COS[index];
var cos = LUT_SIN_COS[index + 1];

Here's a fiddle that displays the % error you can expect from different-sized LUTs: http://jsfiddle.net/77h6tvhj/

EDIT: Here's an ideone (C++) with a ~benchmark~ vs. the float sin and cos: http://ideone.com/SGrFVG For whatever a benchmark on ideone.com is worth, the LUT is 5 times faster.
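
For reference, a C++ rendering of the same table might look like this (a sketch; all names are mine):

#include <cmath>
#include <vector>

// Interleaved sin/cos table over [-2*pi, 2*pi), mirroring the JavaScript above.
struct SinCosLut {
    static const int N = 14400;
    std::vector<double> t;                   // t[2i] = sin, t[2i+1] = cos
    double inv_step;
    SinCosLut() : t(2 * N) {
        const double pi = 3.14159265358979323846;
        const double step = 4 * pi / N;
        inv_step = 1.0 / step;
        for (int i = 0; i < N; ++i) {
            double r = -2 * pi + i * step;
            t[2*i]     = std::sin(r);
            t[2*i + 1] = std::cos(r);
        }
    }
    // r must be in [-2*pi, 2*pi); the two entries share a cache line.
    void lookup(double r, double& s, double& c) const {
        int i = (int)(r * inv_step + N / 2.0);
        s = t[2*i];
        c = t[2*i + 1];
    }
};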

Louis Ricci
  • IIRC, for a fixed error, linear interpolation allows the use of a _much_ smaller LUT. sin and cos are very well-behaved functions. In particular, if you can fit the smaller LUT in L1 cache instead of L2, it's a lot faster. Sure, you need two LUT entries, but they're adjacent and therefore usually on the same cache line. – MSalters Aug 05 '15 at 11:19
  • @MSalters - The difference between L1 and L2 cache hits on a modern processor is about 2.5x (4 cycles vs 10 cycles); I would think the logic to perform the interpolation would make up the difference. Also, if you were really constraining the LUT to say -pi/2..pi/2 then you'd also have the extra overhead of extrapolating the final sign. But if you have code on hand feel free to paste it into the ideone I linked. – Louis Ricci Aug 05 '15 at 12:52
  • Did some micro benchmark. 0.43 seconds sinf + cosf, 0.24 seconds lookup table (257 values from -PI/2 to +PI/2 for both sine and cosine, linear interpolation), 0.16 seconds XMScalarSinCos (the time is for 10M calculations). XMScalarSinCos uses high-degree polynomial approximation. As you see, on modern hardware floating-point multiplications and additions are faster than even L1 cache lookups. – Soonts Jul 25 '16 at 15:43
  • @Soonts - SIMD (SSE instruction sets) are very impressive. And while I can't deny that in the general case the XM* (or any other suitable SSE math libs) functions are faster than a LUT, the bottleneck is *not* the L1 or L2 cache lookup. On a modern CPU an L1 cache read is ~4 cycles (L2 ~10 cycles); the XM* function is roughly ~30 cycles. I'd wager the bottleneck in the LUT code is the overhead of massaging the fp value into an int index (this code probably uses the slower x87 fp instructions depending on compiler optimization settings), and possible misalignment of the memory that makes up the LUT. – Louis Ricci Jul 26 '16 at 12:12
  • @LouisRicci XMScalarSinCos ain’t SSE intrinsic, it’s regular inline function that compiles to x87 instructions. For example, here’s main formula for sine from it: ( ( ( ( (-2.3889859e-08f * y2 + 2.7525562e-06f) * y2 - 0.00019840874f ) * y2 + 0.0083333310f ) * y2 - 0.16666667f ) * y2 + 1.0f ) * y. – Soonts Jul 26 '16 at 12:49
  • @Soonts - According to the documentation the DirectX math functions will use SSE/SSE2 intrinsics unless you specifically direct the library not to: ""By default, _XM_SSE_INTRINSICS_ is defined when users compile for a Windows platform"" https://msdn.microsoft.com/en-us/library/windows/desktop/ee415579(v=vs.85).aspx Without seeing the ASM output I'd guess the XMScalarSinCos dupes the value into the XMM register and performs the formula you pasted above for sine and cosine at the same time using the packed values, then unpacks them into the output references. – Louis Ricci Jul 27 '16 at 11:41
  • @LouisRicci Just because a lot of things in DirectX use SSE, doesn’t mean this particular function does. Here’s the complete source code for you: http://const.me/tmp/XMScalarSinCos.png – Soonts Jul 27 '16 at 12:43
  • @Soonts - XMScalarSinCos does indeed use SSE/SSEx instructions. I think your confusion stems from stepping into the code while in a DEBUG build (this will show the C++ code from the INL file in the image you shared). However, in an optimized RELEASE build when you step into the function (and view the ASM) you'll see the SSE/SSEx instructions: http://pasted.co/9be8bb49 – Louis Ricci Jul 27 '16 at 18:57
  • @LouisRicci You see the result of automatic vectorization. It's not controlled by the macro you linked, only by the /arch compiler switch and the corresponding project property. And it applies equally to XMScalarSinCos and to the lookup-table code: http://pasted.co/8e1dbc3e When disabled with /arch:IA32, the lookup table becomes even slower than sinf+cosf from the CRT: stdlib 0.41 seconds, XMScalarSinCos 0.30 seconds, lookup 0.44 seconds. – Soonts Jul 27 '16 at 20:14
  • @Soonts - Good call. My RELEASE build optimized for speed and used SSE inplace of slower x87 instructions (to that end). But looking at the asm op-code output that was generated it really is a naive 1-to-1 translation/vectorization. So it seems to really be based on the source code from the INL file that contains (as you said) no intrinsics/macros. – Louis Ricci Jul 28 '16 at 19:02
  • Question is tagged C++, but that is definitely not C++. – user16217248 Mar 07 '23 at 20:47

One way to go would be to learn how to implement the CORDIC algorithm. It is not difficult and pretty interesting intellectually. It gives you both the cosine and the sine. Wikipedia gives a MATLAB example that should be easy to adapt to C++.

Note that you can increase speed (and reduce precision) simply by lowering the iteration count n.
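
For illustration, a bare-bones double-precision CORDIC in rotation mode might look like this (a sketch assuming the angle is already reduced to [-pi/2, pi/2]; a real implementation precomputes the atan table and the gain):

#include <cmath>

// Each of the n micro-rotations turns (x, y) by ±atan(2^-i), driving z to 0.
void cordic_sincos(double angle, int n, double& s, double& c) {
    double x = 1.0, y = 0.0, z = angle;
    double k = 1.0;                  // accumulated gain, product of sqrt(1 + 2^-2i)
    double p = 1.0;                  // 2^-i
    for (int i = 0; i < n; ++i) {
        double d = (z >= 0) ? 1.0 : -1.0;
        double nx = x - d * y * p;
        double ny = y + d * x * p;
        x = nx; y = ny;
        z -= d * std::atan(p);
        k *= std::sqrt(1.0 + p * p);
        p *= 0.5;
    }
    c = x / k;
    s = y / k;
}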


About your second question, it has already been asked here (in C). It seems that there is no simple way.

styko
  • Looks promising. I would still be interested in learning (if it is possible) how to do a float remainder operation that gives only positive values. – Gappy Hilmore Aug 04 '15 at 16:24
  • I'm not sure I understand what you mean. – styko Aug 04 '15 at 16:26
  • When I tried to implement a power series, I wanted to use sine's periodicity of 2pi. But the fmod/remainder operation will give -pi when the input is fmod(-3pi, 2pi). This means the power series I tried to implement has to be valid on (-2pi, 2pi), or I have to use an extra if condition to turn -pi into +pi. I was wondering if I could make fmod always output a positive remainder. – Gappy Hilmore Aug 04 '15 at 16:31
  • Answer above. Will you have problems converting the MATLAB code into C++? – styko Aug 04 '15 at 16:55
  • This is a link-only answer, effectively, as the answer is worthless without the link. – Puppy Aug 04 '15 at 18:59
  • @grdgfgr: The sign isn't that important really, with sin(-x)==-sin(x) and cos(-x)==cos(x). You don't need an if condition, just [signum](http://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c) and abs. – MSalters Aug 05 '15 at 11:23

You can also calculate sine using a square root, given the angle and the cosine.

The example below assumes the angle ranges from 0 to 2π:

double c = cos(angle);
double s = sqrt(1.0 - c*c);  // |sin(angle)| via the Pythagorean identity
if (angle > pi) s = -s;      // sine is negative on (pi, 2*pi)
Peter O.

For single-precision floats, Microsoft uses an 11-degree polynomial approximation for sine and a 10-degree one for cosine: XMScalarSinCos. They also have a faster version, XMScalarSinCosEst, that uses lower-degree polynomials.

If you aren't on Windows, you'll find the same code and coefficients on geometrictools.com under the Boost license.
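
For illustration, the sine polynomial quoted in the comments under the lookup-table answer above, written out as a Horner evaluation (a sketch; it assumes the argument was already range-reduced the way XMScalarSinCos does it):

// 11-degree minimax polynomial for sin(y); y must be pre-reduced.
inline float sin_poly(float y) {
    float y2 = y * y;
    return ( ( ( ( (-2.3889859e-08f * y2 + 2.7525562e-06f) * y2
                 - 0.00019840874f ) * y2 + 0.0083333310f ) * y2
             - 0.16666667f ) * y2 + 1.0f ) * y;
}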

Soonts