12

Since the fsin instruction for computing sin(x) on x86 dates back to the Pentium era, and apparently it doesn't even use SSE registers, I was wondering if there is a newer and better set of instructions for computing trigonometric functions.

I'm used to coding in C++ and doing some asm optimizations, so anything along the pipeline from C++, to C, to asm will do for me.

Thanks.


I'm under 64-bit Linux for now, with gcc and clang (even though clang doesn't really offer any FPU-related optimization, AFAIK).

EDIT

  • I have already implemented a sin function; it's usually 2 times faster than std::sin, even with SSE on.
  • My function is never slower than fsin, even though fsin is usually more accurate; but considering that fsin never outperforms my sin implementation, I'll keep my sin for now. Also, my sin is totally portable, whereas fsin is x86-only.
  • I need this for real-time computation, so I'll trade precision for speed; I think I'll be fine with 4-5 decimals of precision.
  • No to a table-based approach: I'm not using one, it screws up the cache and makes everything slower. No algorithms based on memory access or lookup tables, please.
  • 1
    This might prove useful: ["Fast Trigonometric Functions Using Intel's SSE2 Instructions"](http://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf) – Alex Reinking May 23 '14 at 20:36
  • @AlexReinking thanks but that doc looks like a recap of several options plus half page of code that I don't think will be useful, at least in my case. – user2485710 May 23 '14 at 20:42
  • 1
    Can you be more specific about why you think SSE2 won't help your case? – Mark Ransom May 23 '14 at 20:46
  • What's wrong with the sin function provided with your C library? (usually implemented in software using SSE) – Marc Glisse May 23 '14 at 20:47
  • 1
    What is your case then? Can you elaborate in your question? From my understanding from that paper: they express cosine as a function of primitive arithmetic operations, so that it can be then vectorized using SSE, allowing you to compute 4 cosines at the same time. – CygnusX1 May 23 '14 at 20:47
  • @MarkRansom I can't really see any fruitful expansion inside that document on how SSE2 will work out. Plus, starting from `cos` sounds weird; usually I prefer to start from `sin`. – user2485710 May 23 '14 at 20:47
  • @MarcGlisse I need something faster for realtime computation, I care about 4-5 decimals of precision, I can accept everything else if the algorithm is fast enough. – user2485710 May 23 '14 at 20:50
  • @CygnusX1 CORDIC is basically the same ... I can't see what's new. – user2485710 May 23 '14 at 20:51
  • 3
    @user2485710 You need to state those goals in your question: you want better speed at the expense of precision, or no one will be able to help... – Marc Glisse May 23 '14 at 20:55
  • @MarcGlisse see my edit, I'll try to keep all the new replies to the comments in one place. – user2485710 May 23 '14 at 20:58
  • `double sin(double x) { return 0.5; }`; pretty imprecise in many cases, but blazingly fast. – Oliver Charlesworth May 23 '14 at 21:33
  • 1
    @OliCharlesworth I like your PRNG approach but I don't think that it would be accurate to the 4-5th decimal as requested – user2485710 May 23 '14 at 21:37
  • The FPU has been available for a very long time. The 386 or 486 started integrating it on the same chip, but before that you had to purchase a separate coprocessor: the 8087. http://en.wikipedia.org/wiki/Intel_8087 – Alexis Wilke May 24 '14 at 06:02
  • 1
    1) Do you want 4-5 decimals of _absolute_ precision or 4-5 decimals of _relative_ precision? 2) How flexible is the input? Can it be a scaled `int`? 3) Can the output be a scaled `int`? Is the input range -1 to 1 radian or something else? – chux - Reinstate Monica May 28 '14 at 21:46
  • @chux 1) I don't know what you mean by that, I need the first 4-5 decimals to be reliable 2/3) yes, I was thinking about using unsigned integers for this; anyway, I think I can translate a complete 360° angle into a set of integers without losing too much precision. – user2485710 May 28 '14 at 21:53
  • 1
    _Absolute_ precision means the result should be within +/- 0.00001 of the mathematically correct answer. The sine of .01 radian would be 0.01000 and the sine of 1 radians would be 0.84147. _Relative_ precision means the result should be within +/- 0.00001 times the mathematically correct answer. The sine of .01 radian would be 0.0099998 and the sine of 1 radians would be 0.84147. Since you are considering fixed point or using integers, sounds like you want _Absolute_ precision. – chux - Reinstate Monica May 29 '14 at 15:25
  • @chux yes, I think that absolute precision is what I want; I can probably even go for 3-4 decimals of precision, I was trying to stay "safe" because I don't want to cause a propagation of the error when I multiply or divide the result of a trigonometric function by some quantity. I'm waiting for your input on the algorithm. – user2485710 May 29 '14 at 16:17
  • @Pascal Cuoq's answer is a good one. For a quick algorithm, I would use MS Excel and make a list of all (x,y) pairs of whatever function you want, like (angle, sine(angle)). Chart the data and add a trend line. Trend lines offer options like a "best" 2nd-, 3rd-, ... up to 6th-degree polynomial fit. Have it display the trend-line equation using scientific notation and use those values to quickly calculate the function approximation. Sorry, I do not have time for much more detail. – chux - Reinstate Monica May 29 '14 at 16:57

2 Answers

14

If you need an approximation of sine optimized for absolute accuracy over -π … π, use:

x * (1 + x * x * (-0.1661251158026961831813227851437597220432 + x * x * (8.03943560729777481878247432892823524338e-3 + x * x * -1.4941402004593877749503989396238510717e-4)))

It can be implemented with:

float xx = x * x;
float s = x + (x * xx) * (-0.16612511580269618f + xx * (8.0394356072977748e-3f + xx * -1.49414020045938777495e-4f));

And perhaps optimized depending on the characteristics of your target architecture. Also, not noted in the linked blog post (http://lolengine.net/blog/2011/12/21/better-function-approximations): if you are implementing this in assembly, do use the FMADD instruction. If implementing in C or C++ and you use, say, the fmaf() C99 standard function, make sure that an FMADD instruction is actually generated. The emulated version is much more expensive than a multiplication and an addition, because what fmaf() does is not exactly equivalent to a multiplication followed by an addition (so it would be incorrect to implement it that way).
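
As a sketch only (assuming a C99 compiler on x86-64 with FMA support, e.g. gcc or clang with -mfma or -march=haswell; the wrapper name fast_sin is mine, not part of the answer), the polynomial above could be evaluated entirely with fused multiply-adds like this:

#include <math.h>   /* fmaf() */

/* Degree-7 odd polynomial approximation of sin(x) for x in [-pi, pi],
 * using the coefficients above, evaluated in Horner form with fmaf().
 * Without hardware FMA, fmaf() falls back to a much slower emulation. */
static float fast_sin(float x)
{
    float xx = x * x;
    float p = fmaf(xx, -1.49414020045938777495e-4f, 8.0394356072977748e-3f);
    p = fmaf(xx, p, -0.16612511580269618f);
    return fmaf(x * xx, p, x);
}

With gcc or clang you can inspect the generated assembly (e.g. gcc -O2 -mfma -S) to confirm that FMADD instructions are emitted rather than calls to the fmaf library function.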

The difference between sin(x) and the above polynomial over -π … π graphs as follows:

[Graph: absolute error of the polynomial versus sin(x) over -π … π]

The polynomial is optimized to reduce the difference between it and sin(x) between -π and π, not just something that someone thought was a good idea.

If you only need the [-1 … 1] definition interval, then the polynomial can be made more accurate on that interval by ignoring the rest. Running the optimization algorithm again for this definition interval produces:

x * (1 + x * x * (-1.666659904470566774477504230733785739156e-1 + x * x * (8.329797530524482484880881032235130379746e-3 + x * x * -1.928379009208489415662312713847811393721e-4)))
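
For reference, the corresponding float implementation, following the same scheme as the earlier snippet (a direct transcription of the coefficients above, nothing more), would be:

float xx = x * x;
float s = x + (x * xx) * (-1.666659904470566774477504230733785739156e-1f + xx * (8.329797530524482484880881032235130379746e-3f + xx * -1.928379009208489415662312713847811393721e-4f));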

The absolute error graph:

[Graph: absolute error of the polynomial versus sin(x) over -1 … 1]

If that is too accurate for you, it is possible to optimize a polynomial of lower degree for the same objective. Then the absolute error will be larger but you will save a multiplication or two.

Pascal Cuoq
  • I can't follow your reasoning. What algorithm did you pick to derive the first and the other formulas? Remember that I need to do this for all the other functions, so I need an algorithm. – user2485710 May 24 '14 at 07:32
  • 3
    @user2485710 Well, your question is about sin, so I answered about sin. Anyway, the method used is the Remez algorithm, and what it provides is explained very clearly in a link my answer already provides: http://lolengine.net/blog/2011/12/21/better-function-approximations . Understanding how it works is not necessary in order to use it (I don't). – Pascal Cuoq May 24 '14 at 08:01
  • 2
    @user2485710 What **is** necessary to understand is the principles of polynomial approximation (otherwise you end up trying to approximate sin with a polynomial of the form aX^2 + bX, and you have to call `abs()` everywhere and it is ridiculous, as in “Nick's version” from Xavier Holt's answer). You also need basic facts about floating-point so that you know that pinning the coefficient of X to 1 is beneficial. I used LolRemez, available from the link I already provided, but it is complicated to use it right, because of all the above. – Pascal Cuoq May 24 '14 at 08:06
  • 1) what kind of polynomial approximation are you talking about? 2) that lolremez library is a horrible piece of code, I don't think it will pass any kind of code review 3) I still don't have any reference to any algorithm other than Remez, but Remez still needs an algorithm to be translated into something that can be computed. You keep saying this thing about polynomials, but you are not giving any specific reference; I get that those are polynomials, I get that my targets are transcendental functions, the problem is how to compute the latter with the former. – user2485710 May 24 '14 at 08:49
  • 2
    @user2485710 1) http://en.wikipedia.org/wiki/Approximation_theory . There are books about this. I am not going to write you a book. 2) If you don't like it, don't use it. There are implementations available in tools like Maple, but I do not have access to these tools nor any indication that their implementations are cleaner. You understand that this code is not packaged in the end product, right? 3) I gave you the links to everything I used, but if you reject the very tools I used as too “horrible” for you, there is nothing I can do to help you further. – Pascal Cuoq May 24 '14 at 08:55
  • I asked you for an algorithm, your reply was that I should have noticed some kind of _polynomial approximation_, and now you are suggesting that I should study an entire branch of math to do this. If you are not willing to help, just don't start anything; it's not me not accepting that piece of code, the point is that from your answer I can't extract any name, reference, or algorithm for this, and you are copy & pasting code from other sources while adding nothing as far as the theory behind this is concerned. If a car breaks down, I can't suggest "study mechanical engineering" if I really want to help. – user2485710 May 24 '14 at 09:10
  • 2
    “from your answer I can't extract any name, reference, or algorithm for this” **Remez algorithm**. I gave you the name and a link to an open-source implementation. “you are copy & pasting code from other sources” Actually, I am running the algorithm for you because it is complicated to use and your question is about **A faster but less accurate fsin**. You are welcome. LolRemez comes with a tutorial, http://lolengine.net/wiki/doc/maths/remez , but you already rejected that implementation as “horrible” and I don't know of any other free implementation or tutorial. – Pascal Cuoq May 24 '14 at 09:33
4

If you're okay with an approximation (I'm assuming you are, if you're trying to beat hardware), you should take a look at Nick's sin implementation at DevMaster:

http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine

He has two versions: a "fast & sloppy" method and a "slow & accurate" method. A couple replies down someone estimates the relative errors as 12% and 0.2% respectively. I've done an implementation myself, and find runtimes of 1/14 and 1/8 the hardware times on my machine.

Hope that helps!

PS: If you do this yourself, you can refactor the slow/accurate method to avoid a multiplication and improve slightly over Nick's version, but I don't remember exactly how...
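
Since the DevMaster link is dead (see the last comment below), here is a sketch of the parabola-based idea the post was built around, reconstructed from the commonly circulated version of the snippet rather than from the original page; the constants 4/π, -4/π² and the 0.225 blending weight come from that reconstruction, so verify them against your own accuracy needs:

#include <math.h>   /* fabsf() */

/* "Fast & sloppy": a single parabola matching sin at -pi, 0 and pi.
 * Valid for x in [-pi, pi]. */
float sin_fast(float x)
{
    const float B =  4.0f / 3.14159265358979f;                        /*  4/pi   */
    const float C = -4.0f / (3.14159265358979f * 3.14159265358979f);  /* -4/pi^2 */
    return B * x + C * x * fabsf(x);
}

/* "Slow & accurate": blend the parabola with its square to reduce the
 * maximum error (0.225 is the commonly quoted weight). */
float sin_accurate(float x)
{
    const float P = 0.225f;
    float y = sin_fast(x);
    return P * (y * fabsf(y) - y) + y;
}

Note that this evaluation (the fabsf() calls, the unpinned coefficients) is exactly what Pascal Cuoq criticizes in the comments above and below, so treat it as an illustration of the approach rather than a recommendation.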

Xavier Holt
  • well that's a long read, I'm reading it but for now I think I'll need some time to process that and the related options. But it looks like those people are more or less game developer and they are pretty happy with it . – user2485710 May 23 '14 at 21:12
  • 1
    “you can refactor the slow/accurate method to avoid a multiplication and improve slightly over Nick's version” When Horner form is an improvement over one's polynomial evaluation scheme, one should avoid making bold claims about so-called “fast and accurate” implementation. The title of this blog post should be “fast and inaccurate sine”, since that's what both versions are. – Pascal Cuoq May 23 '14 at 21:34
  • @PascalCuoq all the approximations are by definition less accurate, plus in the computing world I don't know how things can possibly be different. – user2485710 May 23 '14 at 21:38
  • 2
    @user2485710 The title says “Fast and accurate sine/cosine”. It does not say “approximation”. Any function that returns an IEEE 754 number can be assumed to have its accuracy limited by that format. A function can be considered accurate when it produces results to within 1 ULP of the real result. What the post describes, again, is an inaccurate and fast sine function (from someone who has never heard of Horner's scheme). – Pascal Cuoq May 23 '14 at 21:45
  • @PascalCuoq so you are suggesting that with some fixed point computation we can do better ? There are better algorithms ? – user2485710 May 23 '14 at 21:47
  • @user2485710 Pascal is complaining about the terminology, not the algorithm. – Mark Ransom May 23 '14 at 21:55
  • 1
    @user2485710 Augment your question with the definition interval that you need and I'll show you a function that is likely better for absolute accuracy than the one you already wrote (I assume you are interested in absolute precision. You should also make that clearer) – Pascal Cuoq May 23 '14 at 21:55
  • 1
    @user2485710 - Pascal's saying that the post should have had the word "Approximation" in the title, as the method is _not_ accurate in the typical floating point sense. [Horner's Method](https://en.wikipedia.org/wiki/Horner%27s_method) is (among other things) a way to find computationally efficient forms of polynomials. Applying it here gave me a slight (five percent?) speedup over the original code. Minor, but certainly worthwhile if you're going for speed. – Xavier Holt May 23 '14 at 21:56
  • @PascalCuoq I don't think that I'll need anything more than the 360°/2π angle, so I think that my interval should be [-1, 1] on the x axis, or even just the first half of it if the function can be easily "flipped" over. – user2485710 May 23 '14 at 21:58
  • http://stackoverflow.com/questions/23838410/interrupts-instruction-pointer-and-instruction-queue-in-8086 – Patt Mehta May 24 '14 at 04:06
  • Too bad that link is broken. I'd have liked to try it out. – mireazma Mar 14 '21 at 16:06