C: Improving performance of function with heavy sin() usage

Question

I have a C function that computes the values of 4 sines based on time elapsed. Using gprof, I figured that this function uses 100% (100.7% to be exact lol) of the CPU time.

void
update_sines(void)
{
    clock_gettime(CLOCK_MONOTONIC, &spec);
    s = spec.tv_sec;
    ms = spec.tv_nsec * 0.0000001;
    etime = concatenate((long)s, ms);

    int k;
    for (k = 0; k < 799; ++k)
    {
        double A1 = 145 * sin((RAND1 * k + etime) * 0.00333) + RAND5;           // Amplitude
        double A2 = 100 * sin((RAND2 * k + etime) * 0.00333) + RAND4;           // Amplitude
        double A3 = 168 * sin((RAND3 * k + etime) * 0.00333) + RAND3;           // Amplitude
        double A4 = 136 * sin((RAND4 * k + etime) * 0.00333) + RAND2;           // Amplitude

        double B1 = 3 + RAND1 + (sin((RAND5 * k) * etime) * 0.00216);           // Period
        double B2 = 3 + RAND2 + (sin((RAND4 * k) * etime) * 0.002);         // Period
        double B3 = 3 + RAND3 + (sin((RAND3 * k) * etime) * 0.00245);           // Period
        double B4 = 3 + RAND4 + (sin((RAND2 * k) * etime) * 0.002);         // Period

        double x = k;                                   // Current x

        double C1 = 0.6 * etime;                            // X axis move
        double C2 = 0.9 * etime;                            // X axis move
        double C3 = 1.2 * etime;                            // X axis move
        double C4 = 0.8 * etime + 200;                          // X axis move

        double D1 = RAND1 + sin(RAND1 * x * 0.00166) * 4;               // Y axis move
        double D2 = RAND2 + sin(RAND2 * x * 0.002) * 4;                 // Y axis move
        double D3 = RAND3 + cos(RAND3 * x * 0.0025) * 4;                // Y axis move
        double D4 = RAND4 + sin(RAND4 * x * 0.002) * 4;                 // Y axis move

        sine1[k] = A1 * sin((B1 * x + C1) * 0.0025) + D1;
        sine2[k] = A2 * sin((B2 * x + C2) * 0.00333) + D2 + 100;
        sine3[k] = A3 * cos((B3 * x + C3) * 0.002) + D3 + 50;
        sine4[k] = A4 * sin((B4 * x + C4) * 0.00333) + D4 + 100;
    }

}

And this is the output from gprof:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
100.07      0.04     0.04

I'm currently getting a frame rate of roughly 30-31 fps using this. Now I figure there as to be a more efficient way to do this.

As you noticed I already changed all the divisions to multiplications but that had very little effect on performance.

How could I increase the performance of this math heavy function?

considering that you are doing this in 4's it might be a good fit for translating to SIMD (e.g. SSE intrinsics), especially if you dont mind using float precision. — Preet Kukreti, Dec 31 '13 at 01:27
What are the `RANDn` values? As an aside, not a performance issue, but the assignment of `x = k` appears to be superfluous since neither variable is changed inside the loop. — lurker, Dec 31 '13 at 01:29
It seems to me that A, B and C values are completely independent. Why not spawn 4 threads, one thread for each sineX? — cen, Dec 31 '13 at 01:29
Your replacing `/ 300` by `* 0.00333` seems to show that you have extremely low expectations of accuracy. Make the entire computation single-precision by using `sinf` (a well-written `sinf` is faster than `sin`) and single-precision variables and constants. It will still be more accurate than `0.00333` is an accurate approximation of `1/300`. — Pascal Cuoq, Dec 31 '13 at 01:30
An aside, the compiler probably already converts the `/300` to `* 0.00333` and is going to be more accurate than you in doing so. Divisions by constants are multiplications already, you don't need to worry about it. Hence, you didn't see any performance increase. — nonsensickle, Dec 31 '13 at 01:37
Operations on integers (fixed point arithmetic) would be faster if you can afford them, otherwise, more threads or the GPU are your best options as was already mentioned. — nonsensickle, Dec 31 '13 at 01:39
Out of interest, what on earth is the purpose of the function? `sin * sin(sin + sin) + sin` is weird enough from a dimensionality POV, but it's all dependent on the *current system time*? What does that do? — Oliver Charlesworth, Dec 31 '13 at 01:43
@OliCharlesworth yeah, you're right. Probably more effort than it's worth. http://stackoverflow.com/questions/1571502/trigonometric-functions-on-embedded-system and http://www-personal.umich.edu/~johannb/Papers/paper07.pdf — nonsensickle, Dec 31 '13 at 01:43
@Oli Charlesworth: You ever played on PlayStation 3? There is an animation in the background of the operating system. Looks like a couple of sine waves blurring and waving together. That's basically what I'm doing and I'm randomizing the amplitude, period, x offset and y offset according to the time elapsed since the start of the program. I hope I'm making sense. — ReX357, Dec 31 '13 at 01:45
@ReX357: Ah ok, that makes sense I guess! That being the case, presumably you have the option to *significantly* reduce the precision of these calculations without any noticeable effect? — Oliver Charlesworth, Dec 31 '13 at 01:46
Yes I do. How would I go about that? Would switching everything to a float help? — ReX357, Dec 31 '13 at 01:46
Also I'm using OpenGL (linux glx) in this application. Would using GLFloats and OpenGL functions work to offload everything to the GPU? I'm new to all of this. — ReX357, Dec 31 '13 at 01:48
@ReX357: For basic algebra, usually no. But Pascal's observation that trig function on floats being faster than doubles is probably valid. Also, the crude lookup-based trig approximation mentioned in one of the answers is probably a great approach here. — Oliver Charlesworth, Dec 31 '13 at 01:48
I'm not sure why my question has been put on hold? I have a bottle neck in my code because I'm calling the sin function too often, what are the solutions to fix that? I don't see how this is too broad. But I'm sorry if it is. — ReX357, Dec 31 '13 at 04:28

Oliver Charlesworth · Accepted Answer · 2014-01-02T12:48:58.693

12

Besides all the other advice given in other answers, here is a pure algorithmic optimization.

In most cases, you're computing something of the form sin(k * a + b), where a and b are constants, and k is a loop variable. If you were also to compute cos(k * a + b), then you could use a 2D rotation matrix to form a recurrence relationship (in matrix form):

|cos(k*a + b)| = |cos(a)  -sin(a)| * |cos((k-1)*a + b)|
|sin(k*a + b)|   |sin(a)   cos(a)|   |sin((k-1)*a + b)|

In other words, you can calculate the value for the current iteration in terms of the value from the previous iteration. Thus, you only need to to do the full trig calculation for k == 0, but the rest can be calculated via this recurrence (once you have calculated cos(a) and sin(a), which are constants). So you eliminate 75% of the trig function calls (it's not clear the same trick can be pulled for the final set of trig calls).

edited Jan 02 '14 at 12:48

answered Dec 31 '13 at 01:57

Oliver Charlesworth

267,707
33
569
680

Oh god, I'm in over my head. OK, let me see if I understand. You're basically saying I can figure out the next value if I know the value for k == 0, I can figure out the value for k == 1 without computing the whole equation again? – ReX357 Dec 31 '13 at 02:09
@ReX357: Absolutely. To be fair, you need both `cos` and `sin` for `k == 0`, but once you have those, you can derive `cos` and `sin` for `k == 1`, and so on. I'm too tired to do the derivation here, but take a look at the first two equations in the table I linked to. Substitute `k * a + b` for `alpha`, and `-1 * a` for `beta`, and work it through! – Oliver Charlesworth Dec 31 '13 at 02:11
@ReX357: In fact, a much more intuitive way to think about it is that if you have both `sin` and `cos` and you plot them as 2D coordinates, you end up tracing out a circle. To move a point slightly further round a circle (which is what you're doing from iteration to iteration), you need to apply a rotation. The `M` and `N` are two elements of the appropriate 2D [rotation matrix](http://en.wikipedia.org/wiki/Rotation_matrix). – Oliver Charlesworth Dec 31 '13 at 02:24
OK so let me see if I get this right. I would get the value of A1 with both cos and sin when k == 0 and store them in let's say a1_sin and a1_cos. Now when k == 1, I can simply find the value of A1 by exchanging the sin by the value of M * a1_sin + N * a1_cos? – ReX357 Dec 31 '13 at 02:26
@ReX357: Yup, it would be something like `a1_sin = M * a1_sin_old + N * a1_cos_old; a1_cos = -N * a1_sin_old + M * a1_cos_old;` (I've probably got the minus sign in the wrong place; work it through from the wiki page on rotation matrices I linked to). – Oliver Charlesworth Dec 31 '13 at 02:28
OK one last part that is unclear to me at the moment. How do I decide what M and N should be? – ReX357 Dec 31 '13 at 02:39
Sorry my math skills are lacking in this department and I have no formal trig training, I'm a self taught hobbyist programmer. That wikipedia page is chinese to me haha. I do understand fairly quick once I'm presented with a concrete example though. – ReX357 Dec 31 '13 at 02:42
@ReX357: They'll be something like `sin(a)` and `cos(a)`, respectively. I'm afraid you'll need to work the maths through yourself, as I'm going to bed! – Oliver Charlesworth Dec 31 '13 at 02:45
is the accepted answer to this http://stackoverflow.com/questions/19909501/calculate-the-function-sin what you're talking about here? – ReX357 Dec 31 '13 at 06:01
@OliCharlesworth, that's a clever solution. I think you could combine that with SSE by using a -> 4*a (4 times the rotation angle) and then setting the initial value of the SSE register to the value of sin(ak*+b) at k = 0, 1, 2, and 3. – Z boson Jan 02 '14 at 12:36
I gave up on this problem but this equation helped while I worked on it. sin(a + b) = sinA * cosB + sinB * cosA. Since B was etime and A was relative to the current x, I was able to precompute A into a lookup table and then simply calculate the sin and the cos of B once per frame. I ran into another bottleneck because my lookup tables wouldn't fit into the cache memory and it actually slowed down my FPS. But his solution was right on the algebric level. I was basically able to reduce my sin() calls from 12800 per frame to 8 per frame with this equation. – ReX357 Jan 05 '14 at 09:43

Glenn Teitelbaum · Answer 2 · 2013-12-31T05:03:47.620

If you don't need all that precision, create a lookup for the sin() values you need, so if 1 degree is enough, use double sin_lookup[360], etc.. And possibly float sin_lookup[360] if float precision is sufficient.

Also, as noted in comments, at a certain point as per Keith, "You might also consider using linear interpolation between lookup values, which should give you substantially better accuracy (a reasonably continuous function rather than a step function) at a fairly small cost in performance"

EDIT: also consider changing the hardcoded A1,A2,A3,A4 pattern to arrays of size[4], and looping from 0 to 3 - should allow vectorization on many platforms and allow parrellism without needing to manage threads

EDIT2: some code and results

(Coded in C++ just to make comparisons easy between precisions, calcs are the same in C)

class simple_trig
{
public:
        simple_trig(size_t prec) : precision(prec)
        {
                static const double PI=3.141592653589793;
                const double dprec=(double)prec;
                const double quotient=(2.0*PI)/dprec;
                rev_quotient=dprec/(2.0*PI);
                values.reserve(prec);

                for (int i=0; i < precision; ++i)
                {
                        values[i]=::sin(quotient*(double)i);
                }
        }

        double sin(double x) const
        {
                double cvt=x*rev_quotient;
                int index=(int)cvt;
                double delta=cvt-(double)index;
                int lookup1=index%precision;
                int lookup2=(index+1)%precision;
                return values[lookup1]*(1.0-delta)+values[lookup2]*delta;
        }

        double cos(double x) const
        {
                double cvt=x*rev_quotient;
                int index=(int)cvt;
                double delta=cvt-(double)index;
                int lookup1=(index+precision/4)%precision;
                int lookup2=(index+precision/4+1)%precision;
                return values[lookup1]*(1.0-delta)+values[lookup2]*delta;
        }

private:
        const size_t precision;
        double rev_quotient;
        std::vector<double> values;
};

Examples Low is 100, Med is 1000 and High is 10,000

X=0 Sin=0 Sin Low=0 Sin Med=0 Sin High=0
X=0 Cos=1 Cos Low=1 Cos Med=1 Cos High=1
X=0.5 Sin=0.479426 Sin Low=0.479389 Sin Med=0.479423 Sin High=0.479426
X=0.5 Cos=0.877583 Cos Low=0.877512 Cos Med=0.877578 Cos High=0.877583
X=1.33333 Sin=0.971938 Sin Low=0.971607 Sin Med=0.971935 Sin High=0.971938
X=1.33333 Cos=0.235238 Cos Low=0.235162 Cos Med=0.235237 Cos High=0.235238
X=2.25 Sin=0.778073 Sin Low=0.777834 Sin Med=0.778072 Sin High=0.778073
X=2.25 Cos=-0.628174 Cos Low=-0.627986 Cos Med=-0.628173 Cos High=-0.628174
X=3.2 Sin=-0.0583741 Sin Low=-0.0583689 Sin Med=-0.0583739 Sin High=-0.0583741
X=3.2 Cos=-0.998295 Cos Low=-0.998166 Cos Med=-0.998291 Cos High=-0.998295
X=4.16667 Sin=-0.854753 Sin Low=-0.854387 Sin Med=-0.854751 Sin High=-0.854753
X=4.16667 Cos=-0.519036 Cos Low=-0.518818 Cos Med=-0.519034 Cos High=-0.519036
X=5.14286 Sin=-0.90877 Sin Low=-0.908542 Sin Med=-0.908766 Sin High=-0.90877
X=5.14286 Cos=0.417296 Cos Low=0.417195 Cos Med=0.417294 Cos High=0.417296
X=6.125 Sin=-0.157526 Sin Low=-0.157449 Sin Med=-0.157526 Sin High=-0.157526
X=6.125 Cos=0.987515 Cos Low=0.987028 Cos Med=0.987512 Cos High=0.987515
X=7.11111 Sin=0.73653 Sin Low=0.736316 Sin Med=0.736527 Sin High=0.73653
X=7.11111 Cos=0.676405 Cos Low=0.676213 Cos Med=0.676403 Cos High=0.676405
X=8.1 Sin=0.96989 Sin Low=0.969741 Sin Med=0.969887 Sin High=0.96989
X=8.1 Cos=-0.243544 Cos Low=-0.24351 Cos Med=-0.243544 Cos High=-0.243544
X=9.09091 Sin=0.327701 Sin Low=0.327558 Sin Med=0.3277 Sin High=0.327701
X=9.09091 Cos=-0.944782 Cos Low=-0.944381 Cos Med=-0.944779 Cos High=-0.944782
X=10.0833 Sin=-0.611975 Sin Low=-0.611673 Sin Med=-0.611973 Sin High=-0.611975
X=10.0833 Cos=-0.790877 Cos Low=-0.790488 Cos Med=-0.790875 Cos High=-0.790877

This assumes calls to `sin()` are the bottleneck (which seems likely, but it's worth checking). You might also consider using linear interpolation between lookup values, which should give you substantially better accuracy (a reasonably continuous function rather than a step function) at a fairly small cost in performance. — Keith Thompson, Dec 31 '13 at 01:34
@KeithThompson Good idea, that would depend on how much precision is needed and how much space is allocated, it is likely that it is not needed for degrees, or even 1/10 degrees, but beyond that it would very likely pay off — Glenn Teitelbaum, Dec 31 '13 at 01:42
And depending on the application (I don't pretend to know what the OP is doing), making your pseudo-`sin` function more or less continuous rather than a step function could be a very good thing, beyond just getting nearly correct values. — Keith Thompson, Dec 31 '13 at 01:49
@KeithThompson Been a while, but isn't there also a certain point below which x can be used instead of sin(x) — Glenn Teitelbaum, Dec 31 '13 at 02:10
Not a *certain* point, but yes, lim(n -> 0) sin(x)/x = 1 (I hope my ersatz mathematical notation is clear enough). — Keith Thompson, Dec 31 '13 at 02:12
I'm sorry but my math skills are very limited. How would I populate this lookup table? — ReX357, Dec 31 '13 at 02:16
If RANDX values are constants you simply calculate all possible sin() and cos() that you use in your function and store results in table. Then instead of calculating these results you "lookup" for the result in that table. But if RANDX are random variables which are changing through time it is not possible to create such a table. Unless they are random on a very small scope that is.. if it's not the table would become huge. — cen, Dec 31 '13 at 02:26
A and B calculations use `etime` in sin() so you can't make lookup table for that. However you can make a lookup table for D. For example, for D1 calculate sin(RAND1 * x * 0.00166) for all x and store results in an array. Then, in your for loop, instead of calculating this just use yourarray[k] as precalculated value. Someone correct me about A and B if I'm wrong please. @Oli Charlesworth: yep.. just noticed — cen, Dec 31 '13 at 02:31
@KeithThompson -- is that what you meanty by linear interpolation ? — Glenn Teitelbaum, Dec 31 '13 at 04:22
@GlennTeitelbaum Just seen your second edit, I'm working in C not C++. — ReX357, Dec 31 '13 at 04:22
@ReX357 More organized in C++ (easier to do compparisons)- but you see the calcs? Basically decide your precision - use the simple_trig code to initialize values - and then use the code for sin or cos (replace vector values with double values[x]) — Glenn Teitelbaum, Dec 31 '13 at 04:24
@GlennTeitelbaum I see the calculations. Is that what he was referring to with linear interpolation on the lookup tables? — ReX357, Dec 31 '13 at 04:26

score 1 · Answer 3 · answered Dec 31 '13 at 01:36

It seems to me that sine1, sine2, sine3 and sine4 arrays are completely independent from eachother. So you are basically running a single for loop for 4 different arrays which have no dependency.

Spawn 4 threads, 1 for each, so you have 4 for loops running at the same time. On multicore machine this should speed up your function dramatically. As a matter of fact, it should be a perfect 4x speedup (+- ...).

score 0 · Answer 4 · answered Dec 31 '13 at 01:48

0

Actually combining the use of threads (consider this with OpenMP) and the use of a table for the sin is a good idea. If possible use float instead of double and, depending on the platform, you could also use simd instructions, but the later would make the use of threads unnecessary.

Cheers

answered Dec 31 '13 at 01:48

prmottajr

1,816
1
13
22

OpenMP is potentially helpful here, but I'd be worried about the setup overhead incurred every time this function is called. Presumably the run-time for a single call to this function is pretty small. – Oliver Charlesworth Dec 31 '13 at 01:51
@OliCharlesworth you are right, but my suggestion is threads ;) OpenMP is just if using threads was considered too difficult. – prmottajr Dec 31 '13 at 01:53
@OliCharlesworth - it loops 800 times over 4 variable sets – Glenn Teitelbaum Dec 31 '13 at 01:57
@GlennTeitelbaum: Indeed, but I would hope that takes a few milliseconds, at most? Speed that up by a factor of 4 (or more, if you start throwing in some of the other optimisations suggested here), and the thread startup time is going to be significant. I guess you could have a thread pool sitting there ready to go, but now things start to get complicated ;) – Oliver Charlesworth Dec 31 '13 at 02:06

score 0 · Answer 5 · answered Sep 01 '21 at 16:11

Here is a C++ snippet to use the rotation matrix suggested in the accepted answer.

   float a = 0.343;
   float b = 2.3232;
   float sina{};
   float cosa{};
   sincosf(a, &sina, &cosa);
   float resSin{};
   float resCos{};

   for (int k = 0; k < 5; k++) {
     if (k == 0) {
       sincosf(b, &resSin, &resCos);
     } else {
       float newResCos, newResSin;
       newResCos = cosa * resCos - sina * resSin;
       newResSin = sina * resCos + cosa * resSin;
       resCos = newResCos;
       resSin = newResSin;
     }
   }

C: Improving performance of function with heavy sin() usage

5 Answers5

Linked