2

For signal processing I need to compute relatively large C arrays as shown in the code part below. This is working fine so far; unfortunately, the implementation is slow. The size of "calibdata" is around 150k and needs to be calculated for different frequencies/phases. Is there a way to improve the speed significantly? Doing the same with logical indexing in MATLAB is way faster.

What I tried already:

  • using a Taylor approximation of sine: no significant improvement.
  • using std::vector: also no significant improvement.

code:

double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
    double result;
    for (int i = 0; i < size; i++)
        result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));

    result = fabs(result / size);

    return result;
}

Best regards, Thomas

T.TM
  • 31
  • 3
  • 2
    Just to be sure, did you try turning on optimizations in your compiler options? – dascandy Feb 19 '16 at 19:40
  • 1
    Perhaps compute stuff off-line and use a look up table (swap CPU for memory!) – Ed Heal Feb 19 '16 at 19:43
  • Did you compile in release or debug mode? – NathanOliver Feb 19 '16 at 19:44
  • 1
    Please choose one of C and C++ for this question. – fuz Feb 19 '16 at 19:45
  • 2
    Apart from performance issues, it might be advisable to initialize `result`. – RHertel Feb 19 '16 at 19:47
  • 1
    What is your accuracy requirement? – Frank Puffer Feb 19 '16 at 19:47
  • 1
    Lookup tables are the best. But I am pretty sure you can move a larger part to offline computation than just sines and cosines. Just look closely at your formulae and factor them out. – Eugene Sh. Feb 19 '16 at 19:47
  • 1
    I would precompute `(PI / 180)` and `(PI / 2)` as floating point division is expensive. hopefully the compiler is doing it but it never hurts to try. – NathanOliver Feb 19 '16 at 19:48
  • @EugeneSh. I thought compilers use lookup tables for trigonometric functions and interpolate the values. Am I mistaken? – RHertel Feb 19 '16 at 19:52
  • @RHertel I would not make any assumptions about it. To provide a good accuracy it would require a large memory space, so it should, at the very least, give an option not to. Anyway, [here](http://stackoverflow.com/questions/2284860/how-does-c-compute-sin-and-other-math-functions) is the topic discussing it. – Eugene Sh. Feb 19 '16 at 19:55
  • @EugeneSh. Thank you! – RHertel Feb 19 '16 at 19:56
  • 1
    Can you precalculate the `phase` (I mean the whole `phase*(PI/180) -(PI/2)`) and the `2*PI*...` factor? You say they need to be calculated for different phases and frequencies, but maybe they are known in advance. – Bob__ Feb 19 '16 at 20:07
  • 1
    `size` and `currentcarrier` would be better as `size_t`. – Daniel Jour Feb 19 '16 at 20:31
  • 1
    This is the weak spot of C++, because the compiler thinks `freqscale[currentcarrier]` needs to be loaded every time. Try `double *__restrict__ freqscale` and the compiler will treat `freqscale[currentcarrier]` as a constant, which is only loaded once. Or, as @user3386109 implied in his answer: move `double delta = 2 * PI * freqscale[currentcarrier] / fs;` out of the loop. – user3528438 Feb 21 '16 at 01:05

6 Answers

4

Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:

http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com

Outside of MATLAB you could use OpenCL or CUDA yourself.
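Before committing to a GPU port, the same parallelization idea can be prototyped on the CPU (as a comment below notes, multi-threading is worth trying first). Here is a minimal sketch using POSIX threads; the function name `phase_func_mt`, the `chunk_t` struct, and the fixed thread count are illustrative, not part of the OP's code:

#include <math.h>
#include <pthread.h>

#define PI 3.141592653589793

/* Per-thread slice of the reduction; all names here are illustrative. */
typedef struct {
    const double *calibdata;
    long begin, end;     /* half-open range [begin, end) */
    double w;            /* 2*PI*freqscale[currentcarrier]/fs */
    double offset;       /* phase*(PI/180) - PI/2 */
    double partial;      /* this thread's partial sum */
} chunk_t;

static void *sum_chunk(void *arg) {
    chunk_t *c = arg;
    double s = 0.0;
    for (long i = c->begin; i < c->end; i++)
        s += c->calibdata[i] * cos(c->w * (double)i + c->offset);
    c->partial = s;
    return NULL;
}

/* Split the sum across a fixed number of CPU threads. Compile with -pthread. */
double phase_func_mt(const double *calibdata, long size, const double *freqscale,
                     double fs, double phase, int currentcarrier) {
    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    double w = 2.0 * PI * freqscale[currentcarrier] / fs;
    double offset = phase * (PI / 180.0) - PI / 2.0;

    for (int t = 0; t < NTHREADS; t++) {
        chunk[t] = (chunk_t){ calibdata, size * t / NTHREADS,
                              size * (t + 1) / NTHREADS, w, offset, 0.0 };
        pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]);
    }

    double result = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        result += chunk[t].partial;
    }
    return fabs(result / size);
}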

yapdog
  • 106
  • 4
  • Honestly, I think this is the best answer. There's only so far you can go with compiler optimizations. I would maybe advise the OP to first attempt to multi-thread the code, though (reductions can easily be made multi-threaded), and see if that improves things before moving to GPU processing, especially since GPU processing usually guarantees an overhead of 100-1000 milliseconds regardless of the actual operations/data size involved. – Xirema Feb 19 '16 at 20:24
  • @Xirema Is milliseconds correct? That's 0.1 to 1.0 seconds then, right? – user3386109 Feb 19 '16 at 20:26
  • @user3386109 Milliseconds is correct, in my experience. It's not obvious what the OP means when they say "the implementation is slow". If you're working with network/socket data, then a delay of 100-1000 milliseconds is nothing. If you're working with realtime data on an embedded processor, 100-1000 milliseconds can kill your application. So I'm hedging my suggestion on the uncertainty that 100-1000 milliseconds represents a significant delay. – Xirema Feb 19 '16 at 20:30
  • @Xirema Thanks! That's a good number to know before thinking about moving code to the GPU. – user3386109 Feb 19 '16 at 20:38
  • @Xirema Oops! I was going to mention breaking up the array for multithreading, but got caught up in searching for MATLAB stuff. Good catch. Also, I was aware there was a delay, but those numbers are insane! If that's the case, then they must be doing some special voodoo with OpenGL compute shaders; turning off VSync, I can get 200+ fps (GTX 780) on large arrays (+10K). I'd just as soon run some tests using compute shaders. Much easier setup, anyway. – yapdog Feb 19 '16 at 21:42
  • @yapdog My estimates for overhead include compiling the kernel and setting up the context/buffers/etc. The overhead of actually submitting the kernel to the GPU is very low, though it depends on what you're actually doing. – Xirema Feb 19 '16 at 21:48
  • @Xirema Once again, you are faster than I :^) I was about to write that those times likely included compiling, which **should** be done in advance. Even so, you've raised a very good point, since we don't really know the nature of user3386109's implementation needs. – yapdog Feb 19 '16 at 22:05
4

When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.

Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.

For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.

double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
    double result = 0;
    double angle = phase * (PI / 180) - (PI / 2);
    double delta = 2 * PI * freqscale[currentcarrier] / fs;
    for (int i = 0; i < size; i++)
    {
        result += calibdata[i] * cos( angle );
        angle += delta;
    }
    return fabs(result / size);
}
user3386109
  • 34,287
  • 7
  • 49
  • 68
  • 2
    I was thinking of something similar; my doubt is about the accumulation of numerical error using `angle += delta` instead of computing `i*...`. – Bob__ Feb 19 '16 at 20:43
  • 1
    @Bob__ That's a good point. It's possible to get a rough estimate of the error given that `size` is 150K. Assuming that each addition introduces half an LSB of error in the `angle`, after 150K additions the error accumulates to about 17 bits. Since a double has 52 bits of precision, the error leaves 35 bits correct. Thus the error is 1 part in 34 billion, which is more than adequate for most signal processing tasks. – user3386109 Feb 19 '16 at 20:53
1

You can try to use the definition of cosine based on the complex exponential:

cos(x) = (exp(j*x) + exp(-j*x)) / 2

where j^2 = -1.

Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, since cos(w*i + p) = creal(exp(j*w*i))*cos(p) - cimag(exp(j*w*i))*sin(p). This way, sin(), cos() and exp() are only called a couple of times.

Here goes the implementation:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h> 

#define PI   3.141592653589

typedef struct cos_plan{
    double complex* expo;
    int size;
}cos_plan;

double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
    double result=0;  //initialization
    for (int i = 0; i < size; i++){

        result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );

        //printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
    }
    result = fabs(result / size);

    return result;
}

double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){

    //first, let's compute the exponentials:
    //double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
    //double complex phaseexpm=conj(phaseexp);

    double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
    double phasecos=cos(phase*(PI / 180.) - (PI / 2.));

    if (plan->size<size){
        double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
        if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
        plan->expo=tmp;
        plan->size=size;
    }

    plan->expo[0]=1;
    //plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
    plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
    //printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
    for(int i=2;i<size;i++){
        if(i%2==0){
            plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
        }else{
            plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
        }
    }
    //computing the result
    double result=0;  //initialization
    for(int i=0;i<size;i++){
        //double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
        double coss=creall(plan->expo[i])*phasecos-cimagl(plan->expo[i])*phasesin;
        //printf("i %d cos %g\n",i,coss);
        result+=calibdata[i] *coss;
    }

    result = fabs(result / size);

    return result;
}

int main(){
    //the parameters

    long n=100000000;
    double* calibdata=malloc(n*sizeof(double));
    if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}

    int freqnb=42;
    double* freqscale=malloc(freqnb*sizeof(double));
    if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    for (int i = 0; i < freqnb; i++){
        freqscale[i]=i*i*0.007+i;
    }

    double fs=n;

    double phase=0.05;

    //populate calibdata
    for (int i = 0; i < n; i++){
        calibdata[i]=i/((double)n);
        calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
    }

    //call to sample code
    clock_t t;
    t = clock();
    double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
    t = clock() - t;

    printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);


    //initialize
    cos_plan plan;
    plan.expo=malloc(n*sizeof(double complex));
    plan.size=n;

    t = clock();
    res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
    t = clock() - t;

    printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);




    //cleaning

    free(plan.expo);

    free(calibdata);
    free(freqscale);

    return 0;
}

Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the proposed solution takes 1.5 seconds... It is not so impressive, but it is not negligible.

The solution that is presented does not involve any call to cos or sin in the for loops. Indeed, there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the first issue, since I had to use an additional array of complex numbers).
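If that extra array is indeed the limit, a single-phasor recurrence needs no storage at all. A sketch, reusing `phasecos`/`phasesin` from above and assuming `w = 2 * PI * freqscale[currentcarrier] / fs`; note that the repeated multiplications accumulate rounding error faster than the squaring scheme does:

double complex rot = cexp(I * w);   /* advance of the unit phasor per sample */
double complex z = 1.0;             /* exp(I*w*0) */
double result = 0.0;
for (long i = 0; i < size; i++) {
    /* cos(w*i + p) = creal(z)*cos(p) - cimag(z)*sin(p) */
    result += calibdata[i] * (creal(z) * phasecos - cimag(z) * phasesin);
    z *= rot;                       /* z becomes exp(I*w*(i+1)) */
}
result = fabs(result / size);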

For complex numbers in C, see the C99 `<complex.h>` header.

If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of a Discrete Cosine Transform; the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.

Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[i] when size is large. Using partial sums can resolve the issue.

francis
  • 9,525
  • 2
  • 25
  • 41
  • Hm... nice theory, but how exactly to implement it? – Eugene Sh. Feb 19 '16 at 19:57
  • Probably will do nothing for you. The cost of evaluating the two exp() expressions is likely to be greater than the cost of just evaluating cos() – FredK Feb 19 '16 at 20:01
  • Introducing complex numbers to accelerate the code? – RHertel Feb 19 '16 at 20:03
  • @EugeneSh. : implementing this would require the use of `pow()` to compute the exponentials... Since the power of the exponential is required for all `i`, a limited number of calls to `pow()` are required. The problem is that it could lead to precision issues... I'm gonna try it... – francis Feb 19 '16 at 20:04
  • Complex exponentials, no? – Eugene Sh. Feb 19 '16 at 20:05
  • @FredK : if the use of `pow` works to compute `exp(i*x)` for all `i`, then `exp` is only called twice for the whole array ! – francis Feb 19 '16 at 20:10
  • @EugeneSh. : Yes, exactly, complex exponentials! @RHertel : Indeed, in this particular problem, introducing complex numbers helped. @FredK : the proposed solution does not involve the use of `pow` or `exp` in the for loops, only multiplications and additions! As a result, the wall-clock time is divided by 5 compared to the evaluation of cos(), without significant loss of precision. – francis Feb 19 '16 at 23:07
1

Your enemies in execution time are:

  • Division
  • Function calls (including implicit ones in loops)
  • Accessing data from different areas
  • Operating dissimilar instructions

You should research data-driven programming and how to use the data cache effectively.

Division

Whether implemented with hardware or software support, division takes a long time by its very nature. Eliminate it where possible, by changing the numeric base or by factoring it out of the loop, as sketched below.
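Applied to the OP's loop, the per-iteration division by `fs` can be hoisted out completely (a sketch reusing the OP's variables; `w` and `offset` are the loop invariants, the same hoisting user3386109's answer performs):

double w = 2.0 * PI * freqscale[currentcarrier] / fs;   /* one division, done once */
double offset = phase * (PI / 180.0) - (PI / 2.0);
double result = 0.0;
for (long i = 0; i < size; i++)
    result += calibdata[i] * cos(w * (double)i + offset);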

Function Calls

The most efficient method of execution is sequential; processors are optimized for this. A branch may require the processor to perform some additional calculation (branch prediction) or to reload the instruction cache/pipeline: a waste of time that could be spent executing data instructions.

The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.

Accessing Data from Different Areas

Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data, specifically loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.

A better solution is to place all data contiguously, or to access all the data contiguously. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses in between), as in the sketch below.
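For instance, a two-pass variant keeps every access sequential: fill a contiguous cosine buffer first, then stream both arrays once for the dot product (a sketch; `cosbuf` is a hypothetical preallocated array of `size` doubles, and `w`/`offset` are the hoisted constants from the Division sketch):

/* Pass 1: sequential writes of the cosine values. */
for (long i = 0; i < size; i++)
    cosbuf[i] = cos(w * (double)i + offset);

/* Pass 2: sequential reads over two contiguous arrays. */
double result = 0.0;
for (long i = 0; i < size; i++)
    result += calibdata[i] * cosbuf[i];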

Dissimilar Instructions

Modern processors are more efficient at processing a batch of similar instructions. For example, the pattern load, add, store is more efficient when processed in blocks: all the loading performed first, then all the adding, then all the storing.

Summary

Here's an example:

register double result = 0.0;
register unsigned int i = 0U;
for (i = 0; i < size; i += 2)
{
    register double cos_angle1 = /* ... */;
    register double cos_angle2 = /* ... */;
    result += calibdata[i + 0] * cos_angle1;
    result += calibdata[i + 1] * cos_angle2;
}

The above loop is unrolled and like operations are performed in groups.
Although the keyword register may be deprecated, it is a suggestion to the compiler to use dedicated registers (if possible).

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
  • Two questions: Shouldn't the compiler be able to unroll this loop on its own? And do you have any source for your claim that modern processors are better with a batch of similar instructions? Having finished a university course about this very subject about half a year ago, I'm a bit sceptical about that. In your example, performing all the loading / storing in batches would possibly be limited by memory bandwidth / cache availability. And performing adds in a batch is only efficient as long as you've got enough ALU units available. – Daniel Jour Feb 19 '16 at 21:44
  • The compiler may unroll the loop depending on the optimizations, but it won't group the instructions (unless it is given more information through specific, specialized pragmas). I have performed this operation on a waveform smoothing function running on an ARM7 embedded system. I measured using an oscilloscope and noticed that it improved execution time by over 100 microseconds. – Thomas Matthews Feb 20 '16 at 00:28
  • I have also benchmarked a Cortex A8 where the data was manipulated inline (the array was local) vs. the data being in another module. The inline data had incredibly faster performance. Organizing the data to line up better with the data cache also saved more execution time. All profiling was performed using an oscilloscope. – Thomas Matthews Feb 20 '16 at 00:30
  • Thanks a lot. Reading up on it on Wikipedia, it seems most ARM7 systems aren't using out of order execution, which could be a reason for the difference between your practical results and my theoretical assumptions. – Daniel Jour Feb 21 '16 at 18:50
  • If you think the answer was helpful, please click on the check mark. – Thomas Matthews Feb 21 '16 at 18:51
0
  1. Use a simple trig identity to eliminate the - (PI / 2). This is also more accurate than performing the subtraction, which uses a machine approximation of π. This is important when values are near π/2.

    cosine(x - π/2) == sine(x)
    
  2. Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also @user3528438)

    // double phase_func(double* calibdata, long size, 
    //     double* freqscale, double fs, double phase, int currentcarrier) {
    double phase_func(const double* restrict calibdata, long size, 
        const double* restrict freqscale, double fs, double phase, int currentcarrier) {
    
  3. Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.

    // result += calibdata[i] * cos(...
    result += calibdata[i] * cosf(...
    
  4. Minimize recalculations.

    double angle_delta = ...;
    double angle_current = ...;
    for (int i = 0; i < size; i++) {
      result += calibdata[i] * cos(angle_current);
      angle_current += angle_delta;
    }
    
  5. Unclear why the code uses `long size` and `int currentcarrier`. I'd expect the same type for both, namely `size_t`, which is idiomatic for array indexing. @Daniel Jour

  6. Reversing loops can allow a compare against 0 rather than against a variable. Sometimes a modest performance gain.

  7. Ensure compiler optimizations are enabled.

All together

double phase_func2(const double* restrict calibdata, size_t size,
    const double* restrict freqscale, double fs, double phase,
    size_t currentcarrier) {

  double result = 0.0;
  double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
  double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
  size_t i = size;
  while (i) {
    result += calibdata[--i] * sinf(angle_current);
    angle_current -= angle_delta;
  }
  result = fabs(result / size);
  return result;
}
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • Sum of products of function values and sine/cosine values with scaled periods looks like a convolution. I bet the `phase_func2` function is called repeatedly for every value of `phase`. The author is probably solving an XY problem, where Y is the `phase_func2` optimization problem, and X is a kind of wavelet transform. – mbaitoff Jul 12 '16 at 04:33
  • @mbaitoff Suspect you are correct. As usual, the more info from the OP, the better the optimization possibilities. – chux - Reinstate Monica Jul 12 '16 at 05:37
0

To leverage the cores you have without resorting to the GPU, use OpenMP. Testing with VS2015, with AVX2 and OpenMP enabled, the invariants are lifted out of the loop by the optimizer.

double phase_func3(double* calibdata, const int size, const double* freqscale, 
    const double fs, const double phase, const size_t currentcarrier)
{
    double result{};
    constexpr double PI = 3.141592653589;

#pragma omp parallel
#pragma omp for reduction(+: result)
    for (int i = 0; i < size; ++i) {
        result += calibdata[i] *
            cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
    }
    result = fabs(result / size);
    return result;
}

The original version with AVX enabled took ~1.4 seconds, and adding OpenMP brought it down to ~0.51 seconds.

Pretty nice return for two pragmas and a compiler switch.
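(For reference: the VS2015 switches in question are `/openmp` and `/arch:AVX2`; with gcc or clang the rough equivalent would be `-O3 -fopenmp -mavx2`.)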

David Thomas
  • 778
  • 5
  • 10