
I have a simple loop in C where I convert magnitude and angle to real and imaginary parts. I have two versions of this loop. Version 1 is a plain for loop that performs the conversion with the following code:

for(k = 0; k < n; k++){
    xReal[k] = Mag[k] * cos(Angle[k]);
    xImag[k] = Mag[k] * sin(Angle[k]);
}

And Version 2, where intrinsics are used to vectorize the loop:

__m256d cosVec, sinVec;
__m256d resultReal, resultImag;
__m256d angVec, voltVec;
for(k = 0; k < SysData->totNumOfBus; k+=4){

    voltVec = _mm256_loadu_pd(volt + k);
    angVec = _mm256_loadu_pd(theta + k);

    sinVec = _mm256_sincos_pd(&cosVec, angVec);

    resultImag = _mm256_mul_pd(voltVec, sinVec);
    resultReal = _mm256_mul_pd(voltVec, cosVec);

    _mm256_storeu_pd(xReal+k, resultReal);
    _mm256_storeu_pd(xImag+k, resultImag);

}

On a Core i7 2600K @3.4GHz processor, these loops give the following results:

Version 1: n = 18562320, Time: 0.2sec
Version 2: n = 18562320, Time: 0.16sec

A simple calculation with these values shows that each iteration of Version 1 takes almost 36 cycles, whereas each iteration of Version 2 (which processes four elements) takes about 117 cycles. Considering that evaluating the sine and cosine functions is inherently expensive, these numbers do not seem terrible. However, this loop is a serious bottleneck in my function, as profiling shows that almost 1/3 of the time is spent inside it. So, I am wondering if there is any way to speed up this loop (e.g. by calculating the sine and cosine functions differently). I would appreciate help working around this problem, and knowing whether there is room to improve the performance of this loop.

Thanks in advance for your help

PS: I am using icc to compile the code. Also, I should mention that the data are not aligned (and cannot be). However, aligning the data leads to only a minor performance improvement (less than 1 percent).

Pouya
  • How accurate do you need your results to be? If you're willing to accept a certain level of error, you can replace sin and cos with a lookup table. This is one of the most common (and old-school) approaches to accelerating trig functions. – Sniggerfardimungus Aug 12 '13 at 23:29
  • Take a look at this question [Fast Sin/Cos using a pre computed translation array](http://stackoverflow.com/questions/2088194/fast-sin-cos-using-a-pre-computed-translation-array) – Ebbe M. Pedersen Aug 12 '13 at 23:32
  • Should you want to trade speed for precision, please advise about the precision needed. Also, what is the type of `Angle[k]`? – chux - Reinstate Monica Aug 13 '13 at 04:32
  • Are you using `-O3` ? Also can you check the generated code for your scalar loop and see whether the compiler is doing some auto-vectorization ? – Paul R Aug 13 '13 at 05:53
  • You could have a loop-carried dependency in Version 2. Try unrolling the loop. – Z boson Aug 13 '13 at 08:28
  • You could try another math library (you're using Intel's SVML). For example [avx_mathfun](http://software-lisc.fbk.eu/avx_mathfun/). There are also fast math libraries for when you don't need the precision, e.g. [fastapprox](https://code.google.com/p/fastapprox/) (it works for SSE; you would have to adapt the code for AVX). Also consider using float instead of double. – Z boson Aug 13 '13 at 08:44
  • Also, why do you say the data can't be aligned but then claim it only gives a 1% improvement? Did you actually test it (i.e. did you use `_mm256_load/store_pd` instead of `_mm256_loadu/storeu_pd`)? – Z boson Aug 13 '13 at 08:46
  • @Sniggerfardimungus, This is part of a larger project which performs integration. If a less accurate method is selected, the error accumulates and the results may not be valid. Also, the results are supposed to be compared with a commercial package, so I prefer not to use lookup tables as much as possible, or to sacrifice accuracy for a 10 or 20 percent improvement. – Pouya Aug 13 '13 at 15:37
  • @chux All of the variables (including `Angle[k]`) are double precision. – Pouya Aug 13 '13 at 15:38
  • @PaulR Yes, I am using `-O3` and it seems that the compiler does a pretty good job. The report says that the scalar loop (`version 1`) has been vectorized. Also, I inspected the assembly code and saw that the compiler does not calculate the `sine` and `cosine` functions separately. Instead, the compiler calculates these functions using a single call (just like what I've done in the second version). – Pouya Aug 13 '13 at 15:40
  • @redrum I unrolled the loop 4 times and did not see any improvement. – Pouya Aug 13 '13 at 15:41
  • @redrum The data are passed from another function. All the angles and magnitudes are packed in a vector along with many other variables, so even if I align the magnitudes, the angles will not necessarily be aligned (or vice versa). However, in order to investigate the effect of data alignment, I prepared aligned data and found out that it does not affect the performance significantly. – Pouya Aug 13 '13 at 15:49
  • @redrum Thanks for your comment about `fastapprox` library. I'll check it out and see how it affects the program. – Pouya Aug 13 '13 at 15:51
  • Perhaps you're memory bound. You're running over several megabytes. See [this](http://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement) discussion. – Z boson Aug 13 '13 at 17:42
  • Just as an extra test you can use SVML in GCC. You just have to link in `libsvml.a` (choose the correct 32-bit or 64-bit version). I have had better results with AVX intrinsics on GCC than ICC (surprisingly). – Z boson Aug 13 '13 at 17:50

3 Answers


I recommend building a Taylor-series-based sin/cos function and using `_mm256_stream_pd()` to store the data. Here is a basic sample (it assumes a `Factorial()` helper that returns n! as a double):

    __m256d sin_req[10];
    __m256d cos_req[10];
    __m256d one_pd =  _mm256_set1_pd(1.0);

    for(int i=0; i<10; ++i)
    {
        sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+1) );
        cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+0) );
    }

    for(int i=0; i<count; i+=4)
    {
            __m256d voltVec = _mm256_load_pd(volt + i);
            __m256d angVec = _mm256_load_pd(theta + i);

            // sin/cos by taylor series
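            // NOTE: the arithmetic operators (*, +) on __m256d values below rely on
            // compiler vector-extension support (e.g. GCC/Clang); with compilers that
            // lack it, use _mm256_mul_pd / _mm256_add_pd instead.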
            __m256d angleSq = angVec * angVec;
            __m256d sinVec = angVec;
            __m256d cosVec = one_pd;
            __m256d sin_serise = sinVec;
            __m256d cos_serise = one_pd;
            for(int j=0; j<10; ++j)
            {
                sin_serise = sin_serise * angleSq; // [1]
                cos_serise = cos_serise * angleSq;
                sinVec = sinVec + sin_serise * sin_req[j];
                cosVec = cosVec + cos_serise * cos_req[j];
            }

            __m256d resultReal = voltVec * cosVec;
            __m256d resultImag = voltVec * sinVec;

            _mm256_store_pd(xReal + i, resultReal);
            _mm256_store_pd(xImag + i, resultImag );
    }

I could get 57~58 CPU cycles per 4-component calculation.

I searched Google and ran some tests for the accuracy of my sin/cos. Some articles say 10 iterations gives double-precision accuracy while -M_PI/2 < angle < +M_PI/2, and my test results show it is more accurate than math.h's sin/cos in the -M_PI < angle < +M_PI range. You can increase the iteration count for more accuracy with large angle values if needed.
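If your input angles are not already bounded, a range reduction into [-M_PI, +M_PI] before the polynomial keeps the truncated series accurate. A minimal sketch (this helper is not part of the code above, and it assumes the angles are not so huge that a single-step reduction loses precision; very large inputs would need a full Payne-Hanek style reduction):

    #include <immintrin.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    // Fold each angle into [-pi, +pi] by subtracting the nearest multiple of 2*pi.
    static inline __m256d reduce_angle_pd(__m256d x)
    {
        const __m256d inv_two_pi = _mm256_set1_pd(1.0 / (2.0 * M_PI));
        const __m256d two_pi     = _mm256_set1_pd(2.0 * M_PI);
        __m256d k = _mm256_round_pd(_mm256_mul_pd(x, inv_two_pi),
                                    _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        return _mm256_sub_pd(x, _mm256_mul_pd(k, two_pi));
    }

For example, `angVec = reduce_angle_pd(angVec);` right after the load.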

However, I'll optimize this code further. It has a latency problem in the Taylor-series calculation: AVX multiply latency is 5 CPU cycles, which means we can't run one iteration of the inner loop faster than 5 cycles, because the line marked [1] uses the result from the previous iteration.

We can simply unroll it like this.

    for(int i=0; i<count; i+=8)
    {
        __m256d voltVec0 = _mm256_load_pd(volt + i + 0);
        __m256d voltVec1 = _mm256_load_pd(volt + i + 4);
        __m256d angVec0  = _mm256_load_pd(theta + i + 0);
        __m256d angVec1  = _mm256_load_pd(theta + i + 4);
        __m256d sinVec0;
        __m256d sinVec1;
        __m256d cosVec0;
        __m256d cosVec1;

        __m256d angleSq0 = angVec0 * angVec0;
        __m256d angleSq1 = angVec1 * angVec1;
        sinVec0 = angVec0;
        sinVec1 = angVec1;
        cosVec0 = one_pd;
        cosVec1 = one_pd;
        __m256d sin_serise0 = sinVec0;
        __m256d sin_serise1 = sinVec1;
        __m256d cos_serise0 = one_pd;
        __m256d cos_serise1 = one_pd;

        for(int j=0; j<10; ++j)
        {
            sin_serise0 = sin_serise0 * angleSq0;
            cos_serise0 = cos_serise0 * angleSq0;
            sin_serise1 = sin_serise1 * angleSq1;
            cos_serise1 = cos_serise1 * angleSq1;
            sinVec0 = sinVec0 + sin_serise0 * sin_req[j];
            cosVec0 = cosVec0 + cos_serise0 * cos_req[j];
            sinVec1 = sinVec1 + sin_serise1 * sin_req[j];
            cosVec1 = cosVec1 + cos_serise1 * cos_req[j];
        }

        __m256d realResult0 = voltVec0 * cosVec0;
        __m256d imagResult0 = voltVec0 * sinVec0;
        __m256d realResult1 = voltVec1 * cosVec1;
        __m256d imagResult1 = voltVec1 * sinVec1;

        _mm256_store_pd(xReal + i + 0, realResult0);
        _mm256_store_pd(xImag + i + 0, imagResult0);
        _mm256_store_pd(xReal + i + 4, realResult1);
        _mm256_store_pd(xImag + i + 4, imagResult1);
    }

This results in 51~51.5 cycles per 4-component calculation (102~103 cycles for 8 components).

It eliminates the multiply latency in the Taylor-series loop and keeps the AVX multiply unit about 85% busy. Unrolling solves a lot of latency problems as long as it does not spill registers to memory. Generate an asm file while compiling and see how your compiler handles your code. I tried unrolling more, but the result was worse because the code could no longer fit in the 16 AVX registers.

Now let's optimize memory access: replace `_mm256_store_pd()` with `_mm256_stream_pd()`.

    _mm256_stream_pd(xReal + i + 0, realResult0);
    _mm256_stream_pd(xImag + i + 0, imagResult0);
    _mm256_stream_pd(xReal + i + 4, realResult1);
    _mm256_stream_pd(xImag + i + 4, imagResult1);

Replacing the memory-write code results in 48 cycles per 4-component calculation.

`_mm256_stream_pd()` is always faster if you are not going to read the data back soon. It skips the cache system and sends the data directly to the memory controller, so it does not pollute your cache. You will get more data-bus/cache space for reading data by using `_mm256_stream_pd()`.
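One caveat worth adding (an assumption about how the results are consumed, not something stated above): non-temporal stores are weakly ordered, so if another thread reads `xReal`/`xImag` afterwards, the usual pattern is a single fence after the loop:

    // after the streaming-store loop; _mm_sfence() is declared via <immintrin.h>
    for(int i=0; i<count; i+=8)
    {
        // ... loads, Taylor series and _mm256_stream_pd() stores as above ...
    }
    _mm_sfence();   // make the non-temporal stores globally visible in order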

Let's try prefetch.

    for(int i=0; i<count; i+=8)
    {
        _mm_prefetch((const char *)(volt + i + 5 * 8), _MM_HINT_T0);
        _mm_prefetch((const char *)(theta + i + 5 * 8), _MM_HINT_T0);

        // calculations here.
    }

Now I got 45.6~45.8 CPU cycles per calculation, with the AVX multiply unit 94% busy.

Prefetching hints the cache for a faster read. I recommend prefetching about 400~500 CPU cycles ahead, based on physical memory's RAS-CAS latency; here that is roughly 5 iterations of ~100 cycles each, which is why the code prefetches `5 * 8` elements ahead. Physical memory latency can take up to 300 cycles in the worst case; it varies by hardware configuration and will not be much below 200 cycles even if you use expensive low-RAS-CAS-delay memory.

0.064 sec (count = 18562320)

End of sin/cos optimization. :-)

zupet

Please check:

  1. whether the start addresses of the arrays are aligned to 32 bytes. The i7 supports unaligned AVX loads (at higher latency) without complaining with a "bus error".

  2. the cache hit and miss rates, using a profiling tool. It seems that memory access is the bottleneck of Version 2 of the loop.

  3. whether you can downgrade the precision, or use a precomputed table for the sin and cos calculation (see the sketch after this list).

  4. how much performance improvement you plan to achieve. Since the loop takes only 1/3 of the total running time, even if you optimized it down to zero the overall performance would only improve by about 30%.
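A minimal sketch of what the table approach in point 3 could look like (the table size, the wrapping into [0, 2π) and the nearest-entry lookup are illustrative assumptions, not something from the answer above; accuracy is limited by the table resolution):

    #include <math.h>
    #include <stddef.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define TRIG_TABLE_SIZE 4096            /* assumed size: trades memory for accuracy */

    static double sinTable[TRIG_TABLE_SIZE];
    static double cosTable[TRIG_TABLE_SIZE];

    static void initTrigTables(void)        /* call once at startup */
    {
        for (size_t i = 0; i < TRIG_TABLE_SIZE; ++i) {
            double a = (2.0 * M_PI * i) / TRIG_TABLE_SIZE;
            sinTable[i] = sin(a);
            cosTable[i] = cos(a);
        }
    }

    /* Nearest-entry lookup; the angle error is bounded by pi/TRIG_TABLE_SIZE. */
    static inline size_t trigIndex(double angle)
    {
        double t = angle / (2.0 * M_PI);
        t -= floor(t);                      /* wrap into [0, 1) */
        return (size_t)(t * TRIG_TABLE_SIZE + 0.5) % TRIG_TABLE_SIZE;
    }

    /* Usage inside the original loop:
       size_t idx = trigIndex(Angle[k]);
       xReal[k] = Mag[k] * cosTable[idx];
       xImag[k] = Mag[k] * sinTable[idx];  */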

Kun Ling
  • Particularly note #4. There's a cap based on how much of the time that code is running. *Even if you did nothing but return*, you'd only see a 30% boost. You're doing actual work here, though -- and rather efficiently already, from the looks of it -- so you're going to be capped even further. – cHao Aug 13 '13 at 07:46

The timing results that you list show that version 2 runs faster (by 20%) compared to version 1.

Version 1: n = 18562320, Time: 0.2sec
Version 2: n = 18562320, Time: 0.16sec

I am not sure how you are calculating the cycles used in each version. There is a lot of work going on in a processor, and cache fetches might cause a difference in the times, even though v1 uses fewer cycles (again, not knowing how you counted the cycles).

Or, another way to explain it is that with vectorization the data elements are available without any wait time for memory fetches.

JackCColeman
  • v1 is not using less cycles. The loop step is `4` in `v2 (k+=4)` and `1` in `v1 (k++)`. So, v2 is indeed faster. The processor runs at `3.4GHz`, so the number of cycles for each iteration is: `(.2*3.4GHz)/18562320 = 36`. For `version 2`: `(.16*3.4GHz)/(18562320/4) = 117` – Pouya Aug 13 '13 at 15:32
  • @Pouya, why are you multiplying v2 by 4? – JackCColeman Aug 13 '13 at 18:28
  • @JackCColeman because the number of iterations is n/4. Note that the loop step is 4, not 1. – Pouya Aug 13 '13 at 22:06
  • @Pouya, then what you are saying is that the vectorization code operates on 4 numeric values at a time? If that is the case then one machine cycle is running one operation on four data elements at the same time! Hence the efficiency of vectorization and this explains why v2 is running faster. – JackCColeman Aug 13 '13 at 22:18
  • Yes, `_mm256_sincos_pd` is able to work on four `double` numbers simultaneously just the same as `_mm256_mul_pd` which performs multiplication using `256-bit` registers. I inspected the assembly code for `version 1` and it seems that compiler has chosen `XMM` registers which are `128-bit` registers. – Pouya Aug 13 '13 at 23:03