I recommend to make tayler series based sin/cos function and _mm256_stream_pd() to store data. Here is base sample code.
__m256d sin_req[10];
__m256d cos_req[10];
__m256d one_pd = _mm256_set1_pd(1.0);
for(int i=0; i<10; ++i)
{
sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+1) );
cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+0) );
}
for(int i=0; i<count; i+=4)
{
__m256d voltVec = _mm256_load_pd(volt + i);
__m256d angVec = _mm256_load_pd(theta + i);
// sin/cos by taylor series
__m256d angleSq = angVec * angVec;
__m256d sinVec = angVec;
__m256d cosVec = one_pd;
__m256d sin_serise = sinVec;
__m256d cos_serise = one_pd;
for(int j=0; j<10; ++j)
{
sin_serise = sin_serise * angleSq; // [1]
cos_serise = cos_serise * angleSq;
sinVec = sinVec + sin_serise * sin_req[j];
cosVec = cosVec + cos_serise * cos_req[j];
}
__m256d resultReal = voltVec * sinVec;
__m256d resultImag = voltVec * cosVec;
_mm256_store_pd(xReal + i, resultReal);
_mm256_store_pd(xImag + i, resultImag );
}
I could get 57~58 CPU cycles for 4 components calculation.
I searched google and ran some tests for accuracy of my sin/cos. Some articles say 10 iteration is double-precision accurate while -M_PI/2 < angle < +M_PI/2. And my test result show it's more accurate than math.h's sin/cos at -M_PI < angle < +M_PI range. You can increase iteration for more accuracy for large angle valued if needed.
However I'll go deeper optimizing this code. This code have latency problem calculation tayor series. AVX's multiply latency is 5 CPU cycle, this means we can't run one iteration faster than 5 cycle because [1] uses result from previous iteration's result.
We can simply unroll it like this.
for(int i=0; i<count; i+=8)
{
__m256d voltVec0 = _mm256_load_pd(volt + i + 0);
__m256d voltVec1 = _mm256_load_pd(volt + i + 4);
__m256d angVec0 = _mm256_load_pd(theta + i + 0);
__m256d angVec1 = _mm256_load_pd(theta + i + 4);
__m256d sinVec0;
__m256d sinVec1;
__m256d cosVec0;
__m256d cosVec1;
__m256d angleSq0 = angVec0 * angVec0;
__m256d angleSq1 = angVec1 * angVec1;
sinVec0 = angVec0;
sinVec1 = angVec1;
cosVec0 = one_pd;
cosVec1 = one_pd;
__m256d sin_serise0 = sinVec0;
__m256d sin_serise1 = sinVec1;
__m256d cos_serise0 = one_pd;
__m256d cos_serise1 = one_pd;
for(int j=0; j<10; ++j)
{
sin_serise0 = sin_serise0 * angleSq0;
cos_serise0 = cos_serise0 * angleSq0;
sin_serise1 = sin_serise1 * angleSq1;
cos_serise1 = cos_serise1 * angleSq1;
sinVec0 = sinVec0 + sin_serise0 * sin_req[j];
cosVec0 = cosVec0 + cos_serise0 * cos_req[j];
sinVec1 = sinVec1 + sin_serise1 * sin_req[j];
cosVec1 = cosVec1 + cos_serise1 * cos_req[j];
}
__m256d realResult0 = voltVec0 * sinVec0;
__m256d imagResult0 = voltVec0 * cosVec0;
__m256d realResult1 = voltVec1 * sinVec1;
__m256d imagResult1 = voltVec1 * cosVec1;
_mm256_store_pd(xReal + i + 0, realResult0);
_mm256_store_pd(xImag + i + 0, imagResult0);
_mm256_store_pd(xReal + i + 4, realResult1);
_mm256_store_pd(xImag + i + 4, imagResult1);
}
This result 51~51.5 cycles for 4 component calculation. (102~103 cycle for 8 component)
It eliminated mutiply latency in taylor calculation loop and uses 85% of AVX multiply unit. Unrolling will solve lots of latency problems while it does not swap registers to memory. Generate asm file while compiling and see how your compiler handle your code. I tried unroll more but it resulted bad because it could not fit in 16 AVX registers.
Now we go with memory optmize. replace _mm256_store_ps() to _mm256_stream_ps().
_mm256_stream_pd(xReal + i + 0, realResult0);
_mm256_stream_pd(xImag + i + 0, imagResult0);
_mm256_stream_pd(xReal + i + 4, realResult1);
_mm256_stream_pd(xImag + i + 4, imagResult1);
Replacing memory write code result 48 cycles for 4 component calculation.
_mm256_stream_pd() is always faster if you are not going to read it back. It skip cache system and send data direct to memory controller and does not pollute your cache. You will get more databus/cache space for reading datas by using _mm256_stream_pd().
Let's try prefetch.
for(int i=0; i<count; i+=8)
{
_mm_prefetch((const CHAR *)(volt + i + 5 * 8), _MM_HINT_T0);
_mm_prefetch((const CHAR *)(theta + i + 5 * 8), _MM_HINT_T0);
// calculations here.
}
Now I got 45.6~45.8 CPU cycles per calculation. 94% busy of AVX multiply unit.
Prefech hints cache for faster read. I recommend to prefech before 400~500 CPU cycles based on physical memory's RAS-CAS latency. Physical memory latency can take up to 300 cycles on worst case. may vary by hardware configuration, will not be smaller than 200 cycle even if you use expensive low RAS-CAS delay memory.
0.064 sec (count = 18562320)
End of sin/cos optimization. :-)