Intel Xeon Phi provides the "IMCI" instruction set.
I used it to compute "c = a*b", like this:
float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
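float* z = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ; // aligned like x and y, as the 512-bit aligned store requires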
_Cilk_for(size_t i = 0; i < N; i+=16)
{
__m512 x_1Vec = _mm512_load_ps(x+i);
__m512 y_1Vec = _mm512_load_ps(y+i);
__m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
_mm512_store_ps(z+i,ans);
}
Then I tested its performance. With N = 1048576,
it takes 0.083317 s. I wanted to compare this with auto-vectorization,
so I wrote the other version like this:
_Cilk_for(size_t i = 0; i < N; i++)
z[i] = x[i] * y[i];
This version takes 0.025475 s (but sometimes 0.002285 s or less, and I don't know why).
If I change _Cilk_for to #pragma omp parallel for, the performance is poor.
So, if this is the result, why do we need intrinsics at all?
Did I make a mistake anywhere?
Can someone suggest how to optimize this code?
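For reference, here is a minimal sketch of a more repeatable way to take the timings above. It assumes (this is not confirmed) that part of the run-to-run spread comes from one-time costs such as first-touch page faults or thread start-up, so it does one untimed warm-up pass and then reports the best of several timed passes. The names multiply, now_sec and best_time are hypothetical helpers, and the kernel is the plain scalar loop rather than the IMCI or Cilk version, so that it builds with any C compiler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain scalar kernel (hypothetical helper); the compiler may auto-vectorize it. */
static void multiply(const float *x, const float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}

/* Seconds from a monotonic clock, so the measurement is not
   affected by wall-clock adjustments. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* One untimed warm-up pass, then the minimum over `reps` timed passes. */
static double best_time(const float *x, const float *y, float *z,
                        size_t n, int reps)
{
    assert(reps > 0);
    multiply(x, y, z, n);              /* warm-up: not timed */

    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        multiply(x, y, z, n);
        double t = now_sec() - t0;
        if (r == 0 || t < best)
            best = t;                  /* keep the fastest pass */
    }
    return best;
}
```

Calling best_time(x, y, z, N, 20) on the arrays from the question would give one number per variant to compare. Whether the warm-up actually removes the 0.025 s versus 0.002 s spread is something the measurement itself would have to show.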