(1) For some matrix sizes the code works fine, but for other sizes it computes the wrong matrix product. I used the AVX2 instruction set carefully, but I cannot figure out where the problem is.
(2) When I only vectorize the code with AVX2, the execution time is lower than when I vectorize it with AVX2 and also parallelize it with OpenMP. I expected the execution time to be lower when both vectorization (AVX2) and parallelization (OpenMP) are used.
void mat_mul_pl(int size, double **mat1, double **mat2, double **result)
{
    __m256d vec_multi_res = _mm256_setzero_pd(); //Initialize vector to zero
    __m256d vec_mat1 = _mm256_setzero_pd(); //Initialize vector to zero
    __m256d vec_mat2 = _mm256_setzero_pd();
    int i, j, k;
    // #pragma omp parallel for schedule(static)
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; ++j)
        {
            //Stores one element in mat1 and use it in all computations needed before proceeding
            //Stores as vector to increase computations per cycle
            vec_mat1 = _mm256_set1_pd(mat1[i][j]);
            #pragma omp parallel for
            for (k = 0; k < size; k += 8)
            {
                vec_mat2 = _mm256_loadu_pd((void*)&mat2[j][k]); //Stores row of second matrix (eight in each iteration)
                vec_multi_res = _mm256_loadu_pd((void*)&result[i][k]); //Loads the result matrix row as a vector
                vec_multi_res = _mm256_add_pd(vec_multi_res, _mm256_mul_pd(vec_mat1, vec_mat2)); //Multiplies the vectors and adds to the result vector
                _mm256_storeu_pd((void*)&result[i][k], vec_multi_res); //Stores the result vector into the result array
            }
        }
    }
}
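
A possible explanation for both symptoms, offered as a sketch under assumptions and not verified against the original build: a `__m256d` holds four doubles, not eight, so stepping `k` by 8 skips every second group of four columns, and sizes that are not a multiple of four can read and write past the end of a row. The `#pragma omp parallel for` on the innermost loop also shares `vec_mat2` and `vec_multi_res` between threads, which is a data race, and it opens a parallel region `size * size` times, which may account for the slowdown. The version below steps by 4, finishes each row with a scalar tail loop, keeps the vectors local to the loop body, and parallelizes over `i` so that each thread owns whole rows of `result`. The names `mat_mul_fixed`, `mat_mul_ref`, `alloc_matrix`, `free_matrix` and `count_mismatches` are hypothetical, and `result` is assumed to be zero-initialized by the caller.

```c
#include <stdlib.h>
#include <immintrin.h>

/* Hypothetical corrected variant of mat_mul_pl (the name is an assumption).
   Assumes result is zero-initialized and every row holds `size` doubles.
   The target attribute is a GCC/Clang extension that lets this compile
   without -mavx; the intrinsics used here need only AVX. */
__attribute__((target("avx")))
void mat_mul_fixed(int size, double **mat1, double **mat2, double **result)
{
    /* Parallelize the outer loop: each thread writes its own rows of result. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            double a = mat1[i][j];
            __m256d vec_a = _mm256_set1_pd(a);   /* broadcast mat1[i][j] */
            int k = 0;
            /* A __m256d holds 4 doubles, so advance 4 columns per step. */
            for (; k + 4 <= size; k += 4) {
                __m256d vec_b = _mm256_loadu_pd(&mat2[j][k]);
                __m256d vec_r = _mm256_loadu_pd(&result[i][k]);
                vec_r = _mm256_add_pd(vec_r, _mm256_mul_pd(vec_a, vec_b));
                _mm256_storeu_pd(&result[i][k], vec_r);
            }
            /* Scalar tail for sizes that are not a multiple of 4. */
            for (; k < size; k++)
                result[i][k] += a * mat2[j][k];
        }
    }
}

/* Plain scalar reference, used only to check the vectorized version. */
void mat_mul_ref(int size, double **mat1, double **mat2, double **result)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
                result[i][k] += mat1[i][j] * mat2[j][k];
}

/* Hypothetical helper: allocate a zero-filled size x size matrix. */
static double **alloc_matrix(int size)
{
    double **m = malloc((size_t)size * sizeof *m);
    for (int i = 0; i < size; i++)
        m[i] = calloc((size_t)size, sizeof **m);
    return m;
}

/* Hypothetical helper: release a matrix made by alloc_matrix. */
static void free_matrix(int size, double **m)
{
    for (int i = 0; i < size; i++)
        free(m[i]);
    free(m);
}

/* Hypothetical check: number of entries where the two versions differ.
   Inputs are small integers, so both products are exact in double. */
int count_mismatches(int size)
{
    double **a = alloc_matrix(size), **b = alloc_matrix(size);
    double **r1 = alloc_matrix(size), **r2 = alloc_matrix(size);
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            a[i][j] = (double)((i + 2 * j) % 7);
            b[i][j] = (double)((3 * i + j) % 5);
        }
    mat_mul_fixed(size, a, b, r1);
    mat_mul_ref(size, a, b, r2);
    int bad = 0;
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            if (r1[i][j] != r2[i][j])
                bad++;
    free_matrix(size, a); free_matrix(size, b);
    free_matrix(size, r1); free_matrix(size, r2);
    return bad;
}
```

Building with something like `gcc -O2 -fopenmp` enables the pragma; without `-fopenmp` it is ignored and the loops run serially. Calling `count_mismatches(7)` or `count_mismatches(13)` is one way to exercise sizes that are not a multiple of four.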