
(1) For some matrix sizes the code works fine, but for other sizes it calculates the matrix multiplication wrongly. Although I used the AVX2 instruction set carefully, I cannot figure out where the problem is.

(2) When I only vectorize the code with the AVX2 instruction set, the execution time is lower than when I vectorize with AVX2 and also parallelize with OpenMP, although the execution time should be lower when both vectorization (AVX2) and parallelization (OpenMP) are used. (A timing sketch follows the code below.)

#include <immintrin.h>

void mat_mul_pl(int size, double **mat1, double **mat2, double **result)
{
    __m256d vec_multi_res = _mm256_setzero_pd(); // Initialize vector to zero
    __m256d vec_mat1 = _mm256_setzero_pd();      // Initialize vector to zero
    __m256d vec_mat2 = _mm256_setzero_pd();

    int i, j, k;

    // #pragma omp parallel for schedule(static)
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; ++j)
        {
            // Broadcast one element of mat1 and use it in all computations
            // needed before proceeding; stored as a vector to increase
            // computations per cycle
            vec_mat1 = _mm256_set1_pd(mat1[i][j]);
#pragma omp parallel for
            for (k = 0; k < size; k += 8)
            {
                vec_mat2 = _mm256_loadu_pd((void*)&mat2[j][k]);        // Loads four doubles of the second matrix's row (the loop strides by eight)
                vec_multi_res = _mm256_loadu_pd((void*)&result[i][k]); // Loads the result matrix row as a vector
                vec_multi_res = _mm256_add_pd(vec_multi_res, _mm256_mul_pd(vec_mat1, vec_mat2)); // Multiplies the vectors and adds to the result vector

                _mm256_storeu_pd((void*)&result[i][k], vec_multi_res); // Stores the result vector into the result array
            }
        }
    }
}
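
To check claim (2) fairly, a minimal timing sketch using `omp_get_wtime()`, built with e.g. `gcc -O2 -mavx2 -fopenmp`; the matrix size and the `alloc_matrix` helper here are hypothetical, only the call to `mat_mul_pl` above comes from the question:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

void mat_mul_pl(int size, double **mat1, double **mat2, double **result);

/* Hypothetical helper: a size x size matrix of zero-initialized rows. */
static double **alloc_matrix(int size)
{
    double **m = malloc((size_t)size * sizeof *m);
    for (int i = 0; i < size; i++)
        m[i] = calloc((size_t)size, sizeof **m);
    return m;
}

int main(void)
{
    int size = 1024; /* hypothetical size, a multiple of 8 */
    double **a = alloc_matrix(size);
    double **b = alloc_matrix(size);
    double **c = alloc_matrix(size);

    double t0 = omp_get_wtime();
    mat_mul_pl(size, a, b, c);
    double t1 = omp_get_wtime();
    printf("%d max threads: %.3f s\n", omp_get_max_threads(), t1 - t0);
    return 0;
}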
  • _The execution time should be lower when both vectorization (AVX2) and parallelization (OpenMP) are used._ Not always true. For small enough vectors, the overhead of starting up [and stopping] the threads is higher than the non-threaded case. Also, [assuming `x86` cpu], some of the SMT cpu cores on some processor models may [have to] share H/W and may have to serialize access to the SIMD H/W used. More threads means more memory bus contention [and too many threads saturates the memory bus]. Does `openmp` understand `AVX2` et al.? What about cache locality of reference? – Craig Estey Apr 26 '20 at 20:29
  • Your multiplication code assumes the matrix size is a multiple of 8. Generally the OpenMP directives should be on the outermost loop (the `i` loop in this case); see the sketch after this comment thread. – 1201ProgramAlarm Apr 26 '20 at 20:57
  • @CraigEstey: "serialize" is not really an accurate description of SMT / hyperthreads competing for cycles on the load/store and FMA execution units of a physical core. Yes contention, but no more so than contending for front-end cycles and integer ALU ports. But serialize would imply some ordering. But yes, this is probably not bound on latency or branch misses so SMT is not super helpful; a single HW thread can probably keep that core's memory pipeline full. With 2 loads and 1 store per FMA, that will be the bottleneck even if it all hits in L1d cache. – Peter Cordes Apr 26 '20 at 22:13
  • @PeterCordes Hi Peter. Yes, "serialize" was a bit loose. Maybe "stall" would be better? If [two] sibling hyperthreads share the same FPU/SIMD H/W (model specific?), one would have to wait for the other to complete. Or, if the H/W has N pipeline stages, and 1st sibling has completed stage 0 [working on stage 1], can 2nd sibling immediately load into stage 0? Does (how does) SIMD H/W load balance to prevent a sibling from monopolizing it? – Craig Estey Apr 26 '20 at 22:27
  • @CraigEstey: I'd say "contend for" or "compete for" access to those execution resources. "stall" usually implies a larger wait for something, like a cache miss, that stalls the whole pipeline, not execution of one instruction or losing a cycle on one ALU port. OoO exec means that delaying one instruction by a cycle doesn't delay independent work. BTW, sibling hyperthreads compete for *everything* including integer ALUs; that's the point of hyperthreading / SMT. Perhaps you're thinking of Bulldozer-style CMT where two weak integer cores share an FPU/SIMD unit. – Peter Cordes Apr 26 '20 at 22:32
  • @CraigEstey: Fine-grained SMT like Intel Hyperthreading, or whatever AMD Zen calls it, alternates cycles between HW threads in the front-end, unless one is stalled. Out-of-order exec in the back-end still goes on an oldest-ready-first basis. The ROB is usually statically partitioned to stop one thread from filling up the back-end with instructions that are stuck waiting for a cache miss. Further reading: [Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/) is excellent and covers SMT. Also https://www.realworldtech.com/sandy-bridge/ – Peter Cordes Apr 26 '20 at 22:36
  • @CraigEstey: Also https://www.realworldtech.com/alpha-ev8-smt/ from 2000 is an intro to SMT in one of the first microarchitectures to use it, Alpha EV8. It's not super detailed about front-end vs. later parts of the pipeline, though. Coarser implementations are possible, like https://en.wikipedia.org/wiki/Simultaneous_multithreading mentions: an in-order pipeline might switch to another HW context on a long-duration stall like a cache miss. – Peter Cordes Apr 26 '20 at 22:42
  • @CraigEstey: [Is duplication of state resources considered optimal for hyper-threading?](https://stackoverflow.com/a/39984055) talks some about when code does/doesn't benefit from HT. – Peter Cordes Apr 26 '20 at 22:56
  • @PeterCordes Thanks. I've perused some of the links, and I'm _reading_ the realworldtech sandy bridge architecture [Fascinating]. It is answering a number of questions that started to float into my head. – Craig Estey Apr 27 '20 at 00:09
  • @CraigEstey: cool :) I meant to link https://www.realworldtech.com/bulldozer/ for comparison: instead of SMT they do CMP with two integer cores sharing an FPU/SIMD unit, and sharing some front-end resources. Downside: a single thread can't take advantage of as many integer resources. Upside: with both cores active, often more total performance than from one wider SMT core with the same caches that the pair of cores share. – Peter Cordes Apr 27 '20 at 00:25
  • @PeterCordes The sandy bridge article did mention some of that as comparison against bulldozer. I'll have to finish tomorrow, my girlfriend is "requesting" my presence. A while back, I showed your xkcd avatar to her. She thinks it applies to me as well :-) – Craig Estey Apr 27 '20 at 00:43
  • @1201ProgramAlarm Yes, I know it assumes the size is a multiple of 8, but it calculates wrong results for some multiples of 8 as well. Secondly, putting the OpenMP directives on the outermost loop corrupts the matrix multiplication result even further. – FarhanIoTDeveloper Apr 27 '20 at 05:58
  • @CraigEstey Yes, for small vectors it may create overhead, but the issue remains the same with large vector sizes, even when I execute with the maximum hardware threads of my CPU, i.e. 2. – FarhanIoTDeveloper Apr 27 '20 at 06:03
  • If this is not a learning exercise, then stop now and go find an existing, free, well tuned (cache-blocked) matrix multiplication routine in a standard library, because you are wasting your time. ("The best code is the code I do not have to write"). – Jim Cownie Apr 27 '20 at 11:05
  • Using k += 4 solved the matrix multiplication problem with AVX2. But when I try to insert OpenMP directives (#pragma omp parallel for), the matrix multiplication goes wrong. I have tried almost every position, i.e. the inner loop and the outer loop, but the OpenMP loop does not work properly with the AVX2 instructions. Can anyone suggest a suitable position for the OpenMP directives? – FarhanIoTDeveloper Apr 27 '20 at 14:58
  • You should move the definitions of your 3 variables used in the innermost loop - `vec_multi_res`, `vec_mat1`, and `vec_mat2` - inside the innermost loop where they are used, so that each thread will have its own copy of them. Otherwise they have one definition shared among all threads, which can cause your OpenMP calcs to be incorrect; see the sketch after this comment thread. – 1201ProgramAlarm Apr 27 '20 at 19:51
  • @1201ProgramAlarm Your answer is approximately correct and it helped me; however, I am facing another problem, so I am going to post about it in the answer section below. Kindly take a look. – FarhanIoTDeveloper Apr 27 '20 at 22:11
  • @PeterCordes Can you weigh in here: https://stackoverflow.com/questions/61470328/is-an-extra-move-somehow-faster-when-doing-division-by-multiplication – Craig Estey Apr 28 '20 at 01:01
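
Putting the two comment suggestions together, here is a minimal sketch (not the poster's final code) of the layout 1201ProgramAlarm describes: the `parallel for` sits on the outermost `i` loop, the vector temporaries are declared inside the loops so each thread gets private copies, and the `k` loop steps by 4 since a 256-bit register holds four doubles. It assumes `size` is a multiple of 4 and that each matrix row is a contiguous `double` array:

#include <immintrin.h>

void mat_mul_pl(int size, double **mat1, double **mat2, double **result)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < size; i++)   // loop index declared in the for: private per thread
    {
        for (int j = 0; j < size; ++j)
        {
            __m256d vec_mat1 = _mm256_set1_pd(mat1[i][j]); // broadcast one element of mat1
            for (int k = 0; k < size; k += 4)              // 4 doubles per 256-bit register
            {
                __m256d vec_mat2 = _mm256_loadu_pd(&mat2[j][k]);
                __m256d vec_res  = _mm256_loadu_pd(&result[i][k]);
                vec_res = _mm256_add_pd(vec_res, _mm256_mul_pd(vec_mat1, vec_mat2));
                _mm256_storeu_pd(&result[i][k], vec_res);
            }
        }
    }
}

With the parallelism on the outer loop, each thread owns whole rows of `result`, so no two threads ever store to the same memory, and the temporaries declared inside the loop body are automatically private to each thread.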

1 Answer


I am not getting any compilation errors, but when I execute the program I get a segmentation fault (core dumped). By debugging I noticed that, at a certain size (greater than 100), the core dump happens at the instruction vec_mat2 = _mm256_load_pd((void*)&mat2[j][k]). Please see the updated code below, revised as per your instructions.

void mat_mul_pl(int size, double **mat1, double **mat2, double **result)
{
    int i, j, k;

#pragma omp Parallel shared(vec_mat1,vec_mat2,vec_multi_res) private(i,j,k)
    {
#pragma omp for schedule(static)
        for (i = 0; i < size; i++)
        {
            for (j = 0; j < size; ++j)
            {
                __m256d vec_mat1 = _mm256_set1_pd(mat1[i][j]);
                for (k = 0; k < size; k += 4)
                {
                    __m256d vec_mat2 = _mm256_load_pd((void*)&mat2[j][k]);
                    __m256d vec_multi_res = _mm256_load_pd((void*)&result[i][k]);
                    vec_multi_res = _mm256_add_pd(vec_multi_res, _mm256_mul_pd(vec_mat1, vec_mat2));
                    _mm256_store_pd((void*)&result[i][k], vec_multi_res);
                }
            }
        }
    }
}
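
Two things stand out here, offered as a likely explanation rather than a confirmed diagnosis. First, `#pragma omp Parallel` is not a valid OpenMP directive (the keyword is lowercase `parallel`), and the `shared(...)` clause names variables that are not declared at that scope, so the directive is probably being ignored or rejected rather than parallelizing anything. Second, `_mm256_load_pd` and `_mm256_store_pd` require 32-byte-aligned addresses and fault otherwise; rows obtained from plain `malloc` carry no such guarantee, which matches a size-dependent segmentation fault. A minimal sketch, assuming the caller allocates the rows: either keep the unaligned `_mm256_loadu_pd`/`_mm256_storeu_pd` intrinsics from the original code, or allocate each row with 32-byte alignment, e.g.:

#include <stdlib.h>

/* Hypothetical row allocator: C11 aligned_alloc returns 32-byte-aligned
   storage, which _mm256_load_pd/_mm256_store_pd require. The byte count
   must be a multiple of the alignment, which holds whenever size is a
   multiple of 4 (size * sizeof(double) = size * 8 bytes). */
static double *alloc_row(int size)
{
    return aligned_alloc(32, (size_t)size * sizeof(double));
}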