
I have little experience with parallel programming and was wondering if anyone could have a quick glance at a bit of code I've written and see if there are any obvious ways I can improve the efficiency of the computation.

The difficulty arises from the fact that I have multiple matrix operations of unequal dimensionality to compute, so I'm not sure of the most condensed way of coding the computation.

Below is my code. Note that this code DOES work. The matrices I am working with are of dimension approximately 700x700 [see int s below] or 700x30 [int n].

Also, I am using the Armadillo library for my sequential code. It may be the case that parallelizing with OpenMP while retaining the Armadillo matrix classes is slower than defaulting to the standard library; does anyone have an opinion on this (before I spend hours overhauling!)?

double start, end, dif;

int i,j,k;      // iteration counters
int s,n;        // matrix dimensions

mat B; B.load(...location of stored s*n matrix...); // input objects loaded from file
mat I; I.load(...s*s matrix...);
mat R; R.load(...s*n matrix...);
mat D; D.load(...n*n matrix...);

double e = 0.1; // scalar parameter

s = B.n_rows; n = B.n_cols;

mat dBdt; dBdt.zeros(s,n); // object for storing output of function

// 100X sequential computation using Armadillo linear algebraic functionality

start = omp_get_wtime();

for (int r=0; r<100; r++) {
    dBdt = B % (R - (I * B)) + (B * D) - (B * e);
}

end = omp_get_wtime();
dif = end - start;
cout << "Seq computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;

// 100 * parallel computation using OpenMP

omp_set_num_threads(8);

start = omp_get_wtime(); // restart the timer for the parallel version

for (int r=0; r<100; r++) {

    dBdt.zeros(s,n); // reset the accumulator so the I * B loop below starts from zero each iteration

//          parallel computation of I * B 
#pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < s; k++) {
                dBdt(i, j) += I(i, k) * B(k, j);
            }
        }
     }

//          parallel computation of B % (R - (I * B)) 
#pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            dBdt(i, j)  = R(i, j) - dBdt(i, j);
            dBdt(i, j) *= B(i, j);
            dBdt(i, j) -= B(i, j) * e;
        }
    }

//          parallel computation of B * D 
#pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
   for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                dBdt(i, j) += B(i, k) * D(k, j);
            }
        }
    }    
}

end = omp_get_wtime();
dif = end - start;
cout << "OMP computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;

Running 8 threads on 4 hyper-threaded cores, I get the following output:

Seq computation: 5.54926e-10
relaxation time = 0.130031
OMP computation: 5.54926e-10
relaxation time = 2.611040

This suggests that although both methods produce the same result, the parallel formulation is roughly 20 times slower than the sequential one.

It is possible that, for matrices of this size, the overheads involved in this 'variable-dimension' problem outweigh the benefits of parallelizing. Any insights would be much appreciated.

Thanks in advance,

Jack

  • It's unclear to me what exactly you are trying to do. So the Armadillo code works and does what it should do? Then use it. If you set up Armadillo correctly it's as fast as possible (cache, SIMD, multithreading; it uses [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) internally), maybe only missing matrix-reordering (not sure if that's done or supported). – sascha Sep 06 '17 at 14:38
  • After each parallel loop there is a fence, i.e. all threads wait until the last one finishes with the loop. You have 3 separate parallel sections, which smells like a big effectively non-parallel part of the execution time. Also, 700x700 is not a very big matrix; maybe you need to increase the size to see a benefit from running in parallel – 463035818_is_not_an_ai Sep 06 '17 at 14:50
  • I'm hoping to increase the speed of the computation by running the simulation on an HPC with more than 4 cores. Does the internal BLAS implementation automatically employ all available cores? – jack.l Sep 06 '17 at 14:52
  • In the actual simulation I am running, I need to repeat this computation many thousands of times. I would like to scale up, but approx 700x700 already takes several days to complete a single run, so even if I were to gain from the parallelisation it would still be quite an exhausting process. Thanks for your comments by the way! – jack.l Sep 06 '17 at 14:54
  • BLAS is more of a standard with many implementations. Did you read the wiki-link? Yes, sure, it can also use 32 cores (depending on the implementation in use), but who knows what kind of scaling/speedup will be achieved (dimension-dependent). Probably better than what you would achieve yourself, though, as it is highly optimized. Maybe check out [Matrix chain multiplication](https://en.wikipedia.org/wiki/Matrix_chain_multiplication) too (I already mentioned it; it's usually a good idea with big data, not sure if the setup phase would pay off in your case) – sascha Sep 06 '17 at 15:08
  • Could you kindly **present your HPC-domain insights** on why you are trying to use 8 OMP-threads in a situation where your hardware has 4 CPU-cores? – user3666197 Sep 07 '17 at 00:42
  • Thanks @sascha, I'll look into the matrix chain multiplication. – jack.l Sep 07 '17 at 09:38
  • @jack.l With all due respect, it ought to be clear that **there is not any Matrix-chain** present in the process. **Neither `BD` nor `IB` is a Matrix-chain**. Better to focus on the core issues. – user3666197 Sep 07 '17 at 18:04
  • @user3666197 Okay. That's on me, as I did not evaluate his code carefully and mentioned it twice. – sascha Sep 07 '17 at 19:08

2 Answers


If you use a compiler which corrects your bad loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimizations. As recommended by others, you should consider an optimized library such as MKL or ACML. The default gfortran BLAS typically provided with distros is not multithreaded.
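
As a rough, illustrative sketch (not from this answer itself): Armadillo dispatches its mat * mat products to whatever BLAS it was linked against, so the original expression needs no code changes to pick up a threaded BLAS such as OpenBLAS or MKL; only the link line and an environment variable change. The link flags and the OPENBLAS_NUM_THREADS setting below are assumptions for an OpenBLAS build.

//   g++ sim.cpp -O3 -larmadillo -lopenblas      <- assumed link line
//   OPENBLAS_NUM_THREADS=4 ./a.out              <- thread count used by the BLAS

#include <armadillo>
using namespace arma;

mat dBdt_via_blas(const mat& B, const mat& I, const mat& R,
                  const mat& D, double e)
{
    // I * B and B * D are dispatched to the linked BLAS (dgemm);
    // %, +, - and the scalar multiply stay inside Armadillo.
    return B % (R - (I * B)) + (B * D) - (B * e);
}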

  • Sorry, I had to look up a few of those terms. Is it correct that Armadillo's BLAS back-end does support MKL? (https://software.intel.com/en-us/articles/intelr-mkl-and-c-template-libraries) In which case, do you think the most optimized solution is to simply use the built-in arma functionality? – jack.l Sep 07 '17 at 09:35

The Art of HPC is all about efficiency ( poor grants never get HPC cluster quota )

  • so the first hope is that your process will never re-read from file

Why? This would be an HPC-killer:

I need to repeat this computation many thousands of times

Fair enough to say, this comment has increased the need to completely review the approach and to re-design the future solution so that it does not rely on a few tricks, but instead genuinely gains from your case-specific arrangement.

Last but not least - the [PARALLEL] scheduling is not needed, as a "just"-[CONCURRENT]-process scheduling is quite enough here. There is no need to orchestrate any explicit inter-process synchronisation or any message-passing, and the process can simply be arranged for the best performance possible.


No "...quick glance at a bit of code..." will help

You first need to understand both your whole process and the hardware resources it will be executed on.

The CPU-type will tell you the available instruction-set extensions for advanced tricks; the L3- / L2- / L1-cache sizes plus the cache-line sizes will help you decide on the best cache-friendly re-use of cheap data-access ( not paying hundreds of [ns] when one can operate smarter, on just a few [ns], from a not-yet-evicted NUMA-core-local copy ).
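
For the Linux/glibc case, a minimal sketch of how one might read those cache numbers at run time (the _SC_LEVEL*_CACHE_* constants are glibc extensions, and some platforms report 0 for levels they do not expose):

#include <unistd.h>
#include <cstdio>

int main()
{
    // cache geometry as reported by glibc; useful for sizing blocked loops
    printf("L1d size      : %ld B\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size : %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2  size      : %ld B\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3  size      : %ld B\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}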


The Maths first, implementation next:

As given dBdt = B % ( R - (I * B) ) + ( B * D ) - ( B * e )

On a closer look, anyone ought to be ready to realise the HPC/cache-alignment priorities and the wrong-looping traps:

dBdt = B % ( R - ( I * B ) )   ELEMENT-WISE OP B[s,n]-COLUMN-WISE
     +               ( B * D ) SUM.PRODUCT  OP B[s,n].ROW-WISE MUL-BY-D[n,n].COL
     -               ( B * e ) ELEMENT-WISE OP B[s,n].ROW-WISE MUL-BY-SCALAR

 ROW/COL-SUM.PRODUCT OP -----------------------------------------+++++++++++++++++++++++++++++++++++++++++++++
 ELEMENT-WISE OP ---------------------------------------------+  |||||||||||||||||||||||||||||||||||||||||||||
 ELEMENT-WISE OP ----------------------+                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       |                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       v                      v  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 dBdt[s,n]        =   B[s,n]           % /   R[s,n]           - /  I[s,s]                  . B[s,n]           \ \
     _________[n]         _________[n]   |       _________[n]   |      ________________[s]       _________[n]  | |
    |_|       |          |_|       |     |      |_|       |     |     |________________|        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |   =      | .       |   % |      | .       |   - |     |                |   .    | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
 [s]|_________|       [s]|_________|     |   [s]|_________|     |  [s]|________________|     [s]|_|_______|    | |
                                         \                      \                                             / /

                      B[s,n]              D[n,n]          
                          _________[n]        _________[n]
                         |_________|         | |       |   
                         | .       |         | |       |  
                         | .       |         | |       |  
                         | .       |         | |       |  
                  +      | .       |   .  [n]|_|_______|   
                         | .       |      
                         | .       |      
                         | .       |             
                      [s]|_________|      


                      B[s,n]                
                          _________[n]      
                         |_| . . . |        
                         | .       |        
                         | .       |        
                         | .       |        
                  -      | .       |    *  REGISTER_e
                         | .       |        
                         | .       |        
                         | .       |        
                      [s]|_________|        

Having this in mind, efficient HPC loops will look much different

Depending on the real CPU caches, the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) in a single phase, together with the highest-re-use-efficiency part of the element-wise longest pipeline B % ( R - ( I * B ) ), which here has a chance to re-use ~ 1000 x ( n - 1 ) cache-hits on the B-column-aligned data that ought to fit quite well into the L1-DATA-cache footprint, so achieving savings in the order of seconds just from cache-aligned loops.
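
A minimal serial sketch of such a fused pass, assuming Armadillo's column-major storage (the function name dBdt_fused and the scratch vectors t, u are illustrative, not taken from the question): one sweep per column of dBdt, no intermediate matrices, and every inner loop walks contiguously down a column.

#include <armadillo>
#include <algorithm>
#include <vector>
using namespace arma;

void dBdt_fused(mat& dBdt, const mat& B, const mat& I,
                const mat& R, const mat& D, double e)
{
    const int s = (int)B.n_rows;
    const int n = (int)B.n_cols;
    std::vector<double> t(s), u(s);      // scratch: one column of I*B and of B*D

    for (int j = 0; j < n; ++j) {        // one column of dBdt at a time
        std::fill(t.begin(), t.end(), 0.0);
        std::fill(u.begin(), u.end(), 0.0);

        for (int k = 0; k < s; ++k) {    // t = I * B.col(j)
            const double Bkj = B(k, j);
            for (int i = 0; i < s; ++i)
                t[i] += I(i, k) * Bkj;   // walks down column k of I: contiguous
        }
        for (int k = 0; k < n; ++k) {    // u = B * D.col(j)
            const double Dkj = D(k, j);
            for (int i = 0; i < s; ++i)
                u[i] += B(i, k) * Dkj;   // walks down column k of B: contiguous
        }
        for (int i = 0; i < s; ++i)      // element-wise combine, one sweep
            dBdt(i, j) = B(i, j) * (R(i, j) - t[i]) + u[i] - B(i, j) * e;
    }
}

The mathematics is unchanged -- dBdt(i,j) = B(i,j) * ( R(i,j) - (I*B)(i,j) ) + (B*D)(i,j) - e * B(i,j); only the traversal order and the amount of intermediate storage differ.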


After this cache-friendly loop-alignment is finished, distributed processing may help next, not before.

So, an experimentation plan setup:

Step 0: The Ground-Truth: ~ 0.13 [s] for dBdt[700,30] using Armadillo in 100 test-loops

Step 1: The manual-serial: test the rewards of the best cache-aligned code ( not the posted one, but a math-equivalent, cache-line-re-use-optimised one -- where there ought to be no more than 4x for(){...} code-blocks, 2-nested, with the remaining 2 inside, to meet the Linear Algebra rules without devastating the benefits of cache-line alignment ). There remains some residual potential to gain a bit more in [PTIME] from using a duplicated [PSPACE] data-layout ( both a FORTRAN-order and a C-order copy, for the respective re-reading strategies ), as the matrices are so miniature in size and the L2- / L1-DATA-caches available per CPU-core have grown well in scale.

Step 2: The manual-omp( <= NUMA_cores - 1 ): test whether omp can indeed yield any "positive" Amdahl's-Law speedup ( beyond the omp-setup overhead costs ). A careful process-to-CPU_core affinity-mapping may help avoid any cache-eviction introduced by a non-HPC thread spoiling the cache-friendly layout on a configuration-"reserved" set of ( NUMA_cores - 1 ) cores, where all other ( non-HPC ) processes ought to be affinity-mapped onto the last ( shared ) CPU-core, thus helping the HPC-process cores keep their cache-lines un-evicted by any kernel/scheduler-injected non-HPC thread.
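
A minimal sketch of such a Step-2 setup, assuming a 4-core machine like the one in the question; OMP_PLACES / OMP_PROC_BIND are the standard OpenMP 4.0 affinity controls, and the thread count of 3 stands in for ( NUMA_cores - 1 ).

//   OMP_PLACES=cores OMP_PROC_BIND=close ./simulation     <- assumed invocation

#include <omp.h>

void configure_hpc_threads()
{
    omp_set_dynamic(0);       // keep the team size fixed
    omp_set_num_threads(3);   // NUMA_cores - 1 : leave one core for the OS / non-HPC threads
}

// ...then split only the outer column loop of the fused pass sketched above:
//
//   #pragma omp parallel for schedule(static)
//   for (int j = 0; j < n; ++j) { ... }   // t, u declared inside the loop body,
//                                         // so each thread keeps its own scratch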

( As seen in (2), there are arrangements, derived from HPC best-practices, that no compiler ( even a magic-wand-equipped one ) would ever be able to implement, so do not hesitate to ask your PhD Tutor for a helping hand if your Thesis needs some HPC expertise, as it is not easy to build this up by trial-and-error in this quite expensive experimental domain, and your primary domain is neither Linear Algebra nor ultimate CS-theoretic / HW-specific cache-strategy optimisations. )


Epilogue:

Using smart tools in an inappropriate way does not bring anything more than additional overheads ( task-splits/joins + memory-translations ( worse with atomic-locking ( worst with blocking / fence / barriers ) ) ).
