
I wrote a hybrid OpenMP/MPI program that distributes the iterations of a for(){...} loop across non-shared-memory systems through MPI. Within one machine, I call OpenMP with a static schedule.

My code abstractly looks like the following:

#pragma omp parallel for schedule(static) collapse(2)
for(j=1;j>-2;j-=2){
    for(i=0;i<n;i++){
        ...                    // nested loop code here
    }
}

I compared the running times of the exact same piece of code in MATLAB with a parfor loop, and my C code is consistently about 30% slower.

I expect at least equal running times, if not faster for the C code.

I monitor running times with the shell time command, like the following:

time matlab script.m

time mpirun -np 1 --bind-to none -x OMP_NUM_THREADS=32 ./script

I am using OpenMP 3.1 with GCC 4.7.3 and Open MPI v1.10.3.

I call the program with the --bind-to none option of Open MPI and OMP_PROC_BIND=TRUE for OpenMP.

Any ideas why that happens?


EDIT

Assuming 32 threads, in MATLAB, the loop looks like this:

parfor k=1:nWorkers
    for j=[-1,1]
       for i=1:n
           ...                 % nested loop here
       end
    end
end

2 Answers


The loop you parallelize is for(j=1;j>-2;j-=2), which runs only for j=1 and j=-1. Therefore you only get two threads, each doing n iterations of the inner loop. I can imagine you are doing something else in MATLAB, but you did not provide any code, so I can't say anything about your MATLAB code.

Also, you are combining MPI (with just one process) with OpenMP; are you sure that is what you are looking for?

  • Isn't collapse(2) supposed to merge both loops into one, as in [here](http://stackoverflow.com/questions/28482833/understanding-the-collapse-clause-in-openmp)? I set -np 1 to be able to compare with MATLAB on a shared-memory system. In MATLAB I have the same loop, but with a parfor instead of a for. – Marouen May 01 '17 at 13:58

TL;DR: best practices for both MATLAB and C are included below.

(Entering the HPC and distributed-computing domains requires a new sort of self-discipline: the toys get more and more complicated, and the net effects are not easily deconstructed to their respective root causes if one relies only on previous, purely [SERIAL] scheduling experience from common programming languages.)

Never use the shell time command to seriously measure / compare performance:

all the more once your distributed concurrent processes reach some high number of OpenMP threads / MPI processes, which sets the divisor N in the Amdahl's Law denominator.

                                1
processSPEEDUP = _______________________________
                               ( 1 - SEQ_part )     <---- CONCURRENCY MAY HELP
                  SEQ_part +  _________________
                  ^^^                 N             <---- CONCURRENCY HARNESSED
                  |||
                  SEQ____________________________________ CONCURRENCY IMMUNE PART

Expectations ought to be realistic:


Never expect a SPEEDUP-dinner to be as FREE
as an overhead-naive
formulation of Amdahl's Law may seem to have promised

Once the add-on costs of all overheads are accounted for, the overhead-strict re-formulation of Amdahl's Law reads:
                               1
processSPEEDUP = ___________________________________
                             ( 1 - SEQ_part )        <-- CONCURRENCY MAY HELP
                 SEQ_part + _________________ + CoST 
                 ^^^                 N          ^^^
                 |||                 ^          |||
                 |||                 |          |||        
                 |||                 +------------------ CONCURRENCY HARNESSED
                 |||                            |||
                 |||                            |||      A GAIN WITHOUT PAIN?
                 |||                            |||      
                 |||                            |||      NEVER, SORRY,
                 |||                            |||      COMMUNISM DOES NOT WORK,
                 |||                            |||      ALL GOT AT THE COST OF
                 |||                            +++----- COSTS-OF-ALL-OVERHEADS
                 |||                                     ++++++++++++++++++++++
                 |||                                     +N job SETUPs
                 |||                                     +N job DISTRIBUTIONs
                 |||                                     +N job COLLECT RESULTs
                 |||                                     +N job TERMINATIONs
                 |||
                 |||
                 SEQ____________________________________ CONCURRENCY IMMUNE PART

For more details on the impact of these add-on overhead costs, you may like to read this, or jump straight into an interactive GUI tool (referenced in the trailer of this post) for a live, quantitative, reality-based illustration of how small a speedup any amount of CPUs will bring to "expensively" distributed jobs, where the SPEEDUPs are shown to actually turn into SLOWDOWNs for any amount of N. Q.E.D.

Plus there are some more details on the actual constraints (threading-model related, hardware-domain related, NUMA specific) that altogether decide the resulting scheduling and the achievable speed-up of such a declared flow of execution.



Independently of the details of how a process may get accelerated by distributed-processing concurrency, the quality of measurement is what is discussed here.


Processes in C shall always use any high-resolution timer available on the system. As a fast mock-up example, one may:

#include <time.h>   /* provides struct timespec { time_t tv_sec; long tv_nsec; } */

struct timespec diff( struct timespec start, struct timespec end ) {
       struct timespec temp;
       if ( ( end.tv_nsec - start.tv_nsec ) <  0 ) {
             temp.tv_sec  = end.tv_sec  - start.tv_sec  - 1;
             temp.tv_nsec = end.tv_nsec - start.tv_nsec + 1000000000;
       } else {
             temp.tv_sec  = end.tv_sec  - start.tv_sec;
             temp.tv_nsec = end.tv_nsec - start.tv_nsec;
       }
       return temp;
}

// /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
       struct timespec                           start_ts, end_ts, duration_ts;
       clock_gettime( CLOCK_THREAD_CPUTIME_ID,  &start_ts );
//     clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &start_ts );
//     clock_gettime( CLOCK_MONOTONIC,          &start_ts );
// ____MEASURED-SECTION_START____________________
                ...
                ..
                .
// ____MEASURED-SECTION_END______________________
       clock_gettime( CLOCK_THREAD_CPUTIME_ID,  &end_ts );
//     clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &end_ts );
//     clock_gettime( CLOCK_MONOTONIC,          &end_ts );
       duration_ts = diff( start_ts, end_ts );
// \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

Processes in MATLAB shall always use tic; ...; delta = toc;

A MathWorks Technical Article argues quite clearly on this subject:

In summary, use tic and toc to measure elapsed time in MATLAB, because the functions have the highest accuracy and most predictable behavior. The basic syntax is

tic;
    ... % … computation …
    ..
    .
toc;

where the tic and toc lines are recognized by MATLAB for minimum overhead.
