
I have a while loop and want to parallelize it on 2 threads using OpenMP. The variables inside the loop don't depend on their values from the previous iteration, so I figured there has to be some way of parallelizing it. With 2 threads, two iterations of the while loop could run simultaneously, each performing its own calculations. The goal of this loop is to find the value of alfa, which is the step size in the conjugate gradient method for finding the optimal point.

I guess I have to somehow combine the alfa and alfaIter variables with OpenMP statements to make this parallel loop work, but I have no idea how.

    #pragma omp parallel num_threads(2)
    {
    while (alfaSet == false) {
        alfaIter++;
        alfa = pow(gamma, alfaIter - 1);

        b = 0;

        for (i = 0; i < dim; i++) {
            testX[i] = x[i] + alfa * d[i];
        }
        for (i = 0; i < dim; i++) {
            b += d[i] * g[i];
        }
        if (shanno(testX, dim) - shanno(x, dim) <= delta * alfa * b) {
            alfaIter = 0;
            alfaSet = true;
        }
    }
    }

EDIT 1: This implementation seems ok:

    #pragma omp parallel num_threads(alfaThreads)
    {
    int alfaIter = omp_get_thread_num();
    int step = omp_get_num_threads();
    double localAlfa = alfa;
    double *testX = (double *) malloc(dim * sizeof(double));
    while (!alfaSet) {
        #pragma omp barrier
        alfaIter += step;
        localAlfa = pow(gamma, alfaIter - 1);
        for (int i = 0; i < dim; i++) {
            testX[i] = x[i] + localAlfa * d[i];
        }
        if (func(testX, dim) - delta * localAlfa * b <= oldFunc) {
            #pragma omp critical
            {
                if (!alfaSet) {
                    alfaSet = true;
                    alfaIter = 0;
                    alfa = localAlfa;
                }
            }
        } 
    }
    free(testX);
    }

So after playing with this code for a while I figured there wasn't any synchronization, so threads didn't wait for each other and reached parts of the code in an unpredictable order. The OpenMP barrier syncs them now, and I always get the same number of iterations, plus a performance gain. However, the program sometimes crashes now. Deadlocks? How can I check what causes the crash, and how do I prevent it?

Here is the whole implementation of algorithm: https://gist.github.com/mazury/394adc82ab51ce36acfae297ae0555ce

  • As it stands you can't parallelise such a loop in a simple fashion. OpenMP requires that the runtime be able to distribute loop iterations across threads when the loop begins; as things stand, the system can't figure out how many iterations of the loop there will be. You could either revise the logic of your loop to meet OpenMP's requirements, or perhaps make an implementation using the explicit `task` construct. I think this matter has been covered in several Qs and As on SO, so your next step might be to do some further research. – High Performance Mark Jan 21 '19 at 17:45
  • Related: https://stackoverflow.com/q/42535647/620382 https://stackoverflow.com/q/41377464/620382 https://stackoverflow.com/q/47349885/620382 I suggest you read them carefully; together they should also answer your question. – Zulan Jan 22 '19 at 09:10
  • try the auto-parallelization option of your compiler. https://stackoverflow.com/questions/40088101/can-gcc-make-my-code-parallel – ron Jan 22 '19 at 19:17
  • the Intel *icc* compiler flag is *-Qparallel*, I think; for gcc maybe *-floop-parallelize-all* and *-ftree-parallelize-loops=n* – ron Jan 22 '19 at 19:22
  • the compiler can sometimes *auto-parallelize* your code if the loops are simple enough and written well enough. Auto-parallelization is done by the compiler, uses OpenMP, and does the work for you, but don't expect miracles. You will need to read the man page of the compiler version you are using to see what is available and what the syntax of the option is; it used to be just *-parallel* – ron Jan 22 '19 at 19:26

1 Answer


#pragma omp parallel runs the following block in parallel on several threads, so several instances of the loop run simultaneously. All of them fetch and update the global variables more or less simultaneously, and you cannot control what happens.

For instance, it is quite possible that alfaIter is modified in an uncontrolled way, leading to undefined behavior.

Here is how the first lines of your code might be executed by the processor:

1 read alfaIter into a local var (register)
2 var++
3 write the register var back to alfaIter
4 fetch alfaIter into a stack slot or register for the call to pow
5 call pow(...)

Let's say that these instructions are 1a 2a 3a 4a 5a in thread A and 1b 2b 3b 4b 5b in thread B.

Now what can be the actual order of execution?

Assume it is

1a 2a 3a 4a 5a 1b 2b 3b 4b 5b. 

The behavior will be as expected: pow is called in thread A with alfaIter=1 and in thread B with alfaIter=2.

But other orderings can lead to different behavior.

For instance

1a 1b (both local regs in threads A and B hold the initial value 0)
2a 3a (alfaIter=1 written back to memory by thread A)
2b 3b (alfaIter=1 written back to memory by thread B)
4a 4b 5a 5b (pow called by both threads with the same value alfaIter=1)

As any ordering is possible, the behavior of your loop is unpredictable.

A solution to make it predictable is to use atomic operations. They ensure that accesses to memory are sequential, so the behavior of the while loop will be as expected.

But this has a major drawback: atomic operations are slow, typically costing ~100 cycles on modern processors. This can slow down your code dramatically and make it much slower than the sequential version.

In general, the best approach is a for loop, but that does not seem to be possible in your case.

What I would suggest is to make most variables local, to run parallel threads that each increment alfaIter by 2 (or, in general, by the number of threads), and to use global operations only for the termination condition.

Sample code:

#pragma omp parallel num_threads(2)
{
  int i;
  int alfaIter = omp_get_thread_num();
  int step = omp_get_num_threads();
  double localAlfa;
  double testX[dim], b;
      // and maybe d[] and g[], but I do not understand what they do
  while (alfaSet == false) { // no problem to read a global var
    alfaIter += step;
    localAlfa = pow(gamma, alfaIter - 1);
    b = 0;
    for (i = 0; i < dim; i++) {
        testX[i] = x[i] + localAlfa * d[i];
    }
    for (i = 0; i < dim; i++) {
        b += d[i] * g[i];
    }
    if (shanno(testX, dim) - shanno(x, dim) <= delta * localAlfa * b) {
      #pragma omp critical
      if (!alfaSet) { // you can do safe operations on globals here
        alfaIter = 0;
        alfaSet = true;
        alfa = localAlfa; // publish the accepted step to the global alfa
      }
    }
  }
}

It is untested, but it can be a starting point.

Alain Merigot
  • Thank you. I just quickly tested it. With this implementation the program is unstable – sometimes it doesn't find alfa at all, and the alfa variable always has a different final value; vectors d and g are used somewhere else, so I can't make them local; here is the full code: https://gist.github.com/mazury/09a3bca73c5c70269283722c7f4a56e6 – Adam Mazurkiewicz Jan 21 '19 at 20:10
  • Your implementation is incorrect. Line 59 *must* be `int alfaIter = omp_get_thread_num();`. This way thread 0 will do alfaIter=0, 2, 4, etc. and thread 1 alfaIter=1, 3, 5... – Alain Merigot Jan 21 '19 at 20:21
  • I understand. However, the corrected code still has the same problems. – Adam Mazurkiewicz Jan 21 '19 at 20:26
  • I do not see the problem. The omp code seems correct, though it will find the first solution with alfaIter, and that solution is not necessarily the one with the lowest alfaIter if several solutions exist and the threads run at different speeds. BTW, your code can be optimized: move the b computation out of the loop, since it is constant during the while. Also, be aware that mixing int and double slows down your code; int-to-double conversions are more expensive than a multiplication. Change all int constants to double (1 -> 1.0). – Alain Merigot Jan 21 '19 at 20:55
  • Unfortunately the code is incorrect. As per the memory model of OpenMP you **must not** read a location (`alfaSet`) in an unprotected way that is written by another thread. It is not sufficient to protect only writing the location. – Zulan Jan 22 '19 at 09:05
  • To solve this, you must make a manual loop break which reads `alfaSet` into a private variable before checking it. See also https://stackoverflow.com/a/47416607/620382 – Zulan Jan 22 '19 at 09:08
  • @Zulan You are right, but for this specific program, I do not think it is a problem, unless I am missing something. The worst that can happen is that alfaSet is read (as false) while being set by another thread. This can occasionally lead to an extra iteration, but it should not modify any global variable or change the behavior of the program. And it saves an atomic access. – Alain Merigot Jan 22 '19 at 09:26
  • As per OpenMP's memory model, the temporary view of a thread on `alfaSet` may continue to appear `false` until the implicit flush caused by the critical section. At that point, the thread would have found a solution itself. – Zulan Jan 22 '19 at 09:38