5

I'm faced with parallelizing an algorithm which, in its serial implementation, examines the six faces of a cube of array locations within a much larger three-dimensional array. (That is: select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.)

Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):

do n1=nlo,nhi
  do o1=olo,ohi          
    if (somecondition(n1,o1) .eq. .TRUE.) then
       retval =.TRUE.
       RETURN
    endif    
  end do 
end do 

Or C pseudocode:

for (n1=nlo; n1<=nhi; n1++) {
  for (o1=olo; o1<=ohi; o1++) {
    if (somecondition(n1,o1) != 0) {
      return true;
    }
  }
}

There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.

What I think would be best is to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, and with the same goal as the serial algorithm: as soon as somecondition() becomes True, execution among all the threads must immediately stop, and a value of True must be set in a shared location.

What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.

Are there message passing mechanisms in OpenMP?

EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?

Rob Perkins
    Unfortunately, OpenMP isn't well suited for this kind of task. You can't have any exit points inside a parallel for loop for obvious reasons. So you'll have to consider a different approach to what you're doing. – Mysticial Mar 31 '12 at 01:42
  • Hmm. Any alternatives using Intel Fortran? I'd hate to have to build a C-based static lib for just one function. – Rob Perkins Mar 31 '12 at 02:33
  • There are most likely strong parallels between Intel's C++ and Fortran compiler suites. – Rob Perkins Mar 31 '12 at 21:30
  • Is evaluating `somecondition(n1, o1)` a time-consuming operation? – Tudor Apr 01 '12 at 09:06
  • No, it's two comparisons in an OR relationship. The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements. – Rob Perkins Apr 02 '12 at 01:47
  • I only see one exit point. What do you mean by 'multiple'? – Garen Apr 08 '12 at 02:47
  • The algorithm iterates over six such nested loops. – Rob Perkins Apr 09 '12 at 20:49
  • If there are just 6 threads, I think it would probably be best to just let them all run to completion, setting 6 separate booleans, and then check if any of those are .TRUE. -- OpenMP would be the easiest way to do that -- just create an extra dummy loop. – laxxy Apr 12 '12 at 17:31
  • btw: `if(somecondition(n1,o1) .eq. .TRUE.)` is not legal Fortran (ifort allows it, but e.g. gfortran does not) -- either use `if(somecondition(n1,o1))`, or `if(somecondition(n1,o1) .EQV. .true.)` – laxxy Apr 17 '12 at 14:53
  • Have you been able to figure this out? – laxxy Apr 26 '12 at 15:42
  • I've been sidelined by an aggressively annoying VB6 problem involving some idiot's neglect of `option explicit` and customer errors. It's not comforting that the idiot was me. Stay tuned.... – Rob Perkins Apr 30 '12 at 20:17
  • I'm still working through my backlog, but this question is still active and I still plan to try the suggestions y'all have made. – Rob Perkins Jul 06 '12 at 01:17

5 Answers

3

Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking of this because of your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." A design that parallelizes over the array elements is also flexible in terms of the number of threads. Unless there is a reason that the array elements have to be checked in some order?

It seems that the portion you are showing us doesn't take that long to execute, so making it take less clock time via parallelism might not be easy: there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
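
A bare sketch of that structure (every name here -- bigarray, nx/ny/nz, found, check_element -- is a hypothetical placeholder; check_element stands in for the serial six-face test around one element):

logical :: found(nx, ny, nz)
integer :: i, j, k
!$omp parallel do schedule(dynamic) private(i, j)
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         ! run the serial algorithm unchanged, one element per iteration
         found(i, j, k) = check_element(bigarray, i, j, k)
      end do
   end do
end do
!$omp end parallel do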

M. S. B.
  • That's also been my thought; that's why I added the edit with the normative loop count cases; all the work units together might iterate between 600 and 1000 times. What I've found, though, is that the compiler won't parallelize the outer-outer loops I haven't shown. A "write(*,*)" statement appears to interfere and the compiler chooses a serial implementation in every case. – Rob Perkins Apr 09 '12 at 20:54
  • Are you using OpenMP or auto-parallelization? Your statement about the "compiler chooses a serial implementation" seems strange to me. If in OpenMP you request a parallel loop, you get that, even if it introduces a bug. IO might behave in strange ways ... OpenMP doesn't specify the behavior of IO. A standard example is to include a write statement and see the threads output in random order. You could protect the write statement or you could store the information in an array and move the write statement out of the parallel region. Parallelizing the outer loop is probably the better design. – M. S. B. Apr 10 '12 at 01:24
  • My current thinking is to try and rewrite the short loops so that they can be auto-vectorized, and use OpenMP more coarsely. But the fact is that right now the center coarse loop is PARALLEL DO and I'm still not seeing execution time improvements. – Rob Perkins Apr 10 '12 at 15:39
  • I think this is good advice, and the way to go; one problem is that you can't have a `return` statement in a parallel loop -- see my answer for a possible workaround: http://stackoverflow.com/a/10129497/1024514 – laxxy Apr 12 '12 at 18:33
1

You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each thread check a shared variable that holds the status of the ending condition, and use it to decide whether to break out of the loops. Obviously this adds overhead, so if you decide to take this approach I would suggest a few things:

  1. Use atomics to check the ending condition; this avoids expensive memory flushing, since only the variable in question is flushed. Consider moving to OpenMP 3.1, which adds support for some new atomic operations.

  2. Check infrequently, perhaps once per outer iteration. You should only be parallelizing cases large enough to overcome the overhead of multithreading.

  3. This one is optional, but you can try adding compiler hints: e.g., if you tell the compiler that a certain condition is expected to be false most of the time, it will optimize the code accordingly.

Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.

On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.

There are hackish workarounds, like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts; however, this is rather tricky to get right and would make your code extremely unportable.

Update in response to comment:

Try something like this:

char shared_var = 0;
#pragma omp parallel
{
  int n1, o1;              //loop counters, private to each thread
  char private_var = 0;    //thread-local snapshot of the done marker
  //you should have some method for setting loop ranges for each thread
  for (n1=nlo; n1<=nhi; n1++) {
    for (o1=olo; o1<=ohi; o1++) {
      if (somecondition(n1,o1)!=0) {
        #pragma omp atomic write
        shared_var = 1;  //done marker, this will also trigger the other break below
        break;           //could instead use goto to break out of both loops in 1 go
      }
    }
    #pragma omp atomic read
    private_var = shared_var;
    if (private_var!=0) break;
  }
}
Jason
  • Portability wasn't specified; I'm happy to use Windows' WaitForSingleObject and its cousins, but I've never attempted that in a Fortran program before. Which atomic operations in OpenMP 3.1 are useful? – Rob Perkins Apr 09 '12 at 20:51
1

One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
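
For instance, a minimal sketch of the first option -- the bounds arrays nlo(6)/nhi(6)/olo(6)/ohi(6) and the extra work-unit argument to somecondition are stand-ins for however the six loops actually differ:

logical :: found(6)
integer :: w, n1, o1
!$omp parallel do schedule(dynamic) private(n1, o1)
do w = 1, 6
   found(w) = .FALSE.
   do n1 = nlo(w), nhi(w)
      do o1 = olo(w), ohi(w)
         ! each unit runs to completion; each writes a distinct element, so no race
         if (somecondition(w, n1, o1)) found(w) = .TRUE.
      end do
   end do
end do
!$omp end parallel do
retval = any(found)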

Or, you can follow the good advice from @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second-outermost loop (the largest one within the parallel part), and EXIT that loop -- something like

retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
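! note: with default(private), NN, nlo, nhi, olo and ohi are undefined inside
! the region unless they are named constants; if they are variables, use
! default(firstprivate), or list them in a firstprivate clause, instead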
do k=1,NN
   if(.not. retval) then
      outer2: do j=1,NN
         do i=1,NN
            ! --- your loop #1
            do n1=nlo,nhi
               do o1=olo,ohi
                  if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
                     retval =.TRUE.
                     exit outer2
                  endif
               end do
            end do
            ! --- your loops #2 ... #6 go here
         end do
      end do outer2
   end if
end do 
!$omp end parallel do

[edit: the if statement is there presuming that you need to find out whether there is at least one element like that in the big array. If you need to determine the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]

As a separate point, you might also want to check whether it is a good idea to go through the innermost loop with some larger step (depending on float size), compute a vector of logicals on each iteration, and then aggregate the results -- e.g. something like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
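
A sketch of that idea, assuming somecondition can be rewritten as an ELEMENTAL logical function of the array values (step and oend are hypothetical names):

integer, parameter :: step = 8     ! try 4, 8 or 16
integer :: oend
do o1 = olo, ohi, step
   oend = min(o1 + step - 1, ohi)  ! the last chunk may be shorter
   ! count() aggregates a whole vector of logicals per iteration,
   ! giving the compiler a chance to vectorize somecondition
   if (count(somecondition(x(o1:oend, n1, k))) > 0) then
      retval = .TRUE.
      exit outer2
   end if
end do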

laxxy
  • Hmm... this makes me thoughtful. I attempted forms of this without a lot of success, but I don't think I specifically tried this form, and I didn't specify shared and private variables. My assumption was that OpenMP defaults all outer variables to shared; not true? – Rob Perkins Apr 12 '12 at 22:40
  • Also, do you think there is any need to define a critical section for the `retval=.TRUE.` line? – Rob Perkins Apr 12 '12 at 22:40
  • I think the default is to share all except the parallel loop counter. So you need to make other loop counters private (or firstprivate), and possibly any other variables that may be set inside the loops. I am not really an OpenMP guru but I don't see any potential data race here, as the retval can only be set to TRUE, and once it is, we are done -- worst case, a couple extra loops would run, I would think. I just tested a small example of this, and the things do speed up quite a bit this way. If you can make a compileable minimal example, it may be easier to suggest what else could work. – laxxy Apr 13 '12 at 16:12
  • BTW, `schedule(dynamic)` /is/ essential for this to work; it is not the default. – laxxy Apr 13 '12 at 16:15
  • That's a nuance I hadn't seen. Mind if I edit your code sample to showcase the actual wider algorithm? – Rob Perkins Apr 13 '12 at 20:31
  • sure, please go ahead :) schedule(static,1) or the like might also work, but the default would allocate large chunks at the beginning, and hence the thing would work at the speed of the slowest chunk. – laxxy Apr 14 '12 at 02:47
  • looking further at the OMP specification, you can do `!$omp flush(retval)` after assignment and before the test to make sure all threads see the new value. in my test, it did not affect the performance either way though. – laxxy Apr 14 '12 at 17:49
  • OK, I've edited the code a bit, thanks. Basically what you're seeing is that the six inner loops are serial at this point and the processing terminates the innermost iteration when any loop gets a match. One thing I'm thinking to do is reconstruct each of those inner loops so that they can be vectorized on the SSE instructions. I'll be testing all these recommendations tomorrow. – Rob Perkins Apr 17 '12 at 02:37
  • I don't see your edits :( It does definitely seem to me like you should be looking at parallelizing at the outermost level though. For the SSE vectorization, see the last paragraph of my answer. Try different step sizes, like 4, 8 or 16. Also, I think for each of the individual 6 loops it may very well be faster to let each run to completion (e.g. accumulating counts), and check the condition before starting the next one. Finally, for this type of code compiler versions seem to matter: sometimes older ones (11.0/11.1) work better, sometimes newer; 12.0 tends to be slowest. – laxxy Apr 17 '12 at 14:10
  • Just noticed your comment on the `write` statement in @m-s-b's answer. Try to get rid of that! :) Why do you need it there? Is the array so big that it is impossible to save the results to output them all after the end of the loop? Also, I think it's not the `write` itself that is the problem, but that it introduces some sort of synchronization -- I have working code that has `write`'s within OMP loops and they parallelize just fine, but there loop count is low, and the loops themselves very expensive. Perhaps you can try not doing `write`'s on each iteration? – laxxy Apr 17 '12 at 14:21
  • might check this question also (regarding some differences in compiler versions; i have just recently seen this again): http://stackoverflow.com/questions/8893192/puzzling-performance-difference-between-ifort-and-gfortran/8995589#8995589 – laxxy Apr 17 '12 at 14:22
  • Edits are still in queue, I think. I have the `write` in the outermost loop to see if it's parallelizing at all. It wasn't. But, that could be due to the defaults on `!$omp parallel do`; I have yet to check your recommendations against what I have done so far. That's today! :-) – Rob Perkins Apr 17 '12 at 16:42
  • This has languished long enough. I'm going to mark this as "the answer" and take this advice, @laxxy. I just hope I can get to it soon. – Rob Perkins Dec 03 '12 at 23:26
1

I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.

High Performance Mark
0

A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers on these local variables into a shared overall result.
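
In OpenMP this maps directly onto a reduction clause. A minimal sketch using the loop bounds and somecondition from the question -- there is no early exit; each thread runs its share to completion on a private copy of retval, and the copies are OR-ed together at the end:

retval = .FALSE.
!$omp parallel do reduction(.or.:retval) private(o1)
do n1 = nlo, nhi
   do o1 = olo, ohi
      retval = retval .or. somecondition(n1, o1)
   end do
end do
!$omp end parallel do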

haraldkl