
I have a recursive function that searches through all tuples. The full code is here. I would like to be able to specify the depth in the tree at which to invoke parallelization with OpenMP. An outline of what I am trying is

int main(int argc, char *argv[])
{
    omp_set_num_threads(2);
    backtrack(size, alphabet, tuple, 0, threading_depth);
}

void backtrack(unsigned size, unsigned alphabet, unsigned *tuple, unsigned ell, unsigned t_depth)
{
    unsigned i, j;

    if (ell == size)
    {
        #pragma omp critical
        {
            fprintf(stdout, "solution from thread #%d = ", omp_get_thread_num());
            for (i = 0; i < size; i++)
                fprintf(stdout, "%3d ", tuple[i]);
            fprintf(stdout, "\n");
        }
    }
    else
    {
        #pragma omp parallel for if(ell == t_depth) default(none) shared(alphabet,tuple,size,ell,t_depth) private(i,j)
        for (i = 0; i < alphabet; i++)
        {
            unsigned *tuple_to_send;
            if (ell == t_depth)
            {
                unsigned *local_tuple = (unsigned *) calloc(size, sizeof(unsigned));
                for (j = 0; j < ell; j++)
                    local_tuple[j] = tuple[j];
                tuple_to_send = local_tuple;
            }
            else
            {
                tuple_to_send = tuple;
            }

            tuple_to_send[ell] = i;
            backtrack(size, alphabet, tuple_to_send, ell + 1, t_depth);
        }
    }
}

I am running this on

$ uname -a
Linux 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux 

which has a dual core processor (I have successfully run OpenMP code with two threads before). I compile with

gcc -O3 -fopenmp -o selective_threading_not_working selective_threading_not_working.c

When I run the code I expect to see something like

$ selective_threading_not_working 
solution from thread #0 =   0   0   0 
solution from thread #0 =   0   0   1 
solution from thread #0 =   0   0   2 
solution from thread #0 =   0   1   0 
solution from thread #0 =   0   1   1 
solution from thread #0 =   0   1   2 
solution from thread #0 =   0   2   0 
solution from thread #0 =   0   2   1 
solution from thread #0 =   0   2   2 
solution from thread #1 =   1   0   0 
solution from thread #1 =   1   0   1 
solution from thread #1 =   1   0   2 
solution from thread #1 =   1   2   0 
solution from thread #1 =   1   1   0 
solution from thread #1 =   1   1   1 
solution from thread #1 =   1   1   2 
solution from thread #1 =   1   2   1 
solution from thread #1 =   1   2   2 
solution from thread #0 =   2   0   0 
solution from thread #0 =   2   0   1 
solution from thread #0 =   2   0   2 
solution from thread #0 =   2   2   0 
solution from thread #0 =   2   2   1 
solution from thread #0 =   2   2   2 
solution from thread #0 =   2   1   0 
solution from thread #0 =   2   1   1 
solution from thread #0 =   2   1   2 

but what I see is

$ selective_threading_not_working 
solution from thread #0 =   0   0   0 
solution from thread #0 =   0   0   1 
solution from thread #0 =   0   0   2 
solution from thread #0 =   0   1   0 
solution from thread #0 =   0   1   1 
solution from thread #0 =   0   1   2 
solution from thread #0 =   0   2   0 
solution from thread #0 =   0   2   1 
solution from thread #0 =   0   2   2 
solution from thread #0 =   1   0   0 
solution from thread #0 =   1   0   1 
solution from thread #0 =   1   0   2 
solution from thread #0 =   1   2   0 
solution from thread #0 =   1   1   0 
solution from thread #0 =   1   1   1 
solution from thread #0 =   1   1   2 
solution from thread #0 =   1   2   1 
solution from thread #0 =   1   2   2 
solution from thread #0 =   2   0   0 
solution from thread #0 =   2   0   1 
solution from thread #0 =   2   0   2 
solution from thread #0 =   2   2   0 
solution from thread #0 =   2   2   1 
solution from thread #0 =   2   2   2 
solution from thread #0 =   2   1   0 
solution from thread #0 =   2   1   1 
solution from thread #0 =   2   1   2 

I have found suggestions to use omp_set_nested(1), but this had no effect.

  • Take a look at [this question](http://stackoverflow.com/q/36808397/620382) and the linked code in the update. For specific help I suggest you updated your question to include an [mcve] and your computer specifications / `OMP_NUM_THREADS` settings. Please make sure you clearly describe the expected and the observed behavior. – Zulan Jun 13 '16 at 21:46
  • Thanks Zulan. I have made changes to my post as requested. I have read the link to the question you posted and I will comment and post more source code tomorrow. In short: the post's solution works but I would like to fully understand why the code I post does not work. I am sure that as I do more OpenMP programming I will encounter similar, but more complex issues again and would like to understand the if clause in OpenMP better. – brett stevens Jun 14 '16 at 02:33
  • The update makes it much more clear and possible to answer. Nevertheless please note your code is not easily verifiable, because a couple of definitions and headers are missing. Also `clang-format` is your friend if you feel lazy about formatting. It's unnecessary hard to read code with seemingly random indentation. – Zulan Jun 14 '16 at 08:39
  • Zulan, In the full code that I link at the top of my post I cannot find any missing definitions. I wanted the code given in the post itself to be more abbreviated, to give the idea, but I can put the entire code there if desired. In the linked code I do not think that there are any inconsistent indentations but when I copied this code into the post I lost the formatting. I went to edit my post and looked for clang-format but I could not find it even in the advanced help. Can you please point me towards how to invoke clang-format from within the post editor. – brett stevens Jun 14 '16 at 12:41
  • Sorry, I missed the gist link, no worries. `clang-format` is an external tool you can run on your machine, not integrated in the SO-editor (actually that sounds like a nice idea). It is one of many tools and IDEs that help you for format code consistently. Don't despair: writing "the perfect" stack overflow question can be challenging - you did a good job for a first question. – Zulan Jun 14 '16 at 14:14
  • @Zulan, My code was well formatted when I pasted it into my post. Is there a way to keep the formatting when I do this so I don't have to manually reformat it? – brett stevens Jun 14 '16 at 15:05
  • Usually it should just work to paste it and then press ctrl+k. From the gist I think your code mixes spaces and tabs for indentation, which often ends up being shown inconsistently depending on the editor. – Zulan Jun 14 '16 at 15:35

1 Answer

The if clause in the omp parallel doesn't work like you might expect. First of all:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region [...]. The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.

This is independent of the if clause. The if clause only controls the number of threads: it forces the team size to one when the expression evaluates to false.

So technically, your approach is working, and the execution is indeed done in parallel. You can verify that by printing the OS thread ID (e.g. gettid).

Nevertheless I would not recommend doing it like that, as each inner parallel region incurs some overhead. For instance, it seems that GCC's OpenMP runtime doesn't reuse the OS threads when encountering the inner parallel region. Usually with DFS you want to avoid any overhead of that kind.

You also have a load imbalance in that case: with three subtrees and two threads, the work can only be split 1/3 vs 2/3. In non-obvious DFS trees with cutoffs this can become a huge issue.

Therefore I would recommend a task-based approach: use tasks recursively, but after some recursion depth switch to a completely serial version of the recursive function. It is actually pretty straightforward, although you do get some code duplication:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

void backtrack_serial(unsigned size, unsigned alphabet, unsigned* tuple, unsigned ell)
{
    unsigned i;
    if (ell == size)
    {
        #pragma omp critical
        {
            fprintf(stdout, "solution from thread #%d = ", omp_get_thread_num());
            for (i = 0; i < size; i++)
                fprintf(stdout, "%3d ", tuple[i]);
            fprintf(stdout, "\n");
        }
    }
    else
    {
        for (i = 0; i < alphabet; i++)
        {
            tuple[ell] = i;
            backtrack_serial(size, alphabet, tuple, ell + 1);
        }
    }
}

void backtrack(unsigned size, unsigned alphabet, unsigned* tuple, unsigned ell, unsigned t_depth)
{
    unsigned i;
    for (i = 0; i < alphabet; i++)
    {
        if (ell < t_depth)
        {
            /* i and tuple are firstprivate in the task by default,
               so each task works on its own copy of the prefix. */
            #pragma omp task
            {
                unsigned j;
                unsigned* local_tuple = (unsigned*)calloc(size, sizeof(unsigned));
                for (j = 0; j < ell; j++)
                    local_tuple[j] = tuple[j];
                local_tuple[ell] = i;
                backtrack(size, alphabet, local_tuple, ell + 1, t_depth);
                free(local_tuple);
            }
        }
        else
        {
            tuple[ell] = i;
            backtrack_serial(size, alphabet, tuple, ell + 1);
        }
    }
    #pragma omp taskwait
}

int main(int argc, char* argv[])
{
    unsigned tuple[3];
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single
        backtrack(3, 3, tuple, 0, 1);
    }
    return 0;
}
– Zulan
  • Thanks Zulan. First, I see that using v=3 was a bad example when my machine only has two cores. The real-life, more complicated, version of the code will not likely encounter the profound load imbalance of my example; there will be more cores and a much higher number of parallel regions. – brett stevens Jun 14 '16 at 12:48
  • in the example you give with tasks, all nodes of the DFS tree above t_depth get their own task. In my real-life version of the code, I am doing the DFS in a very large tree which has depth and degrees about 20 or a bit more. My intention was to run serially to a small, fixed, depth and then spawn each child node at this level in its own thread which would proceed down its subtree serially. This raises two questions: – brett stevens Jun 14 '16 at 13:04
  • 1) Once these have been spawned I do not need the node of the DFS tree that called them to wait, so is it safe to get rid of the `#pragma omp taskwait`? 2) Does the fact that each node of the DFS tree at levels t_depth and **lower** (towards the root) has its own task mean I have created a lot of tasks that do not do very much, so I am wasting the processing power of the cores assigned those tasks? – brett stevens Jun 14 '16 at 13:04
  • Finally, in your explanation of my if clause, am I understanding correctly: when `ell < t_depth`, since the if clause evaluates to false, all work is given to thread 0 (this is what I want). Then when `ell = t_depth` the different `i` from the for loop are given to different threads. When each of those threads recursively calls `backtrack` it is still processed by the thread that called it _until_ it encounters the `#pragma omp parallel for if` and now, since the if clause evaluates to false again, it collapses the parallelization and hands everything back to thread 0. – brett stevens Jun 14 '16 at 13:45
  • 1) Removing the `taskwait` is going to make it more difficult to `free(local_tuple)`. While I think the standard doesn't say anything about it, in practice I could imagine there being an implicit `taskwait` at the end of the parent task, so I wouldn't necessarily assume removing the wait improves the performance. – Zulan Jun 14 '16 at 13:47
  • 2) The tasks are quite lightweight. There can be many more tasks than threads - so there will be no harmful oversubscription of CPU resources. Of course there is some overhead - that means you will have to carefully tune `t_depth` to be large enough to avoid load-imbalances and small enough to avoid excessive overhead. – Zulan Jun 14 '16 at 13:51
  • Finally, no: in the inner encounter of the `#pragma omp parallel for if`, it is not given back to the original *thread 0*. The current thread *becomes* the *thread 0* of a new team. It is still a different thread than the original *thread 0*. You can confirm that by looking at the OS thread id. Interestingly, GCC's OMP runtime does something unexpected and has even more than two operating system threads (all with the thread number 0 in their current teams). Intel's OMP runtime uses two OS threads. – Zulan Jun 14 '16 at 13:59
  • Two more short questions: 1) when a `#pragma omp task` is inside a for loop, the compiler knows that the different iterations of the for loop can be split up as different tasks? 2) how different would `if (ell == t_depth)` be instead of `if (ell < t_depth)`? – brett stevens Jun 14 '16 at 20:32
  • 1) By means of the `#pragma`, you are telling the compiler this, therefore it knows. It is your responsibility to ensure it is correct to do so. 2) In my `task` code, it would jump to the serial version immediately, thus not parallelizing at all. You would have to handle the three cases separately. – Zulan Jun 14 '16 at 21:21