
I have this serial code:

std::vector<T> input, output;
//populate input
for (size_t i=0 ; i < input.size() ; i++){
  output.push_back(/* something using input[i] */);
}

Note that it's crucial that output[i] is computed using input[i]. Now I want to parallelize it. A straightforward solution is:

std::vector<T> input, output;
//populate input
#pragma omp parallel for schedule(dynamic) ordered
for (size_t i=0 ; i < input.size() ; i++){
  #pragma omp ordered
  output.push_back(/* something using input[i] */);
}

However, this could be very inefficient: the ordered clause forces the annotated block to execute sequentially in iteration order, which means threads must synchronize on every iteration, and small chunk sizes make the overhead even worse. Another possible solution is to populate each chunk separately and then merge them in order:

std::vector<T> input, output;
//populate input
#pragma omp parallel
{
  std::vector<T> myOutput;
  std::vector<int> myIndexes;
  #pragma omp for schedule(static,1)  //NOTICE STATIC!
  for (size_t i=0 ; i < input.size() ; i++){
    myOutput.push_back(/* something using input[i] */);
    myIndexes.push_back(i);
  }
  #pragma omp for ordered schedule(static)
  for(int i=0 ; i<omp_get_num_threads() ; i++){
    #pragma omp ordered
    {
      std::cout<<"Hello from thread "<<omp_get_thread_num()<<": ";
      for(size_t j = 0; j < myIndexes.size(); j++)
        std::cout<<myIndexes[j]<<" ";
      std::cout<<std::endl;
    }
  }
}

Suppose input.size() == 10 and we have 8 cores. This is what I want printed:

Hello from thread 0: 0 1 
Hello from thread 1: 2 3 
Hello from thread 2: 4
Hello from thread 3: 5 
Hello from thread 4: 6 
Hello from thread 5: 7 
Hello from thread 6: 8 
Hello from thread 7: 9

But this is what is actually printed:

Hello from thread 0: 0 8 
Hello from thread 1: 1 9 
Hello from thread 2: 2  
Hello from thread 3: 3 
Hello from thread 4: 4 
Hello from thread 5: 5 
Hello from thread 6: 6 
Hello from thread 7: 7 

The code in this question could solve my case, but that feels like cheating (it doesn't actually use #pragma omp parallel for).

Is it possible to obtain my desired behavior using only openmp?

justHelloWorld

1 Answer


The simplest solution is to use operator[] instead of push_back.

std::vector<T> input, output;
//populate input
output.resize(input.size());
#pragma omp parallel for
for (size_t i=0 ; i < input.size() ; i++){
  output[i] = /* something using input[i] */;
}

Note that this assumes T is DefaultInsertable (a requirement of the single-argument resize). If it is not, you can pass a specific dummy value as a second argument to resize.

BTW: I do not understand your expected output. Why would you expect an index to appear multiple times on different threads?

Edit:

The OpenMP specification says the following about static scheduling (Table 2.5):

the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

So the actual result is conforming.

Zulan
  • I'm sorry, I've edited my desired output. However, I think that your solution is feasible. I've already thought about it and discarded it, but now that I read it again it's possible. However the `static` and desired output issue remains. – justHelloWorld Jan 07 '17 at 10:01
  • I added an explanation regarding the static scheduling. – Zulan Jan 07 '17 at 18:01
  • yes I've read that quote, but I hoped that someone came up with a different scheduling policy than round-robin. I think that my solution is the only one at this point. – justHelloWorld Jan 07 '17 at 18:14