
I have a loop over a vector of ordered integers. Inside the loop, I do some operations on each integer to determine whether it is kept or discarded for the rest of the program. After the loop, I need a vector containing the kept integers, also ordered.

Since the vector of integers is rather large (it can be around 50 billion elements), I'm using OpenMP to parallelize the loop, and I came up with the following code. To keep the good integers ordered, I use the static schedule and per-thread vectors, which are merged after the loop completes.

std::vector<unsigned> large_int_vector;
// vector is filled and ordered...

std::vector<std::vector<unsigned> > good_int_threads;
#pragma omp parallel
{
  int num_threads = omp_get_num_threads();
  int i_thread = omp_get_thread_num();
  #pragma omp single
  {
    good_int_threads.resize(num_threads);
  }
  #pragma omp for schedule(static)
  for (std::size_t i = 0; i < large_int_vector.size(); i++) {
    bool is_good_int = true;
    // some operations on large_int_vector[i]
    if (is_good_int)
      good_int_threads[i_thread].push_back(large_int_vector[i]);
  }
}

std::vector<unsigned> good_int;
for (unsigned i = 0; i != good_int_threads.size(); i++) {
  good_int.insert(good_int.end(),good_int_threads[i].begin(),good_int_threads[i].end());
}

This works correctly, but does not scale well. The reason is that most of the good integers (the exact fraction depends on the number of threads, but it is typically > 50% and can reach 100% with only 2 or 3 threads) are located near the beginning of large_int_vector.

More precisely, the work done in the for loop is to apply a sequence of operations to each integer. If a single operation fails, the integer is discarded and we can immediately move on to the next one. For an integer to be kept, on the other hand, it has to pass all the operations. Obviously, this induces a work imbalance, which explains the poor scaling.
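To make the imbalance concrete, here is a hypothetical shape for the per-element test (the real operations are not shown above; is_good and checks are placeholder names): a discarded integer exits at the first failing check, while a kept integer has to run through every check.

#include <functional>
#include <vector>

// Hypothetical sketch only: the real checks are not shown in this question.
// A kept integer must pass every check; a discarded one fails early,
// so good integers cost far more work than discarded ones.
bool is_good(unsigned x, const std::vector<std::function<bool(unsigned)>>& checks)
{
  for (const auto& check : checks) {
    if (!check(x))
      return false;  // early exit: discarded integers are cheap
  }
  return true;       // passed everything: kept integers are expensive
}

With schedule(static) and the good integers clustered at the front, the first threads run this expensive path far more often than the others.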

The size of good_int is typically about large_int_vector.size()/50, so around 1 billion elements.

How could I improve my code?

Castou
  • It is impossible to give you specific advice for your code beyond *try `schedule(...)`* and *you have false sharing on `good_int_threads`, try [this](https://stackoverflow.com/questions/43168661/openmp-and-reduction-on-stdvector)* without a [mcve]. – Zulan Jun 04 '18 at 07:41
  • @High Performance Mark: I can't see how schedule(dynamic) will give me an ordered vector at the end, which is the objective? Or am I missing something obvious? – Castou Jun 04 '18 at 07:55
  • @Castou, dynamic won't work. – Z boson Jun 04 '18 at 07:56
  • @Zulan, could you explain where the false sharing is? I'm confused; it seems to me that each thread can only access good_int_threads[i_thread] – Castou Jun 04 '18 at 07:57
  • @Castou I showed how to fill vectors ordered [here](https://stackoverflow.com/a/18671256/2542702). This will solve your false sharing problem and it will improve data locality because the memory will be allocated by each thread. – Z boson Jun 04 '18 at 08:00
  • False sharing is on accessing the internals of `good_int_threads[i_thread]` during `push_back` (which are likely on the same cache line as internals of neighboring vectors). – Zulan Jun 04 '18 at 08:00
  • @Zulan Ok, I see. I think I had no issue during my tests because the number of threads was too small, and likely only the first thread got good integers... – Castou Jun 04 '18 at 08:07
  • @Zboson Thanks for the link, I corrected the code. – Castou Jun 04 '18 at 08:29
  • @HighPerformanceMark, I don't see how `dynamic` can be used without breaking the order. – Z boson Jun 04 '18 at 08:54
  • @HighPerformanceMark, that's why it's an interesting question. How do you do load balancing and maintain order? – Z boson Jun 04 '18 at 09:56
  • Maybe you run two phases, first to get the good integers in any old order, second to sort them (see the sketch after these comments). But I don't have the time or equipment here to explore or validate this suggestion, and I'll just back quietly away from this one now. – High Performance Mark Jun 04 '18 at 11:19
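A rough sketch of that two-phase idea, assuming the input vector is sorted as stated in the question (is_good is a stand-in for the real checks): filter with dynamic scheduling in whatever order the threads produce, then sort the survivors, which restores the original order.

#include <algorithm>
#include <vector>

// Stand-in for the real per-element test (keeps roughly 1 in 50, as in the question).
static bool is_good(unsigned x) { return x % 50 == 0; }

std::vector<unsigned> filter_then_sort(const std::vector<unsigned>& large_int_vector)
{
  std::vector<unsigned> good_int;
  #pragma omp parallel
  {
    std::vector<unsigned> local;  // allocated by the thread that uses it
    // phase 1: dynamic schedule balances the load; order is ignored here
    #pragma omp for schedule(dynamic, 65536) nowait
    for (std::size_t i = 0; i < large_int_vector.size(); i++) {
      if (is_good(large_int_vector[i]))
        local.push_back(large_int_vector[i]);
    }
    #pragma omp critical
    good_int.insert(good_int.end(), local.begin(), local.end());
  }
  // phase 2: since the input was sorted, sorting the survivors restores the order
  std::sort(good_int.begin(), good_int.end());
  return good_int;
}

If the final sort over ~1 billion elements becomes the bottleneck, a parallel sort could be substituted.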

1 Answer


Allocate a vector for each thread, local to that thread, and run over chunks like this:

std::vector<unsigned> large_int_vec;
// filled and sorted as in the question...
std::vector<unsigned> good_int;            // final, ordered result

std::size_t n = large_int_vec.size();
std::size_t chunk = n/10;  // 10 chunks; tune for performance
#pragma omp parallel
{
  std::vector<unsigned> good_int_local;    // allocated by the thread that uses it
  for(std::size_t j = 0; j < n; j += chunk) {
    std::size_t n2 = (j + chunk) >= n ? n - j : chunk;  // size of the current chunk
    #pragma omp for nowait schedule(static)
    for (std::size_t i = 0; i < n2; i++) {
      bool is_good_int = true;
      // some operations on large_int_vec[j + i]
      if (is_good_int) good_int_local.push_back(large_int_vec[j + i]);
    }
    // merge the per-thread vectors in thread order to preserve the global order
    #pragma omp for schedule(static) ordered
    for(int i = 0; i < omp_get_num_threads(); i++) {
      #pragma omp ordered
      good_int.insert(good_int.end(), good_int_local.begin(), good_int_local.end());
    }
    good_int_local.clear();
  }
}

This should balance the load better and keep the order. You might be able to come up with a solution using schedule(static, chunk) that keeps the order without the outer loop, but it seems non-trivial to me. You might also be able to come up with a custom dynamic solution, but again it seems non-trivial to keep the order.
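If you do want dynamic scheduling, here is one sketch (untested on your workload; is_good is just a stand-in for your checks) that keeps the order by indexing the result buckets by chunk rather than by thread: whichever thread processes chunk c writes into results[c], and the buckets are concatenated in chunk order afterwards.

#include <algorithm>
#include <vector>

// Stand-in for the real per-element test (keeps roughly 1 in 50, as in the question).
static bool is_good(unsigned x) { return x % 50 == 0; }

std::vector<unsigned> filter_keep_order(const std::vector<unsigned>& input)
{
  const std::size_t n = input.size();
  const std::size_t chunk = 1 << 20;                   // tune for your workload
  const std::size_t num_chunks = (n + chunk - 1) / chunk;

  // One bucket per chunk: whichever thread handles chunk c fills results[c],
  // so concatenating the buckets in chunk order restores the global order.
  std::vector<std::vector<unsigned>> results(num_chunks);

  #pragma omp parallel for schedule(dynamic)
  for (std::size_t c = 0; c < num_chunks; c++) {
    const std::size_t begin = c * chunk;
    const std::size_t end = std::min(begin + chunk, n);
    for (std::size_t i = begin; i < end; i++) {
      if (is_good(input[i]))
        results[c].push_back(input[i]);
    }
  }

  std::vector<unsigned> good_int;
  for (const auto& bucket : results)                   // sequential, ordered merge
    good_int.insert(good_int.end(), bucket.begin(), bucket.end());
  return good_int;
}

Compared with the chunked static version above, this trades memory (all buckets are alive at once) for simpler load balancing and no per-chunk barriers; the false-sharing caveat from the comments applies to the bucket objects themselves, but with large chunks it should be minor.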

Z boson