I have a loop over a vector of ordered integers. Inside the loop, I apply some operations to each integer to determine whether it is kept or discarded for the rest of the program. After the loop, I need a vector containing the kept integers, still in order.
Since the vector of integers is rather large (it can contain around 50 billion elements), I am using OpenMP to parallelize the loop, and I came up with the code below. To keep the good integers ordered, I use the static schedule and per-thread vectors, which are merged after the loop completes.
#include <cstddef>
#include <omp.h>
#include <vector>

std::vector<unsigned> large_int_vector;
// filled elsewhere with ordered values...

std::vector<std::vector<unsigned> > good_int_threads;

#pragma omp parallel
{
    int num_threads = omp_get_num_threads();
    int i_thread = omp_get_thread_num();

    // one output vector per thread (implicit barrier at the end of single)
    #pragma omp single
    {
        good_int_threads.resize(num_threads);
    }

    #pragma omp for schedule(static)
    for (std::size_t i = 0; i < large_int_vector.size(); i++) {
        bool is_good_int = true;
        // some operations on large_int_vector[i]
        if (is_good_int)
            good_int_threads[i_thread].push_back(large_int_vector[i]);
    }
}

// concatenate the per-thread vectors in thread order, which keeps the result sorted
std::vector<unsigned> good_int;
for (std::size_t i = 0; i != good_int_threads.size(); i++) {
    good_int.insert(good_int.end(), good_int_threads[i].begin(), good_int_threads[i].end());
}
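As a sanity check (outside any timed region), the merged result can be verified to still be globally sorted: schedule(static) without a chunk size gives each thread a single contiguous block of indices, assigned in thread order, so concatenating the per-thread vectors in that order preserves the original ordering. A minimal sketch of the check:

#include <algorithm>
#include <cassert>

// good_int is the concatenation, in thread order, of contiguous sorted blocks,
// so it should itself be sorted from end to end.
assert(std::is_sorted(good_int.begin(), good_int.end()));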
This works correctly, but it does not scale well. The reason is that most of the good integers are located at the beginning of large_int_vector (the exact share depends on the number of threads, but it is typically above 50% and can reach 100% when using only 2 or 3 threads).
More precisely, the work done in the for loop is to apply a sequence of operations to each integer. As soon as one of them fails, the integer is discarded and we can move straight on to the next integer; for an integer to be kept, it has to pass all of the operations. Kept integers are therefore much more expensive than discarded ones, and since they are concentrated at the front of the vector, the threads handling the first chunks do far more work than the others. This work imbalance explains the poor scaling.
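To make the imbalance concrete: schedule(static) with the default chunk size hands each thread roughly one contiguous block of the iteration space. The sketch below only mimics that partitioning (the OpenMP runtime may round the block boundaries slightly differently); it shows that thread 0 ends up with the front of the vector, where nearly all of the expensive, kept integers live.

#include <cstddef>
#include <cstdio>
#include <omp.h>

// Approximate per-thread iteration ranges under schedule(static):
// one contiguous block [begin, end) per thread, assigned in thread order.
std::size_t n = large_int_vector.size();
int num_threads = omp_get_max_threads();
for (int t = 0; t < num_threads; ++t) {
    std::size_t begin = (n * t) / num_threads;
    std::size_t end = (n * (t + 1)) / num_threads;
    std::printf("thread %d -> indices [%zu, %zu)\n", t, begin, end);
}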
The size of good_int is typically about large_int_vector.size()/50, so around 1 billion elements.
How could I improve my code?