Your algorithm is O(n^2) in total work done. Using k threads, that at best makes it O(n^2/k). std::sort is O(n lg n). Unless k is on the order of n / lg n, adding more threads won't keep up.
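To put numbers on it: at n = 1,000,000, n lg n is about 2·10^7 while n^2 is 10^12, so the quadratic algorithm would need roughly n / lg n ≈ 50,000 threads just to break even with a single-threaded std::sort.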
And even if you did have piles of threads, NUMA on most mega-thread systems means your memory is going to be a serious pain. The threads aren't working against the same memory from one cycle to the next; in fact they constantly pass data back and forth.
An example of a way that might speed up your work compared to a simple std::sort would be to split the data into K pieces, std::sort each of the K pieces, then do a parallel merge of those pieces.
#include <algorithm>
#include <vector>
#include <omp.h>

int data_count = 100000;
std::vector<int> vec_(data_count); // assuming vec_ is the data being sorted

// first index of block i when the data is cut into n roughly equal blocks
auto get_block_edge = [data_count](int i, int n) {
  return data_count * i / n;
};

int blocks = 0;
#pragma omp parallel
{
  // every thread writes the same value here, so the race is benign
  blocks = omp_get_num_threads();
  int block = omp_get_thread_num();
  int start = get_block_edge( block, blocks );
  int finish = get_block_edge( block+1, blocks );
  // each thread sorts its own disjoint range, so no locking is needed
  std::sort( std::begin(vec_)+start, std::begin(vec_)+finish );
}
Now we have a bunch of sorted blocks. You just need to merge them.
for (int merge_step = 1; merge_step < blocks; merge_step *= 2 )
{
  // number of (possibly incomplete) pairs of sorted runs at this step
  int pairs = (blocks + 2*merge_step - 1) / (2*merge_step);
  #pragma omp parallel for
  for (int i = 0; i < pairs; ++i) {
    // clamp the block indices so this works when blocks isn't a power of 2;
    // get_block_edge(blocks, blocks) is exactly data_count, so no overrun
    int start  = get_block_edge( i*2*merge_step, blocks );
    int mid    = get_block_edge( std::min( i*2*merge_step +   merge_step, blocks ), blocks );
    int finish = get_block_edge( std::min( i*2*merge_step + 2*merge_step, blocks ), blocks );
    auto b = std::begin(vec_);
    // a trailing run with no partner gets mid == finish: a harmless no-op
    std::inplace_merge( b+start, b+mid, b+finish );
  }
}
I think that should be a highly parallel merge; the std::min clamps are what keep the indices in range when blocks isn't a power of 2, which is exactly where the off-by-1 errors like to hide.
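If you want to convince yourself it works, a minimal harness along these lines will do; block_sort_merge is a hypothetical wrapper around the two phases above, and the seed is arbitrary:

#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

void block_sort_merge(std::vector<int>& vec_); // hypothetical wrapper around the code above

int main() {
  std::vector<int> vec_(100000);
  std::mt19937 rng(42); // arbitrary fixed seed, so runs are repeatable
  std::uniform_int_distribution<int> dist(0, 1 << 30);
  for (auto& x : vec_) x = dist(rng);

  block_sort_merge(vec_);

  // the only property that matters: the whole range is sorted
  assert(std::is_sorted(std::begin(vec_), std::end(vec_)));
}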
Now, this is still O(n) even with an infinite number of threads, because the very last inplace_merge over all n elements runs on a single thread. Getting around that is, to put it mildly, tricky.
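The usual trick is a divide-and-conquer merge: split the larger run at its median, binary-search that value in the other run, rotate the two middle pieces past each other, and the two halves become independent merges you can hand to separate tasks. A minimal sketch, assuming it's invoked from inside a #pragma omp parallel / #pragma omp single region; parallel_inplace_merge, depth, and the 4096 cutoff are my own names and numbers, not anything standard:

#include <algorithm>

template <class It>
void parallel_inplace_merge(It first, It mid, It last, int depth) {
  // depth: roughly log2 of the thread count
  if (depth <= 0 || last - first < 4096) { // small or deep enough: go serial
    std::inplace_merge(first, mid, last);
    return;
  }
  It m1, m2;
  if (mid - first >= last - mid) {            // split the larger run...
    m1 = first + (mid - first) / 2;           // ...at its median...
    m2 = std::lower_bound(mid, last, *m1);    // ...and find that value's slot
  } else {
    m2 = mid + (last - mid) / 2;
    m1 = std::upper_bound(first, mid, *m2);   // upper_bound keeps it stable
  }
  It new_mid = m1 + (m2 - mid);               // boundary after the rotate
  std::rotate(m1, mid, m2);                   // gather both low halves on the left
  #pragma omp task                            // left half merges as a task...
  parallel_inplace_merge(first, m1, new_mid, depth - 1);
  parallel_inplace_merge(new_mid, m2, last, depth - 1); // ...right half here
  #pragma omp taskwait
}

This still isn't a worst-case fix: that std::rotate can itself move O(n) elements on one thread near the top. Getting a genuinely sublinear critical path means giving up on in-place and merging into a scratch buffer, which is where the "tricky" really starts.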