
I've been testing parallel sorting using OpenMP. I implemented an odd-even sorting algorithm that is 3x faster than the version without OpenMP. However, std::sort is still far faster: sequential - 100 s, parallel - 20 s, std::sort - 0.05 s.

I tried moving the #pragma omp parallel for to the outer i-loop, but that performed even worse and no longer sorted the vector:

for (int i = 0; i < 100000; i++) {
    // only the inner loop (one odd or even phase) runs in parallel
    #pragma omp parallel for
    for (int j = (i % 2) ? 0 : 1; j < 100000 - 1; j += 2) {
        if (vec_[j] > vec_[j + 1]) {
            std::swap(vec_[j], vec_[j + 1]);
        }
    }
}

Tbh, I expected parallel odd-even sort to be the fastest, but now I wonder - am I doing something wrong, or is std::sort just that efficient?

Max Vlasov
  • It seems like `std::swap` may also run in parallel given certain circumstances. See this answer: https://stackoverflow.com/a/43162632/10411602 Besides, an improved bubble sort isn't going to beat something like an introsort, which is what STL uses. – Blaze May 10 '19 at 13:21
  • how many threads? If the sequential version takes only 0.05 seconds I would not expect any speedup from adding more threads; I'd rather expect to see the overhead of using them – 463035818_is_not_an_ai May 10 '19 at 13:21
  • Is the complexity of your algorithm logarithmic? From the surface, it looks like a plain old (slow, O(n*n)) bubble sort with a few tweaks. `std::sort` is logarithmic in nature. – PaulMcKenzie May 10 '19 at 13:23
  • more threads does not always result in faster execution. In fact problems that benefit from adding more and more threads are extremely rare. Try to increase the workload – 463035818_is_not_an_ai May 10 '19 at 13:24
  • 1. Cache ownership between cores most likely causes a slowdown when parallelized. 2. As you noticed, the top loop can't be parallelized, simply because one iteration depends on the result of the previous one. 3. std::sort is efficient; that's why it's in std. Usually it's a combination of quicksort and insertion sort, which are both algorithmically superior to odd-even – rAndom69 May 10 '19 at 13:28
  • @user463035818 CPU is 8 threads. It doesn't matter how many integers I input - std::sort is still faster – Max Vlasov May 10 '19 at 13:35
  • I think @PaulMcKenzie explained that well. You have a slow algorithm. Adding 8 threads does not make up for the difference between O(n log n) and O(n^2) – drescherjm May 10 '19 at 13:41
  • If your _O(n^2)_ algorithm takes 100 [s] when run single-threaded, you would need 2000 threads to reduce it to 0.05 [s]. In the ideal case with perfect linear speedup. – Daniel Langr May 10 '19 at 13:45
  • Also the fact that you're testing with 100k elements... Maybe bump it up to 100M elements and you'll have a significant enough workload to see a difference. – ZeroUltimax May 10 '19 at 13:46
  • [Here](https://www.wolframalpha.com/input/?i=plot+n*n%2F8+vs+n*log(n)+from+1+to+1000) is a plot that shows the difference in behaviour between `n*n/8` and `n*log(n)`. – molbdnilo May 10 '19 at 14:38
  • That graph shows that 8 threads with this algorithm will never be better than a single-threaded std::sort. The graph may be better at `n` less than 35 or so, but the time to create the threads is not considered, nor are other factors. – drescherjm May 10 '19 at 14:45

1 Answer


Your algorithm is O(n^2) in total work done. Using k threads, this at most makes it O(n^2/k).

std::sort is O(n lg n). Unless k is O(n / lg n), adding more threads won't keep up.
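
To put numbers on it: with n = 100,000, n^2 is on the order of 10^10 operations, while n lg n is only about 1.7 million. Even split over 8 threads, the O(n^2) sort does roughly 10^9 units of work - several hundred times more than std::sort - which lines up with the measured 20 s vs 0.05 s.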

And even if you did have piles of threads, NUMA on most mega-thread systems means that memory access is going to be a serious pain. The threads don't access the same memory on each pass, and in fact constantly pass data back and forth.

An example of a way that might speed up your work compared to a simple std::sort would be to split the data into K pieces, std::sort each of the K pieces, then do a parallel merge of those pieces.

// needs <algorithm> and <omp.h>
int data_count = 100000;
// first index of block i when the data is split into n blocks
auto get_block_edge = [data_count](int i, int n) {
  return data_count * i / n;
};
int blocks = 0;
#pragma omp parallel
{
  #pragma omp single
  blocks = omp_get_num_threads();
  int block = omp_get_thread_num();
  int start = get_block_edge( block, blocks );
  int finish = get_block_edge( block+1, blocks );
  // each thread sorts its own contiguous block
  std::sort( std::begin(vec_)+start, std::begin(vec_)+finish );
}

Now we have a bunch of sorted blocks; you just need to merge them:

// tree merge: each step merges pairs of adjacent sorted runs in parallel;
// this indexing assumes 'blocks' is a power of two
for (int merge_step = 1; merge_step < blocks; merge_step *= 2 )
{
  #pragma omp parallel for
  for (int i = 0; i < (blocks/2/merge_step); ++i) {
    int start = get_block_edge(i*2*merge_step, blocks);
    int mid = get_block_edge(i*2*merge_step+merge_step, blocks);
    int finish = get_block_edge(i*2*merge_step+2*merge_step, blocks);
    finish = std::min(finish, data_count);
    auto b = std::begin(vec_);
    std::inplace_merge( b+start, b+mid, b+finish );
  }
}

I think that should be a highly parallel merge. Or, more likely, it will segfault because I have an off-by-one error somewhere.
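
For reference, here is a minimal self-contained sketch of the same idea that compiles and can be verified end to end. The names (parallel_block_sort, vec) are just illustrative, and the merge loop is generalized slightly so that a non-power-of-two thread count still yields a fully sorted result:

#include <algorithm>  // std::sort, std::inplace_merge, std::is_sorted, std::min
#include <cstdio>
#include <omp.h>
#include <random>
#include <vector>

void parallel_block_sort(std::vector<int>& vec) {
  const int data_count = static_cast<int>(vec.size());
  // first index of block i when the data is split into n blocks
  auto get_block_edge = [data_count](int i, int n) {
    return data_count * i / n;
  };

  int blocks = 0;
  #pragma omp parallel
  {
    #pragma omp single
    blocks = omp_get_num_threads();
    int block = omp_get_thread_num();
    // each thread sorts its own contiguous block
    std::sort(vec.begin() + get_block_edge(block, blocks),
              vec.begin() + get_block_edge(block + 1, blocks));
  }

  // merge adjacent sorted runs, doubling the run length each step;
  // clamping the block indices keeps non-power-of-two counts correct
  for (int merge_step = 1; merge_step < blocks; merge_step *= 2) {
    #pragma omp parallel for
    for (int i = 0; i < blocks; i += 2 * merge_step) {
      int mid_block = std::min(i + merge_step, blocks);
      int end_block = std::min(i + 2 * merge_step, blocks);
      std::inplace_merge(vec.begin() + get_block_edge(i, blocks),
                         vec.begin() + get_block_edge(mid_block, blocks),
                         vec.begin() + get_block_edge(end_block, blocks));
    }
  }
}

int main() {
  std::vector<int> vec(100000);
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist(0, 1000000);
  for (int& x : vec) x = dist(rng);

  parallel_block_sort(vec);
  std::printf("sorted: %s\n",
              std::is_sorted(vec.begin(), vec.end()) ? "yes" : "no");
}

Compile with something like g++ -O2 -fopenmp.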

Now, this is still O(n) even with an infinite number of threads, as the very last merge has to be single-threaded. Getting around that is, to put it mildly, tricky.

Yakk - Adam Nevraumont
  • For this you will probably have to increase the size of the dataset to see the parallel gain since 0.05 seconds (time to sort the whole thing using std::sort) is already small. – drescherjm May 10 '19 at 14:06
  • @drescherjm 1/20th of a second is an eon for pure data manipulation. I'd be surprised if we couldn't make it faster with careful use of threads. Maybe not the code above, but still. – Yakk - Adam Nevraumont May 10 '19 at 15:09
  • I was thinking of the cost to create the threads. Although this (at least some of the answers) tells me I don't have to worry so much about that: https://stackoverflow.com/questions/3929774/how-much-overhead-is-there-when-creating-a-thread – drescherjm May 10 '19 at 15:10
  • @drescherjm There is nothing requiring threads to be created there. Idle threads in a thread pool could be used (if not by OpenMP (I cannot guarantee what OpenMP does), then by a different threaded sort implementation). There is a required inter-thread communication overhead, but that is small on the scale of 0.05 seconds. – Yakk - Adam Nevraumont May 10 '19 at 15:17