
I've been testing parallel sorting using OpenMP. I implemented an odd-even sorting algorithm that is 3x faster than the version without OpenMP. However, std::sort is still far faster: sequential - 100 s, parallel - 20 s, std::sort - 0.05 s.

I tried moving the #pragma omp parallel for to the outer i-loop, but that performed even worse and no longer sorted the vector:

for (int i = 0; i < 100000; i++) {
    // only the inner loop (one odd or even phase) runs in parallel
    #pragma omp parallel for
    for (int j = (i % 2) ? 0 : 1; j < 100000 - 1; j += 2) {
        if (vec_[j] > vec_[j + 1]) {
            std::swap(vec_[j], vec_[j + 1]);
        }
    }
}

Tbh, I expected parallel odd-even sort to be the fastest, but now I wonder - am I doing something wrong, or is std::sort just that efficient?

Max Vlasov
  • It seems like `std::swap` may also run in parallel given certain circumstances. See this answer: https://stackoverflow.com/a/43162632/10411602 Besides, an improved bubble sort isn't going to beat something like an introsort, which is what STL uses. – Blaze May 10 '19 at 13:21
  • how many threads? If the sequential version takes only 0.05 seconds I would not expect any speedup from adding more threads; I'd rather expect to see the overhead of using them – 463035818_is_not_an_ai May 10 '19 at 13:21
  • Is the complexity of your algorithm logarithmic? From the surface, it looks like a plain old (slow, O(n*n)) bubble sort with a few tweaks. `std::sort` is logarithmic in nature. – PaulMcKenzie May 10 '19 at 13:23
  • more threads does not always result in faster execution. In fact problems that benefit from adding more and more threads are extremely rare. Try to increase the workload – 463035818_is_not_an_ai May 10 '19 at 13:24
  • 1. Cache ownership between cores most likely causes a slowdown when parallelized. 2. As you noticed, the top loop can't be parallelized, simply because one iteration depends on the result of the previous one. 3. std::sort is efficient; that's why it's in std. Usually it's a combination of quicksort and insertion sort, which are both algorithmically superior to odd-even – rAndom69 May 10 '19 at 13:28
  • @user463035818 CPU is 8 threads. It doesn't matter how many integers I input - std::sort is still faster – Max Vlasov May 10 '19 at 13:35
  • I think @PaulMcKenzie explained that well. You have a slow algorithm. Adding 8 threads does not make up for the difference between O(n log n) and O(n^2) – drescherjm May 10 '19 at 13:41
  • If your _O(n^2)_ algorithm takes 100 [s] when run single-threaded, you would need 2000 threads to reduce it to 0.05 [s]. In the ideal case with perfect linear speedup. – Daniel Langr May 10 '19 at 13:45
  • Also the fact that you're testing with 100k elements... Maybe bump it up to 100M elements and you'll have a significant enough workload to see a difference. – ZeroUltimax May 10 '19 at 13:46
  • [Here](https://www.wolframalpha.com/input/?i=plot+n*n%2F8+vs+n*log(n)+from+1+to+1000) is a plot that shows the difference in behaviour between `n*n/8` and `n*log(n)`. – molbdnilo May 10 '19 at 14:38
  • That graph shows that 8 threads with this algorithm will never be better than a single-threaded std::sort. The graph may be better at `n` less than 35 or so, but the time to create the threads is not considered, nor are other factors. – drescherjm May 10 '19 at 14:45

1 Answer


Your algorithm is O(n^2) in total work done. Using k threads, this at most makes it O(n^2/k).

std::sort is O(n lg n). Unless k is O(n / lg n), adding more threads won't keep up.
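
To put numbers on it: with n = 100,000, n^2 is on the order of 10^10 operations, while n lg n is only about 1.7 million. Even split over 8 threads, the O(n^2) sort does roughly 10^9 units of work - several hundred times more than std::sort - which lines up with the measured 20 s vs 0.05 s.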

And even if you did have piles of threads, NUMA on most mega-thread systems means that memory access is going to be a serious pain. The threads don't access the same memory on each pass, and in fact constantly pass data back and forth.

An example of a way that might speed up your work compared to a simple std::sort would be to split the data into K pieces, std::sort each of the K pieces, then do a parallel merge of those pieces.

// needs <algorithm> and <omp.h>
int data_count = 100000;
// first index of block i when the data is split into n blocks
auto get_block_edge = [data_count](int i, int n) {
  return data_count * i / n;
};
int blocks = 0;
#pragma omp parallel
{
  #pragma omp single
  blocks = omp_get_num_threads();
  int block = omp_get_thread_num();
  int start = get_block_edge( block, blocks );
  int finish = get_block_edge( block+1, blocks );
  // each thread sorts its own contiguous block
  std::sort( std::begin(vec_)+start, std::begin(vec_)+finish );
}

Now we have a bunch of sorted blocks; you just need to merge them:

// tree merge: each step merges pairs of adjacent sorted runs in parallel;
// this indexing assumes 'blocks' is a power of two
for (int merge_step = 1; merge_step < blocks; merge_step *= 2 )
{
  #pragma omp parallel for
  for (int i = 0; i < (blocks/2/merge_step); ++i) {
    int start = get_block_edge(i*2*merge_step, blocks);
    int mid = get_block_edge(i*2*merge_step+merge_step, blocks);
    int finish = get_block_edge(i*2*merge_step+2*merge_step, blocks);
    finish = std::min(finish, data_count);
    auto b = std::begin(vec_);
    std::inplace_merge( b+start, b+mid, b+finish );
  }
}

I think that should be a highly parallel merge. Or, more likely, it will segfault because I have an off-by-one error somewhere.
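
For reference, here is a minimal self-contained sketch of the same idea that compiles and can be verified end to end. The names (parallel_block_sort, vec) are just illustrative, and the merge loop is generalized slightly so that a non-power-of-two thread count still yields a fully sorted result:

#include <algorithm>  // std::sort, std::inplace_merge, std::is_sorted, std::min
#include <cstdio>
#include <omp.h>
#include <random>
#include <vector>

void parallel_block_sort(std::vector<int>& vec) {
  const int data_count = static_cast<int>(vec.size());
  // first index of block i when the data is split into n blocks
  auto get_block_edge = [data_count](int i, int n) {
    return data_count * i / n;
  };

  int blocks = 0;
  #pragma omp parallel
  {
    #pragma omp single
    blocks = omp_get_num_threads();
    int block = omp_get_thread_num();
    // each thread sorts its own contiguous block
    std::sort(vec.begin() + get_block_edge(block, blocks),
              vec.begin() + get_block_edge(block + 1, blocks));
  }

  // merge adjacent sorted runs, doubling the run length each step;
  // clamping the block indices keeps non-power-of-two counts correct
  for (int merge_step = 1; merge_step < blocks; merge_step *= 2) {
    #pragma omp parallel for
    for (int i = 0; i < blocks; i += 2 * merge_step) {
      int mid_block = std::min(i + merge_step, blocks);
      int end_block = std::min(i + 2 * merge_step, blocks);
      std::inplace_merge(vec.begin() + get_block_edge(i, blocks),
                         vec.begin() + get_block_edge(mid_block, blocks),
                         vec.begin() + get_block_edge(end_block, blocks));
    }
  }
}

int main() {
  std::vector<int> vec(100000);
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist(0, 1000000);
  for (int& x : vec) x = dist(rng);

  parallel_block_sort(vec);
  std::printf("sorted: %s\n",
              std::is_sorted(vec.begin(), vec.end()) ? "yes" : "no");
}

Compile with something like g++ -O2 -fopenmp.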

Now, this is still O(n) even with an infinite number of threads, as the very last merge has to be single-threaded. Getting around that is, to put it mildly, tricky.

Yakk - Adam Nevraumont
  • For this you will probably have to increase the size of the dataset to see the parallel gain since 0.05 seconds (time to sort the whole thing using std::sort) is already small. – drescherjm May 10 '19 at 14:06
  • @drescherjm 1/20th of a second is an eon for pure data manipulation. I'd be surprised if we couldn't make it faster with careful use of threads. Maybe not the code above, but still. – Yakk - Adam Nevraumont May 10 '19 at 15:09
  • I was thinking of the cost to create the threads. Although this (at least some of the answers) tells me I don't have to worry so much about that: https://stackoverflow.com/questions/3929774/how-much-overhead-is-there-when-creating-a-thread – drescherjm May 10 '19 at 15:10
  • @drescherjm There is nothing requiring threads to be created there. Idle threads in a thread pool could be used (if not by OpenMP (I cannot guarantee what OpenMP does), then by a different threaded sort implementation). There is a required inter-thread communication overhead, but that is small on the scale of 0.05 seconds. – Yakk - Adam Nevraumont May 10 '19 at 15:17