
I have implemented concurrent quicksort in C++ using OpenMP.

#include <omp.h>
#include <iostream>
#include <algorithm>
using namespace std;

void sort(int *a, int low, int high);
int partition(int *a, int low, int high);

class QuickSort {
    private:
        int *arr;
        int len;

    public:
        void init();
        void Sort();
        void Display();
};

int main() {
    cout << "Program implementing Quicksort." << endl;

    QuickSort a;

    a.init();
    a.Sort();
    a.Display();
}

void sort(int *a, int low, int high) {
    if(high < low || low == high)
        return;
    if(high == low+1) {
        if(a[low] > a[high])
            swap(a[low], a[high]);
        return;
    }
    int pivotidx = partition(a, low, high);
    /*for(int i = 0; i < 5; ++i)
        cout << a[i] << " ";
    cout << endl;*/
    cout << "Pivot element has been placed at correct position: " << pivotidx << " by thread " << omp_get_thread_num() << endl;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            sort(a, low, pivotidx);
        }
        #pragma omp section
        {
            sort(a, pivotidx+1, high);
        }
    }
}

int partition(int *a, int low, int high) {
    int pivot = low;
    int pivotval = a[low];
    int leftpointer = low;
    int rightpointer = high;
    while(leftpointer < rightpointer) {
        while(a[leftpointer] <= a[pivot] && leftpointer <= high)
            ++leftpointer;
        if(leftpointer > high)
            --leftpointer;
        while(a[rightpointer] >= a[pivot] && rightpointer >= low)
            --rightpointer;
        if(rightpointer < low)
            ++rightpointer;
        if(leftpointer < rightpointer)
            swap(a[leftpointer], a[rightpointer]);
    }
    a[low] = a[rightpointer];
    a[rightpointer] = pivotval;
    return rightpointer;
}

void QuickSort::init() {
    cout << "Enter the number of elements in the array: ";
    cin >> len;

    cout << "Enter the elements of the array: ";
    arr = new int[len];
    for(int i = 0; i < len; ++i)
        cin >> arr[i];
}

void QuickSort::Sort() {
    sort(arr, 0, len-1);
}

void QuickSort::Display() {
    cout << "Sorted array is: " << endl;
    for(int i = 0; i < len; ++i)
        cout << arr[i] << " ";
    cout << endl;
}

It's sorting correctly, but I am not sure whether it is really running on multiple cores. How can I check this? Also, my parallel code is very similar to the one in the top answer here. It mentions at the end that this approach cannot extract more parallelism than two threads: if it is executed with more threads, the other threads do not have any work to do and will just sit idle. Why is that?

Stupid Man
  • Within the parallel section you have 2 sub-sections. Each sub-section is executed by 1 thread. Therefore, for each divide step in quicksort you would achieve a maximum parallelization of 2. But this is what you want. The first cut would spawn threads 1 and 2; 1 would in turn spawn 3 and 4, and 2 would spawn 5 and 6. So on and so forth. If I'm not mistaken – Srini Apr 05 '17 at 03:28
    @SrinivasSuresh No additional spawning will happen unless nested OpenMP is enabled. – Vladimir F Героям слава Apr 05 '17 at 06:09
  • I believe you are right. – Srini Apr 05 '17 at 06:10
  • If you use recursive parallelism you can easily achieve huge oversubscription and poor performance. You would be much better using tasks. (Or, come to that using Threading Building Blocks since this is C++, and there's already an example to "borrow" https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Design_Patterns/Divide_and_Conquer.html ) – Jim Cownie Apr 05 '17 at 08:30

1 Answer


There is a subtle error in partition:

    while(a[leftpointer] <= a[pivot] && leftpointer <= high)
    ...
    while(a[rightpointer] >= a[pivot] && rightpointer >= low)

In both cases, you must swap the order of those checks; otherwise you sometimes access a[leftpointer] while leftpointer > high, which may be out of bounds. The same applies to the second while condition with rightpointer < low.
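To make the fix concrete, here is a sketch of the corrected partition with the bounds checks moved in front of the array accesses, so that short-circuit evaluation prevents the out-of-bounds reads (the name partition_fixed is mine):

```cpp
#include <algorithm>

// Corrected Hoare-style partition: the index bounds checks now come
// first, short-circuiting before the potentially out-of-range access.
int partition_fixed(int *a, int low, int high) {
    int pivotval = a[low];
    int left = low;
    int right = high;
    while (left < right) {
        while (left <= high && a[left] <= pivotval)
            ++left;
        if (left > high)
            --left;
        while (right >= low && a[right] >= pivotval)
            --right;
        if (right < low)
            ++right;
        if (left < right)
            std::swap(a[left], a[right]);
    }
    // Place the pivot at its final position and report it.
    a[low] = a[right];
    a[right] = pivotval;
    return right;
}
```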

Also, do not lie to the reader: leftpointer is not a pointer but an index! There are other severe style issues, but since this is not Code Review, I will focus on the parallelization.

Using sections for parallelism here is far from ideal. For instance, you must enable nested parallelism to even have more than two threads possibly active at the same time. Instead, you should use OpenMP tasks. Now, spawning a task for each and every call to sort is bad, because it creates many tiny tasks and has a poor overhead-to-work ratio. Instead, create tasks only for large-enough chunks of data, and make sure that no runtime overhead occurs further down in the recursion. For that, a second, purely serial recursive function is the best option:

void sort_serial(int* a, int low, int high)
{
    if (high < low || low == high)
        return;
    if (high == low + 1)
    {
        if (a[low] > a[high])
            swap(a[low], a[high]);
        return;
    }
    int pivotidx = partition(a, low, high);
    sort_serial(a, low, pivotidx);
    sort_serial(a, pivotidx + 1, high);
}

void sort(int* a, int low, int high)
{
    if (high < low || low == high)
        return;
    if (high == low + 1)
    {
        if (a[low] > a[high])
            swap(a[low], a[high]);
        return;
    }
    int pivotidx = partition(a, low, high);

    // This is an arbitrary threshold.
    if (high - low > 1000000)
    {
#pragma omp task
        {
            sort(a, low, pivotidx);
        }
#pragma omp task
        {
            sort(a, pivotidx + 1, high);
        }
    }
    else
    {
        sort_serial(a, low, pivotidx);
        sort_serial(a, pivotidx + 1, high);
    }
}

To get going with tasks, you must create a parallel region somewhere, and usually narrow it down to a single thread that starts the recursion, e.g. like so:

void QuickSort::Sort()
{    
#pragma omp parallel
    {
#pragma omp single
        sort(arr, 0, len - 1);
    }
}

For a large-enough input and a good choice of threshold, this exposes sufficient work that can be done in parallel without creating huge overhead.
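As a side note, OpenMP's `final` clause can express the same cutoff without a second function: once a task is `final`, all of its descendant tasks are "included" and run immediately in the creating thread. A sketch of this variant (not from the answer; it uses a simple Lomuto partition and an arbitrary threshold):

```cpp
#include <omp.h>
#include <algorithm>

// Task-based quicksort where the serial cutoff is expressed with the
// `final` clause: below the threshold, child tasks become included
// tasks and execute inline, avoiding task-scheduling overhead.
void qsort_task(int *a, int low, int high) {
    if (high <= low)
        return;
    // Simple Lomuto partition with a[high] as the pivot.
    int pivot = a[high];
    int i = low - 1;
    for (int j = low; j < high; ++j)
        if (a[j] < pivot)
            std::swap(a[++i], a[j]);
    std::swap(a[i + 1], a[high]);
    int p = i + 1;

    #pragma omp task final(high - low < 10000)  // arbitrary threshold
    qsort_task(a, low, p - 1);
    #pragma omp task final(high - low < 10000)
    qsort_task(a, p + 1, high);
    #pragma omp taskwait
}
```

It is called the same way as the answer's version, from a parallel region narrowed down to a single thread.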

To check how this runs in parallel, you would usually use operating-system-specific monitoring tools, e.g. time on Linux: if multiple cores are actually used, the reported user (CPU) time exceeds the real (wall-clock) time. You can also use a dedicated performance analysis tool that can show you in detail how threads execute your tasks in parallel.

Zulan