
I'm trying to write the Selection Sort algorithm in C++ with OpenMP. The code I wrote sorts correctly, but as I increase the number of threads, the run time of the SelectionSort algorithm increases as well... I just started using OpenMP, so I'm not really good at it :D I can't use an OpenMP `reduction`, because I'm using the Visual Studio Community 2017 IDE. Thanks for the help ;)

#include "pch.h"
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <ctime>
#include <iomanip>
#include <algorithm>

using namespace std;

const int n = 10000;                            // count of numbers in vector

// ------------------- FUNCTIONS HEADERS -----------------------
vector<int> FillVector();
vector<int> SelectionSort(vector<int> data, int num_th);
void PrintArray(vector<int> data);

int main()
{
    std::vector<int> test_1(n);
    std::vector<int> test_2(n);
    std::vector<int> test_3(n);
    std::vector<int> test_4(n);

    cout << test_1.size() << endl;          // size of vector

    test_1 = FillVector();

    std::copy(std::begin(test_1), std::end(test_1), std::begin(test_2));    // copy vector test_1 to test_2
    std::copy(std::begin(test_1), std::end(test_1), std::begin(test_3));    // copy vector test_1 to test_3
    std::copy(std::begin(test_1), std::end(test_1), std::begin(test_4));    // copy vector test_1 to test_4

    // Testing times of sorting with different threads count


    // Number of Threads: 2
    int num_th = 2;
    clock_t begin = clock();
    test_1 = SelectionSort(test_1, num_th);     // sort vector test_1
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;

    // Number of Threads: 4
    num_th = 4;
    begin = clock();
    test_2 = SelectionSort(test_2, num_th);     // sort vector test_2
    end = clock();
    elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;

    // Number of Threads: 8
    num_th = 8;
    begin = clock();
    test_3 = SelectionSort(test_3, num_th);     // sort vector test_3
    end = clock();
    elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;

    // Number of Threads: 16
    num_th = 16;
    begin = clock();
    test_4 = SelectionSort(test_4, num_th);     // sort vector test_4
    end = clock();
    elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;

    return 0;
}

// ----------------- METHODS --------------------------------

// Fill vector with random integers
vector<int> FillVector() {
    vector<int> temp(n);

    srand(time(NULL));
    for (int i = 0; i < n; i++)
        temp[i] = rand() % n + 1;

    return temp;
}

// EXECUTE parallel Selection Sort using OpenMP
vector<int> SelectionSort(vector<int> data, int num_th)
{
    // start parallel method (OpenMP)
    // it works !!!
    vector<int> temp(n);
    std::copy(std::begin(data), std::end(data), std::begin(temp));      // make a temporary copy of array

    omp_set_num_threads(num_th);                                        // set threads number
    for (int i = 0; i < n - 1; i++)
    {
        int min_index = i;
#pragma omp parallel                                                    // start OpenMP parallel code block
        for (int j = i + 1; j < n; j++)
            if (temp[j] < temp[min_index])
                min_index = j;
        std::swap(temp[i], temp[min_index]);                            // std swap method
    }

    return temp;
}

// PRINT vector as 2D array
void PrintArray(vector<int> data)
{
    int rows, elem = 20;                                                // declare rows variable and column count

    if (n % elem != 0)                                                  // calculate rows count
        rows = n / elem + 1;
    else
        rows = n / elem;

    int iii = 0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < elem; j++) {
            if (iii != n) {
                cout << setw(3) << left << data[iii] << " ";
                iii++;
            }
            else
                break;
        }
        cout << endl;
    }
    cout << endl;
}

Result (when n = 5000):

size: 5000
threads 2: 5.607
threads 4: 8.421
threads 8: 10.979
threads 16: 27.989


Edvin
  • You'd normally want to use a `reduction` clause with your `omp parallel` but that doesn't work for finding the index of the smallest element. – 1201ProgramAlarm Nov 14 '18 at 22:08
  • Been a while since I used OMP, but I'm pretty sure you should use `#pragma omp parallel for`, not `#pragma omp parallel`. But also, the fact that you're doing this on an inner loop means excessive thread joins. Not to mention the inherent issues with `min_index`. – paddy Nov 14 '18 at 22:34

3 Answers


First, think about the selection sort algorithm. It repeatedly scans a (sub)array to find the smallest element, then moves it to the front. This process is fundamentally iterative, i.e. you must finish the scan over positions 0 to n before you can scan positions 1 to n. In your case, even though you have multiple threads, each pass still has to scan from the start position to the end position, in order, hence your additional threads don't help.

Because the algorithm is not parallelizable in this form, adding threads doesn't help. In fact, it slows the process down, because the new threads must be created and synchronized on every pass.

jdrd

Selection sort is really hard to parallelize, but you assumed correctly that you can parallelize the part where you find the smallest element in the rest of the array:

    #pragma omp parallel
    for (int j = i + 1; j < n; j++)
        if (temp[j] < temp[min_index])
            min_index = j;
The first problem I see is that you use plain `omp parallel`. This directive makes every available thread (all num_th of them) execute the same section of code, so each thread runs the whole inner loop. That will definitely not give you any speed improvement, and it can leave min_index holding the index of a value that is not minimal: multiple threads read and write the same variable concurrently, and a thread has no way of knowing whether another thread is about to overwrite min_index with the index of a larger value.

You can repair this by using `#pragma omp parallel for`, which divides the interval [i+1, n) among the num_th threads. But the concurrent access to min_index is a little trickier. The cleanest solution here would be an OpenMP reduction, but as you said, you cannot use it. The other options are an atomic variable, a critical section (which on its own will definitely not help you), or a compare-and-swap loop, but I doubt any of these will give you a significant speed improvement for this algorithm.

I would suggest using a more parallelization-friendly sorting algorithm, such as merge sort or quicksort.

Ramade

Expanding on the answer from @Ramade: The problematic piece of code is:

    for (int i = 0; i < n - 1; i++)
    {
        int min_index = i;
#pragma omp parallel                                                    // start OpenMP parallel code block
        for (int j = i + 1; j < n; j++)
            if (temp[j] < temp[min_index])
                min_index = j;
        std::swap(temp[i], temp[min_index]);                            // std swap method
    }

What you need is:

  • a #pragma omp for to actually distribute the inner loop

  • some reduction scheme to avoid having every thread access the shared variable `min_index`

  • and, preferably, to avoid spawning new threads at each iteration of the outermost loop, to reduce OpenMP overhead

For instance:

omp_set_num_threads(num_th);                                        // set threads number

// Spawn all threads once
int min_index_all=0;
int min_value_all=INT_MAX;
#pragma omp parallel shared(temp,min_index_all,min_value_all)
{
    //these are private variables
    int min_index,min_value;

    for (int i = 0; i < n - 1; i++)
    {
        min_index = i;
        min_value = temp[min_index];

        // OpenMP distributed loop
        // nowait because once the loop is complete the thread can do the reduction part immediately
        // static schedule because we don't expect a large unbalance of workload between threads
        #pragma omp for schedule(static) nowait
        for (int j = i + 1; j < n; j++)
            if (temp[j] < temp[min_index]){
                min_index = j;
                min_value = temp[min_index];
            }

        // reduction part in a critical section
        #pragma omp critical
        {
            if (min_value<min_value_all){
                min_index_all = min_index;
                min_value_all = min_value;
            }
        }

        // explicit barrier: wait for all threads to have finished the reduction
        #pragma omp barrier

        // swap done by a single thread
        // + initialization of shared variables for next iteration
        #pragma omp single
        {
            std::swap(temp[i], temp[min_index_all]);                        // swap with the globally smallest element
            min_index_all = i+1;
            min_value_all = INT_MAX;
        }

        // implicit barrier after omp single
        // all threads will wait there until the single section is done
        // then they will move together to next iteration of the outer loop
    }// end for i loop
}

Still, the loop you are attempting to parallelize is very simple and likely to run very efficiently on a single thread up to large values of n (values for which selection sort would not be an efficient algorithm anyway). OpenMP parallelization brings some overhead and will be a liability when the inner loop has only a few iterations.

You may want to split the outer loop so that parallel code is used while the inner loop still has many iterations, and a serial version once only a few remain.

#define SOME_VALUE 20000 // To be tweaked 
int i;
for (i = 0; i < n - 1 - SOME_VALUE; i++){
  //parallel version of the code
}

i = (n - 1 - SOME_VALUE < 0) ? 0 : n-1-SOME_VALUE;
for (; i < n - 1; i++){
  // finish with a serial version of the code
}
Brice