
I've developed a program that reads numbers from a .txt file, stores them in a vector, and runs them through a series of combinations and calculations to determine whether the result matches the number I want. This process is done across multiple threads, where each thread handles a share of the iterations of the parallel for loop.

Long story short, the processing time varies a lot for larger inputs (e.g. 9 numbers), where it can be as short as 3 minutes or longer than 10 minutes.

Here's the benchmark that I've tried so far:

8 numbers serial : 18.119 seconds
8 numbers multithread (first-try): 10.238 seconds
8 numbers multithread (second-try): 18.943 seconds
9 numbers serial : 458.980 seconds
9 numbers multithread (first-try): 172.347 seconds
9 numbers multithread (second-try): 519.532 seconds   //Seriously?
//Another try after suggested modifications
9 numbers multithread (first-try): 297.017 seconds
9 numbers multithread (second-try): 297.85 seconds
9 numbers multithread (third-try): 304.755 seconds
9 numbers multithread (fourth-try): 396.391 seconds

So the question is: is there any way to improve the multithreaded program so that it shuffles/calculates the numbers in the least amount of time?

Here's a portion of the code where parallel for loop occurs (Updated with slight modifications):

#include <iostream>    
#include <fstream>
#include <string>
#include <vector>
#include <stdlib.h>
#include <algorithm>
#include <stdio.h>
#include <Windows.h>
#include <omp.h>

#define OPERATORSIZE 3

using namespace std;
int cur_target;
ofstream outFile;

string get_operator(int i) {
        switch (i) {
        case 0:
            return "+";
        case 1:
            return "-";
        case 2:
            return "*";
        case 3:
            return "/";
        default:
            return "";
        }
}

int prev_num_pos(vector<int> &cur_equation, int count) { 
    for (int i = count - 1; i >= 0; i--) { 
        if (cur_equation[i] != -1) return i + 1;
    }
    return 0;
}

bool nextoperator(int k, vector<int> &operator_array) {
        for (int i = k - 2; i >= 0; i--) {
            if (operator_array[i] < OPERATORSIZE) {
                operator_array[i] += 1;
                break;
            }
            else
                operator_array[i] = 0;
            switch (i) {
            case 0:
                return false;
            }
        }
        return true;
}

void vector_combination(vector<int> int_list) {                                             // Generate the number combinations from the number list

    bool div_remainder = false;
    int count = 0;

    #pragma omp parallel for schedule(dynamic) firstprivate(div_remainder) reduction(+:count) 
    for (int i = 0; i < int_list.size(); ++i) {

        vector<int> cur_equation, cur_temp, cur_list, operator_array;

        auto list = int_list;
        rotate(list.begin(), list.begin() + i, list.begin() + i + 1);
        do
        {
            cur_list.clear();
            operator_array.clear();
            for (auto x : list)
                cur_list.push_back(x);
            for (int i = 0; i < cur_list.size() - 1; i++)
                operator_array.push_back(0);

            do
            {
                div_remainder = false;
                count = 0;
                cur_equation = operator_array;
                cur_temp = cur_list;

                for (int i = 0; i < cur_equation.size(); ++i) {                                 // Check for equation priorities
                    if (cur_equation[i] == 3) {
                        count = i;
                        if (cur_temp[count] % cur_temp[count + 1] != 0) {
                            div_remainder = true;
                            break;
                        }
                    }
                }

                if (div_remainder)
                    continue;

                for (int i = 0; i < cur_temp.size() - 1; ++i) {
                    count = -1;
                    if (cur_equation[i] == 2 || cur_equation[i] == 3) {
                        count = prev_num_pos(cur_equation, i);
                    }
                    else
                        continue;
                    if (cur_equation[i] == 2) {
                        cur_temp[count] *= cur_temp[i + 1];
                        cur_equation[i] = -1;
                    }
                    else if (cur_equation[i] == 3) {
                        if (cur_temp[i + 1] != 0) {
                            cur_temp[count] /= cur_temp[i + 1];
                            cur_equation[i] = -1;
                        }
                        else {
                            div_remainder = true;
                            break;
                        }
                    }
                }

                if (div_remainder)
                    continue;

                for (int i = 0; i < cur_temp.size() - 1; ++i) {
                    switch (cur_equation[i]) {
                    case 0: {
                        cur_temp[0] += cur_temp[i + 1];                                             // Addition
                        cur_equation[i] = -1;
                        break;
                    }
                    case 1: {                                                                       // Subtraction
                        cur_temp[0] -= cur_temp[i + 1];
                        cur_equation[i] = -1;
                        break;
                    }
                    }
                }


                if (cur_temp[0] == cur_target) {
                    #pragma omp critical
                    {
                        for (int i = 0; i < cur_list.size(); ++i) {
                            outFile << cur_list[i];
                            if (i < cur_list.size() - 1) { outFile << get_operator(operator_array[i]); }
                        }
                        outFile << "\n";
                    }
                }

            } while (nextoperator(cur_list.size(), operator_array));

            // Send to function to undergone a list of operator combinations
        } while (next_permutation(list.begin() + 1, list.end()));
    }
}

int main(void) {
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    vector<int> int_list;
    string line;
    ifstream myfile("Problem.txt");
    if (myfile.is_open()) {
        while (getline(myfile, line)) {
            int num = stoi(line);
            int_list.push_back(num);
            cur_target = num;
        }
    }
    else
        cout << "Unable to open file." << endl;

    myfile.close();
    int_list.pop_back();
    sort(int_list.begin(), int_list.end());
    outFile.open("answer.txt");
    vector_combination(int_list);
    outFile.close();

    int answer_count = 0;

    myfile.open("answer.txt");
    if (myfile.is_open()) {
        while (getline(myfile, line)) {
            ++answer_count;
            if (answer_count > 1)
                break;
        }
    }
    myfile.close();

    if (answer_count == 0) {
        outFile.open("answer.txt");
        outFile << "-1" << endl;
    }
    outFile.close();

    return 0;
}

As for the sample input, create a .txt file named "Problem.txt" with random numbers like so (the last number is the targeted result) (updated with the current sample input used for the benchmark):

28
55
78
77
33
65
35
62
19
221

The hardware/software specification that the program runs on: Processor: i5 Sandy Bridge 2500K; RAM: 8 GB; OS: Windows 10 Professional; IDE: Visual Studio 2015 Enterprise Edition.

John Wan
  • Try to measure how much work is done by the program and what its running time is. On Linux there are `/usr/bin/time ./your_program` and `perf stat ./your_program`. Check whether your loop takes most of the time, or whether there is long I/O work done by the kernel or the HDD. Also check that your CPU cores are free and the CPU frequency is good (don't benchmark on a battery-powered laptop). It is hard to answer without knowing the sizes of the data (and every loop count), a full code+data example/profiling data, and OS and hardware info. – osgx Jul 23 '16 at 15:12
  • @osgx I've checked using the profiler. So far the section that consumes most of the resources is the vector copying; the loop doesn't cost much. Anyway, I'll try my best to update the question with the requirements you've mentioned. – John Wan Jul 23 '16 at 15:47
  • is the vector copying inside the openmp loop? Is the vector copy code included in the fragment in the question? What is the time of the openmp loop (<5%, 20%, 50% or more)? Can you post profiling results? What are the sizes of the vectors, the loop counts, the memory consumption? – osgx Jul 23 '16 at 17:17
  • @osgx Yes, the copy occurs within the loop; it is within the fragment I've provided above. The time of the loop was merely 6.5%, whereas the vector copies used up about 11 ~ 22.5% (I'll highlight the code fragment above). The parallel for loop count is based on the size of the vector, which in turn is based on the numbers within the .txt file. The do-while loops are in charge of generating the different combinations of operators and operands. – John Wan Jul 23 '16 at 17:47
  • John, can you post the full program source code and example inputs? – osgx Jul 23 '16 at 18:29
  • @osgx I've updated the question, have a look. – John Wan Jul 24 '16 at 02:26
  • Maybe you simply have exponential runtime and there is nothing wrong with your multithreading? *If* you indeed check all permutations of (something that is similar in length to) your input, then your runtime will definitely be > exponential. – JohnB Jul 24 '16 at 13:39
  • Btw: You did not write (or I did not read) whether you get strongly varying runtimes for several runs with the same input data (which would be strange), or with different sets of inputs (which might be reasonable, depending on your algorithm). – JohnB Jul 24 '16 at 13:41
  • @JohnB I've checked, the permutations are indeed correct. The runtime does vary a lot with the same input data, and that's what concerns me. Maybe I should test it on other machines. – John Wan Jul 24 '16 at 14:01
  • John, did you run first-try and second-try on the same input files? Can you add runtime of single-threaded program? – osgx Jul 25 '16 at 12:13
  • @osgx I did, I've named it as serial in the question above, so far the sequential method gives me the most precise runtime with minimal variations. – John Wan Jul 25 '16 at 14:36
  • To be honest, I didn't fully read your code while answering. What is your idea behind the reduction over `count`? Seems to me that `count` should actually be local (thus private) to the second `do`-loop. – Zulan Jul 26 '16 at 06:36
  • @Zulan Previously I set it as private(count) instead of a reduction to test the performance, but the variations were too large to decide which is better. Another reason I declared it outside of the parallel for is that I was trying to reduce the number of variable declarations. But you're right, I should have privatized it in the second do-loop. I'll modify the code and give it a go. – John Wan Jul 26 '16 at 07:00
  • @Zulan Came back from 2 test runs; the variation between them was approximately 80 seconds. – John Wan Jul 26 '16 at 07:14
  • Is that with 1) moved `omp critical`, 2) localized `count`, 3) affinity setting? – Zulan Jul 26 '16 at 10:43
  • @Zulan 1 and 2, I have no clue how to apply the third one. :/ – John Wan Jul 26 '16 at 11:45
  • Apparently Visual Studio [doesn't support `OMP_PROC_BIND`](https://msdn.microsoft.com/en-us/library/6sfk977f.aspx), so you may have to resort to [more sophisticated solutions](https://stackoverflow.com/questions/24862488/thread-affinity-with-windows-msvc-and-openmp). – Zulan Jul 26 '16 at 13:56
  • @Zulan That is unfortunate, I'll have a look on suggested solutions then, thanks. – John Wan Jul 26 '16 at 14:37

2 Answers


Move the #pragma omp critical inside the if condition. Since cur_temp is thread-private and cur_target is global read-only, it is not necessary to protect the condition itself with a critical section. This change drastically reduces the direct interaction between the threads and, on my system, consistently speeds up the parallel version.
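The pattern can be sketched as follows (a minimal, self-contained example with a hypothetical `count_matches` helper, not the asker's full program): the comparison reads only thread-private and read-only data, so it stays outside the critical section, and only the write to the shared stream is serialized.

```cpp
#include <cassert>
#include <sstream>
#include <vector>

// Hypothetical helper illustrating the fix: the `if` itself needs no
// protection, since it touches only private/read-only data; the critical
// section guards only the shared output stream.
int count_matches(const std::vector<int>& results, int target,
                  std::ostringstream& out) {
    int found = 0;
    #pragma omp parallel for reduction(+:found)
    for (int i = 0; i < static_cast<int>(results.size()); ++i) {
        if (results[i] == target) {      // unprotected: private/read-only data
            #pragma omp critical         // serialize only the shared stream
            {
                out << results[i] << "\n";
            }
            ++found;
        }
    }
    return found;
}
```

In the asker's code the same idea means entering `#pragma omp critical` only after `cur_temp[0] == cur_target` has already been established.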

I would weakly guess the performance variations were influenced by the (seemingly random) phase shift between the loops running on different threads.

If performance variation persists, try enabling thread binding. Check the documentation of your OpenMP implementation, look for OMP_PROC_BIND, "thread pinning", "binding", or "affinity".
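Since MSVC's OpenMP runtime ignores `OMP_PROC_BIND`, binding has to be done by hand. A minimal sketch, assuming Windows and the Win32 `SetThreadAffinityMask` API (the `pin_current_thread` helper is hypothetical; the non-Windows branch is a no-op stub so the snippet stays portable):

```cpp
#include <cassert>
#ifdef _WIN32
#include <Windows.h>
#endif

// Pin the calling thread to one logical core (hypothetical helper).
bool pin_current_thread(int core) {
#ifdef _WIN32
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(1) << core) != 0;
#else
    (void)core;  // no-op stub on non-Windows builds
    return true;
#endif
}

// Typical use: each OpenMP thread pins itself once, before the work loop:
//   #pragma omp parallel
//   {
//       pin_current_thread(omp_get_thread_num());
//       // ... the parallel for loop ...
//   }
```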

Zulan
  • Hi there, I've tried your suggested method but the variation still persists. I'll have a look for the keywords you've mentioned. Thanks – John Wan Jul 26 '16 at 06:25

Apparently the runtime variance was caused by the vectors. I checked with the performance analyzer and noticed that the time spent copying values between vectors was not consistent. I've modified the code to use pointer arrays instead, and the runtime is now tremendously improved and consistent.
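For readers who want to stay with the standard library, a similar effect can be had by allocating each thread's buffer once and overwriting it in place, instead of copy-constructing a fresh vector every iteration. A minimal sketch (the `sum_of_rotations` function is a made-up stand-in for the real combination search):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative only: one preallocated buffer per thread, overwritten
// with std::copy each iteration rather than `auto list = int_list;`,
// so no allocation happens inside the hot loop.
long long sum_of_rotations(const std::vector<int>& int_list) {
    long long total = 0;
    #pragma omp parallel
    {
        std::vector<int> list(int_list.size());  // allocated once per thread
        #pragma omp for reduction(+:total)
        for (int i = 0; i < static_cast<int>(int_list.size()); ++i) {
            // Overwrite in place: reuses the buffer, no reallocation.
            std::copy(int_list.begin(), int_list.end(), list.begin());
            std::rotate(list.begin(), list.begin() + i, list.begin() + i + 1);
            total += list[0];  // stand-in for the real per-permutation work
        }
    }
    return total;
}
```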

John Wan