
I have an algorithm, one goal of which is to fill vectors. For the sake of performance, iterations of the algorithm are spread across OpenMP threads. I was wondering which of the following would be the better/safer way of filling the vectors.

Note that the ordering of the vectors must be consistent (i.e. value n of vec1 must come from the same iteration as value n of vec2).

Hypothesis 1:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel for
for(...)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    #pragma omp critical
    {
        vec1.push_back(val1);
        vec2.push_back(val2);
    }
}
// Then go on to work with vec1 and vec2...

Hypothesis 2:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel
{
    std::vector<BasicType> vec1local;
    std::vector<BasicType> vec2local;
    #pragma omp for
    for(...)
    {
        // Do some intensive stuff to compute val1 and val2
        // ...

        vec1local.push_back(val1);
        vec2local.push_back(val2);
    }

    #pragma omp critical
    {
        // See note 1 below
        vec1.insert(vec1.end(), vec1local.begin(), vec1local.end());
        vec2.insert(vec2.end(), vec2local.begin(), vec2local.end());
    }
}
// Then go on to work with vec1 and vec2...

Note 1: This was taken from Best way to append vector to vector
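
As a side note on the merge step: if BasicType turns out to be expensive to copy, the insert could use move iterators so that the thread-local elements are moved rather than copied. A minimal sketch, assuming C++11 and <iterator> (gcc 4.8.4 supports this with -std=c++11):

#pragma omp critical
{
    // Move the thread-local elements into the shared vectors instead of copying them
    vec1.insert(vec1.end(),
                std::make_move_iterator(vec1local.begin()),
                std::make_move_iterator(vec1local.end()));
    vec2.insert(vec2.end(),
                std::make_move_iterator(vec2local.begin()),
                std::make_move_iterator(vec2local.end()));
}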

Note 2: It seems val1 and val2 could be combined in some object in order to maintain consistency and work with only one vector, but for now that seems impractical for the remainder of the algorithm.
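
For illustration only, that combined-object idea might look like the following (a hypothetical Sample struct, not used elsewhere in this question):

struct Sample { BasicType val1; BasicType val2; };  // hypothetical combined object
std::vector<Sample> samples;
// inside the loop:
samples.push_back({val1, val2});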

Note 3: To give an order of magnitude, the for loop contains around 100 iterations, split among 4 threads. Apart from a very few exceptions, each iteration is expected to have the same workload (which raises the issue of the critical sections all happening around the same time).

Note 4: Just for the record, the whole thing deals with image stabilisation and runs on a Tegra K1 architecture. The compiler used is gcc 4.8.4.

Vincent COISY
  • If you merge your vectors like that, you'll want to use move iterators to save copying so much. – OMGtechy Mar 31 '16 at 14:55
  • Does the order of the filling of the pairs matter? I mean in serial you fill (val1,val2)_0, then (val1,val2)_1, ... (val1,val2)_n-1. If it has to be in order like this then neither of your hypotheses will work. – Z boson Apr 01 '16 at 09:46
  • See [this](http://stackoverflow.com/questions/18669296/c-openmp-parallel-for-loop-alternatives-to-stdvector/18671256#18671256) and [this](http://stackoverflow.com/questions/35675466/reductions-in-parallel-in-logarithmic-time) for more info about filling vectors and merging in parallel. – Z boson Apr 01 '16 at 09:48
  • Also if you know how many elements of the vectors you will have ahead of time there is no reason to use a vector. Just use an array or access the vector like an array as in [Zulan's answer](http://stackoverflow.com/a/36351954/2542702). – Z boson Apr 01 '16 at 09:50
  • @Zboson No, the order doesn't matter, as some sorting and filtering occurs later in the process. The latter being why I prefer to go with vectors rather than arrays. – Vincent COISY Apr 01 '16 at 13:33
  • @VincentCOISY, what I meant was can you access the vector like an array (i.e. random access with array notation) when you want to do it parallel or do you have to use `push_back`? If you can access it like an array then your question is not so interesting. – Z boson Apr 01 '16 at 14:00

2 Answers


From your two suggestions, I would prefer the first. It is simpler and does not require additional memory. However, I would suggest an alternative without the critical section:

std::vector<BasicType> vec1(size);
std::vector<BasicType> vec2(size);
#pragma omp parallel for
for(int i = 0; i < size; i++)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    vec1[i] = val1;
    vec2[i] = val2;
}

Note that this may have some performance issues due to cache line stealing (false sharing). However, I wouldn't worry about that unless it turns out to be an actual problem confirmed by performance analysis. Possible remedies would then be:

  • Go with your second solution. (which costs memory and additional post-processing)
  • Align your vector and use appropriate chunks for the loop, such that each thread works on its own cache lines. (A sketch of this idea follows the list.)
  • Use a data structure that internally contains the local vectors, but to the outside provides the necessary global vector-operations. (This is probably overall the best solution.)
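
A minimal sketch of the chunking idea, assuming a 64-byte cache line; the static chunk size is chosen so that each thread writes whole cache lines (for this to fully work, the vector's buffer would also need to be cache-line aligned, e.g. via a custom allocator, which is not shown here):

const int chunk = 64 / sizeof(BasicType);  // elements per (assumed) 64-byte cache line
#pragma omp parallel for schedule(static, chunk)
for(int i = 0; i < size; i++)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    vec1[i] = val1;
    vec2[i] = val2;
}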
Zulan
  • With static OpenMP scheduling the cache contention should have only minimal or no effect on performance. – Daniel Langr Apr 01 '16 at 09:18
  • Thanks, I think this should do the trick in the simplest way. – Vincent COISY Apr 01 '16 at 13:36
  • At least for now... However I'm concerned that I may come up with the case where some iterations may not contribute to the vector filling, in which case I'd have to deal with unset elements. – Vincent COISY Apr 01 '16 at 13:43
  • I am not sure my answer is correct. It works in the tests I have done but I think there may be some sync issues which could give wrong results. I based my answer off your answer [here](http://stackoverflow.com/a/35805567/2542702). – Z boson Apr 01 '16 at 13:48
  • @VincentCOISY: In the 'sparse' case, just go with your solution 1, and revisit only if performance analysis indicates that this becomes an issue. [Minor disclaimer: this assumes `BasicType` is easy to move or copy so that the `push_back` is assumed to be rather inexpensive.] – Zulan Apr 01 '16 at 15:12

I am going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you fill the vectors in parallel and then merge them in log2(t) operations instead of t operations (as you are doing now), like this:

// Pairwise merge of the per-thread vectors in the range [begin, end) into v1[begin] and v2[begin]
void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2, int begin, int end) {
    if(end - begin == 1) return;
    int pivot = (begin+end)/2;
    #pragma omp task
    reduce(v1, v2, begin, pivot);
    #pragma omp task
    reduce(v1, v2, pivot, end);
    #pragma omp taskwait
    v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
    v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
}

and

std::vector<BasicType> v1, v2;
std::vector<BasicType> *v1p, *v2p;
#pragma omp parallel
{
    #pragma omp single
    {
        v1p = new std::vector<BasicType>[omp_get_num_threads()];
        v2p = new std::vector<BasicType>[omp_get_num_threads()];
    }
    #pragma omp for
    for(...) {
        // Do some intensive stuff to compute val1 and val2
        // ...
       v1p[omp_get_thread_num()].push_back(val1);
       v2p[omp_get_thread_num()].push_back(val2);
    }
    #pragma omp single
    reduce(v1p, v2p, 0, omp_get_num_threads());
}
v1 = v1p[0], v2 = v2p[0];
delete[] v1p;
delete[] v2p;

For example with 8 threads this will join the vectors for threads

(0,1) (2,3) (4,5) (6,7)
(0,2) (4,6)
(0,4)

For more information about filling vectors in parallel see this. For more information about merging threads in log2(t) operations see the answer to this question.

Z boson
  • Yes, vectors are most likely mandatory (see my comment to Zulan's answer). Btw thanks for the reference links. – Vincent COISY Apr 01 '16 at 13:48
  • @VincentCOISY, btw, I'm not sure my answer is 100% correct. For example if I have eight threads and thread0 processes (0,1) and finishes before thread1 finishes (2,3) it would start to process (0,2) before (2,3) was finished. I get the correct results doing some tests but I may have missed the correct unit tests. – Z boson Apr 01 '16 at 13:51
  • @VincentCOISY, I think it might be okay. The documentation says "The TASKWAIT construct specifies a wait on the completion of child tasks generated since the beginning of the current task." So as long as it waits on each child task I think it is okay (I am still learning about the `task` directive). – Z boson Apr 01 '16 at 13:57
  • I think your answer is fine, but it looks like premature optimization to me. Please note that you are not reducing the number of operations, you are reducing the critical path by parallelizing them. – Zulan Apr 01 '16 at 14:59
  • @Zulan, thanks for the verification. This is a premature optimization because I wanted the question to be more interesting than it is. I probably should not have given an answer and instead appended this answer to [this](https://stackoverflow.com/questions/18669296/c-openmp-parallel-for-loop-alternatives-to-stdvector/18671256#18671256) answer. By operations I was referring to the number of insert operations needed, not all operations. – Z boson Apr 01 '16 at 19:12