Based on this thread (OpenMP and STL vector), which data structures are good alternatives for a shared std::vector in a parallel for loop? The main concern is speed, and the vector might require resizing during the loop.
-
Show us some code, describe your specific situation... what will be stored in the vector? What will your loop do with it? It's very probable that it will be perfectly safe to use `std::vector` anyway. – LihO Sep 07 '13 at 03:15
-
As said in the linked thread, you only need to avoid `std::vector` when the vector is being resized, and possibly reallocated, inside your loop. If you just modify existing elements, you can use it perfectly fine. Can you elaborate on your requirements, and why a vector wouldn't suit your needs? – SinisterMJ Sep 07 '13 at 03:29
-
I think it's only a problem if the `std::vector` is shared. If it's private, then I don't think there is a problem using `push_back` or `resize`. – Z boson Sep 07 '13 at 07:23
2 Answers
I think you can use std::vector with OpenMP most of the time and still have good performance. The following code, for example, fills std::vectors in parallel and then combines them at the end. As long as your main loop/fill function is the bottleneck, this should work well in general and be thread safe.
std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait // fill vec_private in parallel
    for(int i=0; i<100; i++) {
        vec_private.push_back(i);
    }
    // merge the per-thread vectors one at a time
    #pragma omp critical
    vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
Edit: OpenMP 4.0 allows user-defined reductions using `#pragma omp declare reduction`. The code above can be simplified to
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
std::vector<int> vec;
#pragma omp parallel for reduction(merge: vec)
for(int i=0; i<100; i++) vec.push_back(i);
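For reference, a minimal self-contained sketch of the reduction version (assuming a compiler with OpenMP 4.0 support, e.g. GCC 4.9 or later, compiled with -fopenmp):
#include <cstdio>
#include <vector>
// user-defined reduction that merges the per-thread vectors into the shared one
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
int main() {
    std::vector<int> vec;
    #pragma omp parallel for reduction(merge: vec)
    for(int i=0; i<100; i++) vec.push_back(i);
    printf("size = %zu\n", vec.size()); // 100 elements; the merge order of the per-thread chunks is unspecified
    return 0;
}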
Edit: What I have shown so far does not fill the vector in order. If the order matters, then it can be done like this:
std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait schedule(static)
    for(int i=0; i<100; i++) {
        vec_private.push_back(i);
    }
    // one iteration per thread; ordered makes the merges run in thread order
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        vec.insert(vec.end(), vec_private.begin(), vec_private.end());
    }
}
This avoids saving a std::vector for each thread and then merging them in serial outside of the parallel region. I learned about this "trick" here. I'm not sure how to do this (or if it's even possible) for user-defined reductions; as it turns out, it is not possible.
I just realized that the critical section is not necessary, which I figured out from this question: parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread. This method also gets the order correct:
std::vector<int> vec;
size_t *prefix;
#pragma omp parallel
{
    int ithread  = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    #pragma omp single
    {
        prefix = new size_t[nthreads+1];
        prefix[0] = 0;
    }
    std::vector<int> vec_private;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<100; i++) {
        vec_private.push_back(i);
    }
    prefix[ithread+1] = vec_private.size();
    #pragma omp barrier
    #pragma omp single
    {
        // exclusive prefix sum of the per-thread sizes gives each thread's write offset
        for(int i=1; i<(nthreads+1); i++) prefix[i] += prefix[i-1];
        vec.resize(vec.size() + prefix[nthreads]);
    }
    // each thread copies its private vector to its own offset; no critical section needed
    std::copy(vec_private.begin(), vec_private.end(), vec.begin() + prefix[ithread]);
}
delete[] prefix;
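With <cassert> included, a quick sanity check right after the parallel region (a sketch, assuming vec started out empty as above) confirms that this version preserves the original order:
// each element should equal its index because the prefix sums place every thread's chunk at its correct offset
for (size_t i = 0; i < vec.size(); i++)
    assert(vec[i] == (int)i);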
-
As for the question in the very last sentence: "The number of times the _combiner_ is executed, and the order of these executions, for any `reduction` clause is unspecified," therefore not possible. – Hristo Iliev Mar 26 '15 at 12:20
-
Just a question: if we had a nested for loop, would the code above still hold by writing: `std::vector vec; #pragma omp parallel { #pragma omp for collapse(2) nowait schedule(static) for(int i=0; i`… – Joachim Feb 09 '20 at 15:25
-
@Joachim, I don't know. I don't normally use nested parallelism. I don't have time to look into this either. Maybe ask a question. There are many experts here that can help you. – Z boson Feb 10 '20 at 08:36
-
The question you link was talking about the fact that the "STL vector container is not thread-safe in the situation where multiple threads write to a single container". This is only true, as stated correctly there, if you call methods that can cause reallocation of the underlying array that std::vector holds. push_back(), pop_back() and insert() are examples of these dangerous methods.
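For reference, the unsafe pattern the linked thread warns about looks like this (a sketch of what not to do):
std::vector<int> vec;
#pragma omp parallel for
for (int i = 0; i < 100; i++)
    vec.push_back(i); // data race: concurrent push_back may reallocate the underlying buffer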
If you need thread-safe reallocation, then the Intel Threading Building Blocks (TBB) library offers you concurrent vector containers. You should not use tbb::concurrent_vector in single-threaded programs, because the time it takes to access random elements is higher than the time std::vector takes to do the same (which is O(1)). However, concurrent_vector calls push_back(), pop_back() and insert() in a thread-safe way, even when reallocation happens.
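As a minimal sketch of that usage (assuming TBB is installed and the program is compiled with -fopenmp and linked with -ltbb), an OpenMP loop can push into a tbb::concurrent_vector with no critical section:
#include <tbb/concurrent_vector.h>
int main() {
    tbb::concurrent_vector<int> vec;
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        vec.push_back(i); // thread safe even when the vector has to grow
    return 0;
}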
EDIT 1: Slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector.
EDIT 2: By the way, if you start using Intel Threading Building Blocks (it is open source, it works with most compilers, and it is much better integrated with C++/C++11 features than OpenMP), then you don't need to use OpenMP to create a parallel_for. Here is a nice example of parallel_for using TBB.
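For example, a sketch of the same fill using tbb::parallel_for instead of OpenMP (again assuming TBB is linked with -ltbb; this uses the integer-range overload of parallel_for):
#include <tbb/parallel_for.h>
#include <tbb/concurrent_vector.h>
int main() {
    tbb::concurrent_vector<int> vec;
    // run the loop body in parallel over the half-open range [0, 100)
    tbb::parallel_for(0, 100, [&](int i) {
        vec.push_back(i);
    });
    return 0;
}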
