7

I'd like to parallelize the following piece of code, but I'm new to OpenMP and to writing parallel code.

std::vector<DMatch> good_matches;
for (int i = 0; i < descriptors_A.rows; i++) {
   if (matches_RM[i].distance < 3 * min_dist) {
      good_matches.push_back(matches_RM[i]);
   }
}

I have tried

std::vector<DMatch> good_matches;
#pragma omp parallel for
for (int i = 0; i < descriptors_A.rows; i++) {
   if (matches_RM[i].distance < 3 * min_dist) {
      good_matches[i] = matches_RM[i];
   }
}

and

std::vector<DMatch> good_matches;
cv::DMatch temp;
#pragma omp parallel for
for (int i = 0; i < descriptors_A.rows; i++) {
   if (matches_RM[i].distance < 3 * min_dist) {
      temp = matches_RM[i];
      good_matches[i] = temp;
      // AND ALSO good_matches.push_back(temp);
   }
}

I have also tried

#pragma omp critical
good_matches.push_back(matches_RM[i]);

This clause works but does not speed anything up. It may be that this loop simply cannot be sped up, but it would be great if it can. I'd also like to speed up the following:

std::vector<Point2f> obj, scene;
for (int i = 0; i < good_matches.size(); i++) {
   obj.push_back(keypoints_A[good_matches[i].queryIdx].pt);
   scene.push_back(keypoints_B[good_matches[i].trainIdx].pt);
}

Apologies if this question has been answered already, and thank you very much to anyone who can help.

Mysticial
Alfonz Ulrich

3 Answers

8

I showed how to do this here: c-openmp-parallel-for-loop-alternatives-to-stdvector

Make private versions of the std::vector and fill the shared std::vector in a critical section like this:

std::vector<DMatch> good_matches;
#pragma omp parallel
{
    std::vector<DMatch> good_matches_private;
    #pragma omp for nowait
    for (int i = 0; i < descriptors_A.rows; i++) {
       if (matches_RM[i].distance < 3 * min_dist) {
          good_matches_private.push_back(matches_RM[i]);
       }
    }
    #pragma omp critical
    good_matches.insert(good_matches.end(), good_matches_private.begin(), good_matches_private.end());
}
Z boson
2

One possibility may be to use private vectors for each thread and combine them in the end:

#include<omp.h>

#include<algorithm>
#include<iterator>
#include<iostream>
#include<vector>

using namespace std;

int main()
{
  vector<int> global_vector;  
  vector< vector<int> > buffers;

  #pragma omp parallel
  {
    auto nthreads = omp_get_num_threads();
    auto id = omp_get_thread_num();
    //
    // Correctly set the number of buffers
    //
  #pragma omp single
    {
      buffers.resize( nthreads );
    }
    //
    // Each thread works on its chunk
    // If order is important maintain schedule static
    //
  #pragma omp for schedule(static)
    for(size_t ii = 0; ii < 100; ++ii) {      
      if( ii % 2 != 0 ) { // Any other condition will do
          buffers[id].push_back(ii);
      }
    }
    //
    // Combine buffers together
    //
    #pragma omp single
    {
      for( auto & buffer : buffers) {
        move(buffer.begin(),buffer.end(),back_inserter(global_vector));
      }
    }
  }
  //
  // Print the result
  //
  for( auto & x : global_vector) {
    cout << x << endl;
  }    
  return 0;
}

The actual speed-up depends only on the amount of work done inside each loop.

Massimiliano
  • Although using private vectors is the way to go, you make it unnecessarily complicated. There is no need to use run-time functions and create the vectors in a single call. Not only that, but it could have cache issues. – Z boson Jul 16 '14 at 07:24
  • @Zboson Usually I go for the safest solution, then for the fastest. My snippet constructs a `global_vector` which is exactly equal to the one constructed in the serial case (not a permutation) – Massimiliano Jul 16 '14 at 08:23
  • Oh, I see what you mean. Your solution preserves the order but mine does not necessarily. I'm not sure the order matters for the OP but it might. – Z boson Jul 16 '14 at 08:30
  • +1, although the fact that the OP says he got the right answer using critical indicates that the order is not important your answer is still interesting nevertheless. – Z boson Jul 21 '14 at 08:41
  • Note that this solution has a significant performance issue: `buffers[id].push_back(ii)` will cause false sharing, as the *private* `vector`s (their 3 pointers, not the actual data) of multiple threads are on the same cache line. See also https://stackoverflow.com/a/43064331/620382 for a discussion. – Zulan Nov 14 '17 at 10:09
2

TBB's concurrent_vector acts much like std::vector, but allows parallel calls to push_back.

Arch D. Robison
  • Could it open compatibility problems using `tbb` containers inside `OpenMP` work-sharing constructs? – Massimiliano Jul 16 '14 at 20:07
  • @Massimiliano, maybe he's suggesting to use TBB instead of OpenMP? – Z boson Jul 17 '14 at 07:09
  • 1
    TBB concurrent containers have no dependencies on TBB tasking, and thus can be used with OpenMP or with PThreads. That was a deliberate design decision from the start of TBB. – Arch D. Robison Jul 17 '14 at 14:01