
I'm trying to emplace_back into a vector in a loop with OpenMP. I took my inspiration from this post: C++ OpenMP Parallel For Loop - Alternatives to std::vector. So I wrote a test program:

// Example program
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>

#include <omp.h>

int main()
{
  std::cout << "Number of threads available : " << omp_get_max_threads() << std::endl;

  std::random_device dev;
  std::mt19937 gen(dev());
  std::uniform_int_distribution<unsigned> distrib(1, 5);

  {
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();
    std::vector<std::pair<uint32_t, uint32_t> > result;

#pragma omp declare reduction (merge : std::vector<std::pair<uint32_t, uint32_t> > : omp_out.insert(omp_out.end(), std::make_move_iterator(omp_in.begin()), std::make_move_iterator(omp_in.end())))

#pragma omp parallel for reduction(merge: result)
    for(int i=0; i<100000000; ++i)
      {
        if(distrib(gen) == 1)
          {
            result.emplace_back(std::make_pair(distrib(gen),distrib(gen)));
          }
      }
    end = std::chrono::system_clock::now();
    auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
    std::cout << "With openmp : " << elapsed_seconds << "ms\n";
  }

  {
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();
    std::vector<std::pair<uint32_t, uint32_t> > result;


    for(int i=0; i<100000000; ++i)
      {
        if(distrib(gen) == 1)
          {
            result.emplace_back(std::make_pair(distrib(gen),distrib(gen)));
          }
      }
    end = std::chrono::system_clock::now();
    auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
    std::cout << "Without openmp : " << elapsed_seconds << "ms\n";
  }
}

I compile this code with

g++ -o main -std=c++17 -fopenmp main.cpp

and the output is :

Number of threads available : 12
With openmp : 3982ms
Without openmp : 3887ms

Obviously, I don't get any speedup from my OpenMP implementation. Why?

Kafka
  • Not too familiar with OpenMP, but could it be that there is some synchronization due to the random number generator being used? Each time you request a number, an internal state needs to be changed and synced. – user975989 Jul 18 '20 at 09:03
  • Not only that, `std::vector` itself does not allow modification from multiple threads concurrently. It would be better to resize `result` so it can hold all elements first, and then use `result[i] = ...` inside the `for` loop. – G. Sliepen Jul 18 '20 at 09:04
  • I don't think so, because I see the same behaviour in my real code, without random numbers. But I can't be sure of that; it's a fair point. – Kafka Jul 18 '20 at 09:04
  • I can't reproduce your results. I get `Number of threads available : 8`, `With openmp : 5768ms`, `Without openmp : 8909ms`. Also make sure to compile with -O3 if you're looking at performance. – alex berne Jul 18 '20 at 09:24
  • Weird that you see a speedup; that should depend on the system. OK, with -O3 I see a speedup too (With openmp : 1346ms, Without openmp : 2165ms). With 12 threads, it's quite low; I will try other approaches. – Kafka Jul 18 '20 at 10:06
  • PRNGs have internal state, which makes them practically unshareable between multiple threads. When shared, you must synchronise the access, because neither `std::mt19937` nor `std::uniform_int_distribution` perform any locking. Moreover, repeatedly writing to the same memory area (the PRNG state) from more than one thread results in constant cache line invalidation, which slows things down even further. – Hristo Iliev Jul 19 '20 at 13:21

1 Answer


The current code is ill-formed with respect to the OpenMP specification: the parallelized loop contains (mostly implicit) dependencies, since every iteration reads and modifies the state of the shared gen and distrib objects. As a result, an OpenMP implementation is free to generate a fast but completely "wrong" program or a slow "correct" one.

To get a correct implementation and a not-too-bad speedup with OpenMP, one solution is to replicate the generator/distribution in each worker (by moving the variable declarations into a #pragma omp parallel section) and to protect the (sequential) emplace_back with a critical section (using #pragma omp critical).
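
A minimal sketch of that first approach, assuming the same result vector, loop bounds, and distribution as in the question, could look like this:

std::vector<std::pair<uint32_t, uint32_t> > result;

#pragma omp parallel
{
  // Thread-private generator and distribution: no shared state to race on.
  std::random_device dev;
  std::mt19937 gen(dev());
  std::uniform_int_distribution<unsigned> distrib(1, 5);

#pragma omp for
  for(int i=0; i<100000000; ++i)
    {
      if(distrib(gen) == 1)
        {
          // The shared vector must not be modified concurrently.
#pragma omp critical
          result.emplace_back(distrib(gen), distrib(gen));
        }
    }
}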

Due to possible false sharing and lock contention, the resulting parallel implementation may scale poorly. It is probably better to fill thread-private arrays and only merge the sub-arrays into one big shared vector at the end, rather than using a naive critical section (note, however, that this is still not ideal, since the computation will likely be limited by the speed of the shared memory).
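
Here is a sketch of that second approach (again assuming the same types and bounds as the question); each thread fills its own local vector, so the merge happens once per thread instead of once per element:

std::vector<std::pair<uint32_t, uint32_t> > result;

#pragma omp parallel
{
  std::random_device dev;
  std::mt19937 gen(dev());
  std::uniform_int_distribution<unsigned> distrib(1, 5);

  // Thread-private buffer: no locking in the hot loop.
  std::vector<std::pair<uint32_t, uint32_t> > local;

#pragma omp for nowait
  for(int i=0; i<100000000; ++i)
    {
      if(distrib(gen) == 1)
        {
          local.emplace_back(distrib(gen), distrib(gen));
        }
    }

  // One merge per thread; the nowait above lets each thread start merging
  // as soon as it finishes its share of the iterations.
#pragma omp critical
  result.insert(result.end(),
                std::make_move_iterator(local.begin()),
                std::make_move_iterator(local.end()));
}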

Please note that the result can differ from that of the sequential implementation when a specific seed needs to be used (here it is not a problem, since the seed is extracted from random_device).
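
If a fixed seed were required, one common workaround (a sketch only; base_seed is a hypothetical name) is to derive a distinct deterministic seed per thread. This makes parallel runs reproducible for a given thread count and schedule, but the output will still differ from the sequential version:

const unsigned base_seed = 42; // hypothetical fixed seed

#pragma omp parallel
{
  // Each thread gets a different but deterministic seed.
  std::mt19937 gen(base_seed + omp_get_thread_num());
  // ... same loop body as above ...
}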

Jérôme Richard