In OpenMP, I can create a bunch of tasks as follows and run them asynchronously using some fixed number of threads:
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 1000; i++) {
                #pragma omp task
                f(i);
            }
        }
    }
In C++11, I can do something similar, though not quite the same, with std::async:
    // std::future needs a result type; here it is deduced from f's return type
    std::vector<std::future<decltype(f(0))>> futures;
    for (int i = 0; i < 1000; i++) {
        auto fut = std::async(f, i);
        futures.push_back(std::move(fut));
    }
    ...
    for (auto & fut : futures) {
        auto res = fut.get();
        // do something with res
    }
What I worry about is efficiency. If I am correct, in OpenMP, tasks are stored in some task pool and then distributed to threads (automatically by the OpenMP runtime).
In C++, at the moment of invoking std::async, the runtime decides whether to run f(i) asynchronously in a new thread or defer its run to the point of invoking std::future::get.
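To make the two possible behaviors concrete: when no policy is passed, std::async uses `std::launch::async | std::launch::deferred` and leaves the choice to the implementation. A minimal sketch that forces each policy explicitly and checks where the task actually ran (the helper name `ran_on_other_thread` is made up for this illustration):

```cpp
#include <future>
#include <thread>

// Hypothetical helper: returns true if the task ran on a thread
// other than the one that called std::async / future::get.
bool ran_on_other_thread(std::launch policy) {
    const auto caller = std::this_thread::get_id();
    // The task just reports the id of the thread it executes on.
    auto fut = std::async(policy, [] { return std::this_thread::get_id(); });
    return fut.get() != caller;
}
```

With `std::launch::async` the helper should report a different thread, and with `std::launch::deferred` the task runs inside `get()` on the calling thread. On GCC/Linux this typically needs `-pthread`.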
Consequently, the runtime either
- creates 1000 threads and runs them all concurrently, or
- creates fewer threads, but then some invocations of f(i) will be performed sequentially in the main thread (within the final loop).
Both these options seem to be generally less efficient than what OpenMP does (create many tasks and run them concurrently in a fixed number of threads).
Is there any way to get the same behavior as what OpenMP tasks provide with C++ threading?
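For reference, the behavior I am after could be approximated by hand with a fixed-size thread pool. The following is only a rough sketch under my own assumptions, not what any OpenMP runtime actually does: `ThreadPool`, `submit`, `square` and `sum_of_squares` are hypothetical names, and a single mutex-protected queue is much simpler than a real task scheduler.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers draining one shared queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n_threads) {
        for (std::size_t t = 0; t < n_threads; ++t)
            workers_.emplace_back([this] { worker_loop(); });
    }

    // Lets the workers finish all queued tasks, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Enqueues fn(args...) and returns a future for its result.
    template <class F, class... Args>
    auto submit(F fn, Args... args) -> std::future<decltype(fn(args...))> {
        using R = decltype(fn(args...));
        auto task = std::make_shared<std::packaged_task<R()>>(
            std::bind(std::move(fn), std::move(args)...));
        std::future<R> fut = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return fut;
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left to run
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // run the task outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Hypothetical stand-in for f(i).
int square(int i) { return i * i; }

// Runs square(0..n-1) as n tasks on n_threads workers and sums the results.
long long sum_of_squares(int n, std::size_t n_threads) {
    ThreadPool pool(n_threads);
    std::vector<std::future<int>> futures;
    for (int i = 0; i < n; ++i) futures.push_back(pool.submit(square, i));
    long long total = 0;
    for (auto& fut : futures) total += fut.get();
    return total;
}
```

Here `sum_of_squares(1000, 12)` would create 1000 tasks but only 12 threads, which is the property I am looking for. Whether something like this can match an OpenMP runtime in efficiency is exactly what I am unsure about.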
UPDATE
I did some measurements with the following code: https://wandbox.org/permlink/gLCFPr1IjTofxwQh on a 12-core Xeon E5 CPU, compiled with GCC 7.2 and -O2:
- OpenMP runtime with 12 threads: 12.2 [s]
- C++ threading runtime: 12.4 [s]
(averages from several runs). They seem to be practically the same.
However, I also tried the same with 500,000 tasks (n) and 1,000 iterations within them (m), and the times then differed significantly:
- OpenMP runtime with 12 threads: 15.1 [s]
- C++ threading runtime: 175.6 [s]
UPDATE 2
I measured how many times a new thread was created (following this answer to interpose pthread_create calls: https://stackoverflow.com/a/3709027/580083):
First experiment (20,000 tasks, 20,000 iterations within):
- OpenMP runtime with 12 threads: 11
- C++ threading runtime: 20,000
Second experiment (500,000 tasks, 1,000 iterations within):
- OpenMP runtime with 12 threads: 11
- C++ threading runtime: 32,744