
In OpenMP, I can create a bunch of tasks as follows and run them asynchronously using some fixed number of threads:

#pragma omp parallel
{
   #pragma omp single
   {
      for (int i = 0; i < 1000; i++) {
         #pragma omp task
         f(i);
      }
   }
}

In C++11, I can do something not quite the same with std::async:

std::vector<std::future<int>> futures; // std::future needs the result type; assuming int f(int)
for (int i = 0; i < 1000; i++) {
   auto fut = std::async(f, i);
   futures.push_back(std::move(fut));
}
...
for (auto & fut : futures) {
  auto res = fut.get();
  // do something with res
}

What I worry about is efficiency. If I am correct, in OpenMP, tasks are stored in some task pool and then distributed to threads (automatically by the OpenMP runtime).

In C++, at the moment of invoking std::async, the runtime decides whether to run f(i) asynchronously in a new thread or defer its run to the point of invoking std::future::get.

Consequently, a runtime either

  1. creates 1000 threads and runs them all concurrently,
  2. or creates fewer threads, but then some invocations of f(i) will be performed sequentially in the main thread (within the final loop).

Both these options seem to be generally less efficient than what OpenMP does (create many tasks and run them concurrently in a fixed number of threads).

Is there any way to get the same behavior as what OpenMP tasks provide with C++ threading?

UPDATE

I did some measurements with the following code: https://wandbox.org/permlink/gLCFPr1IjTofxwQh on a 12-core Xeon E5 CPU, compiled with GCC 7.2 and -O2:

  1. OpenMP runtime with 12 threads: 12.2 [s]
  2. C++ threading runtime: 12.4 [s]

(averages from several runs). They seem to be practically the same.

However, I also tried the same with 500,000 tasks (n) and 1,000 iterations within them (m) and the times then differed significantly:

  1. OpenMP runtime with 12 threads: 15.1 [s]
  2. C++ threading runtime: 175.6 [s]

UPDATE 2

I measured how many times a new thread was created (by interposing pthread_create calls, following this answer: https://stackoverflow.com/a/3709027/580083):

First experiment (20,000 tasks, 20,000 iterations within):

  1. OpenMP runtime with 12 threads: 11
  2. C++ threading runtime: 20,000

Second experiment (500,000 tasks, 1,000 iterations within):

  1. OpenMP runtime with 12 threads: 11
  2. C++ threading runtime: 32,744
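For reference, the interposition technique from the linked answer boils down to an `LD_PRELOAD` shim along these lines (a sketch for Linux/glibc; compile with `gcc -shared -fPIC shim.c -o shim.so -ldl` and run the program under test with `LD_PRELOAD=./shim.so`):

```c
// Counts pthread_create calls made by the preloaded program.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int thread_count; // zero-initialized

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg) {
    static create_fn real_create = NULL;
    if (!real_create) // look up the real implementation once
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");
    atomic_fetch_add(&thread_count, 1);
    return real_create(thread, attr, start, arg);
}

__attribute__((destructor))
static void report(void) { // runs at program exit
    fprintf(stderr, "pthread_create called %d times\n", thread_count);
}
```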
Daniel Langr
  • Implementations can choose to implement `std::async()` using a thread pool. It seems libstdc++ uses a thread pool (based on performance measurements rather than looking at the code). – Dietmar Kühl Dec 19 '17 at 21:01
  • @DietmarKühl can you share your performance measurements about that. Seems very interesting. – Zulan Dec 19 '17 at 22:27
  • @Zulan: It isn't particularly exciting: when looking at the data for my [Parallel Algorithms](https://github.com/dietmarkuehl/presentation-files/blob/master/parallel-algorithms/README.md) presentation, especially the hand-crafted `for_each` using `std::thread` and `std::async()`, you can see that its performance resembles that of using a thread pool. Other implementations behaved quite differently at the time I last tested the behavior. – Dietmar Kühl Dec 19 '17 at 22:53
  • If you are interested in measuring the performance of C++ tasking systems, then you may also want to look at Threading Building Blocks (TBB) https://www.threadingbuildingblocks.org/ (despite Intel branding it is Apache licensed and runs on many different architectures (ARM, Power, SPARC, ...) – Jim Cownie Dec 21 '17 at 09:38
  • @DietmarKühl According to my 2nd update, it doesn't seem that libstdc++ uses a thread pool. I don't think it's even possible, see Zulan's answer. An implementation can **defer invocation or the selection of the policy**, but then it still either needs to create a new thread or run the task synchronously. – Daniel Langr Jan 05 '18 at 07:26

2 Answers


Your analysis is quite good, but I think there is a loophole for threadpools in std::async.

OpenMP does use a fixed, user-controlled amount of threads that execute tasks quite flexibly. untied tasks can even move between threads, although that doesn't seem to be well-supported in practice.

Yes, as per the C++11 standard, the implementation must choose either std::launch::async or std::launch::deferred. The former must create a std::thread object, while the latter will execute the task's code in the thread calling wait. However, the standard leaves a Note (emphasis mine):

If this policy is specified together with other policies, such as when using a policy value of launch::async | launch::deferred, implementations should defer invocation or the selection of the policy when no more concurrency can be effectively exploited.

To be honest, I fail to see how the standard wording besides that note would allow an implementation to defer the decision - but the standard seems to actually encourage thread pools! If the decision to choose launch::async is deferred, it means that the required new std::thread could reuse an existing thread of execution - at least I don't see why not.

Originally I thought that std::thread could also be implemented as green threads which sort-of also means a thread pool. However, the standard remarks that threads [managed by <thread>] are intended to map one-to-one with operating system threads.

At the end of the day, measure to verify your performance. There may be a very bad OpenMP implementation or a very clever standard library implementation.

This answer to a similar question, presents some measurement results that indicate high overhead of std::async and also shares the measurement code.

Zulan
  • That note from the Standard answers my question, thanks for finding it. I even found a similar note on http://www.cplusplus.com/reference/future/async/: "_The function chooses the policy automatically (**at some point**)_". I mostly use http://en.cppreference.com, where no such statement is provided. – Daniel Langr Dec 20 '17 at 08:01

Both these options seem to be generally less efficient than what OpenMP does (create many tasks and run them concurrently in a fixed number of threads).

The thing is, you actually don't know that that is what OpenMP is doing. OpenMP is an API that lets you specify "this task can be parallelized" and provide guides on how the implementation should try to parallelize it. It doesn't expose the low-level details of that implementation unless you specify them yourself. It might try to allocate a fixed number of threads to execute batches of tasks in each thread, OR it might create threads for each individual task. It might even forgo allocating any threads at all if the implementation thinks your target machine is incapable of benefiting from the parallelism.

This is both a strength and a weakness of OpenMP: it will try to do what it thinks is best for your particular task, which may mean optimizing the runtime of your program, but it might also occasionally be wrong about what actually is best. std::async is, in most respects, in the same boat: the implementation (and compiler) will make efforts to guess what the most efficient execution of your code will be, but there's always a chance it's wrong. So std::async, too, is often implemented to use a dedicated (fixed-size) thread pool. It might split each task into its own thread, or execute everything on a single thread.

If you have expectations about the precise implementation of your multithreaded code, you need to specify those expectations using lower-level concepts and objects like std::thread. If you want to let your compiler assume what's best (and the compiler is often right), std::async and OpenMP are probably comparable.

And of course, it's important to measure: your situation might be one where OpenMP has the best idea about what's fastest. Or maybe not. The only way you'll know for sure is if you implement both versions and measure which is fastest.

Xirema
  • I mostly agree, especially with the necessity of measuring. However, as for _"It might try to allocate a fixed number of threads to execute batches of tasks in each thread, OR it might create threads for each individual task."_, I don't think this is correct. The number of threads is specified for the `parallel` section (e.g., from outside by the `OMP_NUM_THREADS` env variable) and is therefore fixed. – Daniel Langr Dec 19 '17 at 21:38
  • @DanielLangr Yeah, but no part of your code specifies a value for `OMP_NUM_THREADS`. So you don't know what that value is actually going to be for your implementation. – Xirema Dec 19 '17 at 21:40
  • This value can be specified by the user, it's environment variable, not C++ variable. Or the number of threads is typically set by default to the number of CPU cores / hardware threads. My point is that the number of threads in OpenMP is fixed and can be controlled by user. – Daniel Langr Dec 19 '17 at 21:43
  • @Daniel is almost right, but if you read the OpenMP standard (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf) you'll find Algorithm 2.1 which says how many threads the OpenMP runtime is allowed to use. In many cases the answer comes out as "implementation defined", even if you did make an explicit request. – Jim Cownie Dec 21 '17 at 09:35
  • @JimCownie In many cases? I see only 1 case there, which applies **only if** _ThreadsRequested > ThreadsAvailable_. For instance, on my Linux server _ThreadsAvailable_ is 2147483647, therefore I can reasonably expect that I can truly control the number of threads, e.g., via `num_threads` clause expression. – Daniel Langr Dec 21 '17 at 10:01
  • @Daniel there are all of the nested and dyn-var cases, aside from the more general issue that the value of ThreadsAvailable on your machine is not guaranteed anywhere else. (So, as ever, it's easy to write code that does what you expect on one machine with a specific compiler. Harder if you want portability [which could just mean to the next machine you happen to buy].) – Jim Cownie Dec 21 '17 at 16:53
  • @JimCownie I think that you can really expect that on each machine the OpenMP runtime will be able to create at least as many threads as is the hardware concurrency (number of cores / hardware threads). This is enough for thread pooling, which my question was about. (If not, user can limit the number of threads to a lower number, even 1, and program should work as well.) – Daniel Langr Dec 22 '17 at 07:43