
This is loosely related to the question: Are std::thread pooled in C++11?. Though that question differs, the intention is the same:

Question 1: Does it still make sense to use your own (or 3rd-party library) thread pools to avoid expensive thread creation?

The conclusion in the other question was that you cannot rely on std::thread to be pooled (it might or might not be). However, std::async(launch::async) seems to have a much better chance of being pooled.

I don't think the standard requires it, but IMHO I would expect all good C++11 implementations to use thread pooling if thread creation is slow. Only on platforms where creating a new thread is inexpensive would I expect them to always spawn a new thread.

Question 2: This is just what I think, but I have no facts to prove it. I may very well be mistaken. Is it an educated guess?

Finally, here is some sample code that shows how I think thread creation can be expressed with async(launch::async):

Example 1:

 thread t([]{ f(); });
 // ...
 t.join();

becomes

 auto future = async(launch::async, []{ f(); });
 // ...
 future.wait();

Example 2: Fire and forget thread

 thread([]{ f(); }).detach();

becomes

 // a bit clumsy...
 auto dummy = async(launch::async, []{ f(); });

 // ... but I hope soon it can be simplified to
 async(launch::async, []{ f(); });

Question 3: Would you prefer the async versions to the thread versions?


The rest is no longer part of the question, but only for clarification:

Why must the return value be assigned to a dummy variable?

Unfortunately, the current C++11 standard forces you to capture the return value of std::async; otherwise the destructor of the returned future runs immediately and blocks until the action terminates. Some consider this an error in the standard (e.g., Herb Sutter).

This example from cppreference.com illustrates it nicely:

 {
   std::async(std::launch::async, []{ f(); });
   std::async(std::launch::async, []{ g(); });  // does not run until f() completes
 }
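
For contrast, here is a minimal sketch (run_both is just an illustrative name of mine; f and g are assumed to be declared elsewhere) of how keeping the returned futures alive avoids that serialization; the blocking destructors then only run at the end of the scope, so both tasks can overlap:

 #include <future>

 void f();
 void g();

 void run_both()
 {
     // Keeping the futures alive defers the blocking destructors to the
     // end of the scope, so f() and g() can run concurrently.
     auto fut_f = std::async(std::launch::async, []{ f(); });
     auto fut_g = std::async(std::launch::async, []{ g(); });
 }   // both futures are destroyed here; this waits for f() and g() to finish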

Another clarification:

I know that thread pools may have other legitimate uses but in this question I am only interested in the aspect of avoiding expensive thread creation costs.

I think there are still situations where thread pools are very useful, especially if you need more control over resources. For example, a server might decide to handle only a fixed number of requests simultaneously to guarantee fast response times and to increase the predictability of memory usage. A thread pool should work fine here.
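
To make that resource-control point concrete, below is a minimal sketch of such a fixed-size pool in C++11 (ThreadPool and submit are names I made up, not a standard or library API); at most n tasks ever run at the same time:

 #include <condition_variable>
 #include <cstddef>
 #include <functional>
 #include <mutex>
 #include <queue>
 #include <thread>
 #include <vector>

 // Minimal fixed-size thread pool: at most n tasks run simultaneously.
 class ThreadPool {
 public:
     explicit ThreadPool(std::size_t n) {
         for (std::size_t i = 0; i < n; ++i)
             workers_.emplace_back([this] { run(); });
     }

     ~ThreadPool() {
         {
             std::lock_guard<std::mutex> lock(m_);
             done_ = true;
         }
         cv_.notify_all();
         for (auto& w : workers_) w.join();
     }

     void submit(std::function<void()> task) {
         {
             std::lock_guard<std::mutex> lock(m_);
             tasks_.push(std::move(task));
         }
         cv_.notify_one();
     }

 private:
     void run() {
         for (;;) {
             std::function<void()> task;
             {
                 std::unique_lock<std::mutex> lock(m_);
                 cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                 if (done_ && tasks_.empty()) return;
                 task = std::move(tasks_.front());
                 tasks_.pop();
             }
             task();  // runs on one of the n pooled threads
         }
     }

     std::vector<std::thread> workers_;
     std::queue<std::function<void()>> tasks_;
     std::mutex m_;
     std::condition_variable cv_;
     bool done_ = false;
 };

A server could construct one such pool with a fixed n and submit each request handler to it, which bounds both the number of concurrent handlers and the memory they use.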

Thread-local variables may also be an argument for your own thread pools, but I'm not sure whether it is relevant in practice:

  • A new thread created with std::thread starts with freshly initialized thread-local variables; maybe this is not what you want.
  • For threads spawned by async, the situation is somewhat unclear to me, because the thread could have been reused. From my understanding, thread-local variables are not guaranteed to be reset, but I may be mistaken (see the sketch after this list).
  • Using your own (fixed-size) thread pools, on the other hand, gives you full control if you really need it.
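
To see what a given implementation actually does, here is a small sketch of mine (counter and task are illustrative names, not from any API); if the implementation reuses threads for async tasks and does not reset thread-local state, values greater than 1 may be printed:

 #include <future>
 #include <iostream>

 thread_local int counter = 0;  // freshly initialized per thread

 int task() {
     return ++counter;  // 1 if this task got a brand-new thread
 }

 int main() {
     for (int i = 0; i < 4; ++i) {
         auto fut = std::async(std::launch::async, task);
         std::cout << fut.get() << '\n';  // a value > 1 would indicate thread reuse
     }
 }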

– Philipp Claßen
    "However, `std::async(launch::async)` seems to have a much higher chance to be pooled." No, I believe its `std::async(launch::async | launch::deferred)` that may be pooled. With just `launch::async` the task is supposed to be launched on a new thread regardless of what other tasks are running. With the policy `launch::async | launch::deferred` then the implementation gets to choose which policy, but more importantly it gets to delay choosing which policy. That is, it can wait until a thread in a thread pool becomes available and then choose the async policy. – bames53 Sep 17 '13 at 17:45
  • As far as I know only VC++ uses a thread pool with `std::async()`. I'm still curious to see how they support non-trivial thread_local destructors in a thread pool. – bames53 Sep 17 '13 at 19:08
  • @bames53 That can be accomplished easily in their implementation using [RegisterWaitForSingleObject](http://msdn.microsoft.com/en-us/library/windows/desktop/ms685061(v=vs.85).aspx) with the "object" being the thread handle. When the thread exits, its handle will be signaled, and the callback will be queued to run in the thread pool. The callback can then call non-trivial destructors for TLS. I don't know if they do that (yet). – doug65536 Feb 09 '14 at 11:03
  • @bames53 I stepped through the libstdc++ that comes with gcc 4.7.2 and found that if the launch policy is not *exactly* `launch::async` then it treats it as if it were only `launch::deferred` and never executes it asynchronously - so in effect, that version of libstdc++ "chooses" to always use deferred unless forced otherwise. – doug65536 Feb 09 '14 at 11:08
  • @doug65536 My point about thread_local destructors was that destruction on thread exit isn't quite correct when using thread pools. When a task is run asynchronously it's run 'as if on a new thread', according to the spec, which means every async task gets its own thread_local objects. A thread pool based implementation has to take special care to ensure that tasks sharing the same backing thread still behave as if they have their own thread_local objects. Consider this program: http://pastebin.com/9nWUT40h – bames53 Feb 09 '14 at 18:51
  • @bames53 Using "as if on a new thread" in the spec was a *huge* mistake in my opinion. `std::async` could have been a beautiful thing for performance - it could have been the standard short-running-task execution system, naturally backed by a thread pool. Right now, it's just a `std::thread` with some crap tacked on to make the thread function be able to return a value. Oh, and they added redundant "deferred" functionality that overlaps the job of `std::function` completely. – doug65536 Feb 10 '14 at 01:45
  • Upvote for "the current C++11 standard forces you to capture the return value of std::async, as otherwise the destructor is executed, which blocks until the action terminates." Is it possible that the future-related features could have been implemented with shared_ptr, giving better memory management and avoiding this strict/weird destruction behavior? – Hao Xi Sep 26 '16 at 13:04
  • @doug65536: agree. If std::async just launches a new thread, there is little gain except for some small syntactic improvement. The performance benefit would come if the relatively expensive thread creation could be circumvented. – gast128 Feb 27 '20 at 20:17

1 Answer


Question 1:

I changed this from the original because the original was wrong. I was under the impression that Linux thread creation was very cheap, but after testing I determined that the overhead of a function call in a new thread vs. a normal one is enormous. The overhead of creating a thread to handle a function call is something like 10000 or more times that of a plain function call. So, if you're issuing a lot of small function calls, a thread pool might be a good idea.
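
For anyone who wants a rough reproduction of that comparison, here is a minimal C++11 sketch (not the test harness used for the numbers below; work, sink, and calls_per_second are names I made up, and the iteration counts are arbitrary):

 #include <chrono>
 #include <iostream>
 #include <thread>

 volatile int sink = 0;

 void work() { sink = sink + 1; }  // trivial body, so mostly overhead is measured

 template <typename F>
 double calls_per_second(F&& f, int iterations) {
     using clock = std::chrono::steady_clock;
     auto start = clock::now();
     for (int i = 0; i < iterations; ++i) f();
     std::chrono::duration<double> elapsed = clock::now() - start;
     return iterations / elapsed.count();
 }

 int main() {
     std::cout << "plain calls/sec:      "
               << calls_per_second([] { work(); }, 10000000) << '\n';
     std::cout << "new-thread calls/sec: "
               << calls_per_second([] { std::thread(work).join(); }, 10000) << '\n';
 }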

It's quite apparent that the standard C++ library that ships with g++ doesn't have thread pools. But I can definitely see a case for them. Even with the overhead of having to shove the call through some kind of inter-thread queue, it would likely be cheaper than starting up a new thread. And the standard allows this.

IMHO, the Linux kernel people should work on making thread creation cheaper than it currently is. But the standard C++ library should also consider using a pool to implement launch::async | launch::deferred.

And the OP is correct: using ::std::thread to launch a thread of course forces the creation of a new thread instead of using one from a pool. So ::std::async(::std::launch::async, ...) is preferred.

Question 2:

Yes, basically this 'implicitly' launches a thread. But really, it's still quite obvious what's happening, so I don't think 'implicitly' is a particularly good word for it.

I'm also not convinced that forcing you to wait for a return before destruction is necessarily an error. I don't know that you should be using the async call to create 'daemon' threads that aren't expected to return. And if they are expected to return, it's not OK to be ignoring exceptions.

Question 3:

Personally, I like thread launches to be explicit. I place a lot of value on islands where you can guarantee serial access. Otherwise you end up with mutable state that you always have to wrap a mutex around somewhere and remember to use it.

I liked the work queue model a whole lot better than the 'future' model because there are 'islands of serial' lying around so you can more effectively handle mutable state.

But really, it depends on exactly what you're doing.

Performance Test

So, I tested the performance of various methods of calling things and came up with these numbers on an 8 core (AMD Ryzen 7 2700X) system running Fedora 29 compiled with clang version 7.0.1 and libc++ (not libstdc++):

   Do nothing calls per second:   35365257                                      
        Empty calls per second:   35210682                                      
   New thread calls per second:      62356                                      
 Async launch calls per second:      68869                                      
Worker thread calls per second:     970415                                      

And native, on my MacBook Pro 15" (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz) with Apple LLVM version 10.0.0 (clang-1000.10.44.4) under OSX 10.13.6, I get this:

   Do nothing calls per second:   22078079
        Empty calls per second:   21847547
   New thread calls per second:      43326
 Async launch calls per second:      58684
Worker thread calls per second:    2053775

For the worker thread, I started up a thread, then used a lockless queue to send requests to another thread and waited for an "It's done" reply to be sent back.
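
Roughly, that worker-thread setup has the shape of the sketch below. It is my own reconstruction using a mutex and condition variable rather than the lockless queue from the actual test, so it illustrates only the structure of the measurement, not its exact cost (Worker and call are illustrative names):

 #include <condition_variable>
 #include <mutex>
 #include <thread>

 // One long-lived worker; the caller hands over a request and waits for the
 // "it's done" reply, mirroring the measurement described above.
 struct Worker {
     Worker() : thread_([this] { loop(); }) {}

     ~Worker() {
         { std::lock_guard<std::mutex> lock(m_); quit_ = true; }
         cv_.notify_all();
         thread_.join();
     }

     void call() {  // send one request and block until the worker finishes it
         std::unique_lock<std::mutex> lock(m_);
         has_job_ = true;
         job_done_ = false;
         cv_.notify_all();
         cv_.wait(lock, [this] { return job_done_; });
     }

 private:
     void loop() {
         std::unique_lock<std::mutex> lock(m_);
         for (;;) {
             cv_.wait(lock, [this] { return has_job_ || quit_; });
             if (quit_) return;
             has_job_ = false;
             // ... the (empty) work would go here ...
             job_done_ = true;
             cv_.notify_all();
         }
     }

     std::mutex m_;
     std::condition_variable cv_;
     bool has_job_ = false, job_done_ = false, quit_ = false;
     std::thread thread_;  // declared last so the other members exist before it starts
 };

Timing repeated Worker::call() invocations against one std::thread(f).join() per call gives the same general shape as the table above, just without the lockless-queue optimization.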

The "Do nothing" is just to test the overhead of the test harness.

It's clear that the overhead of launching a thread is enormous. And even the worker thread with the inter-thread queue slows things down by a factor of 20 or so on Fedora 25 in a VM, and by about 8 on native OS X.

I created an OSDN chamber holding the code I used for the performance test. It can be found here: https://osdn.net/users/omnifarious/pf/launch_thread_performance/

– Omnifarious
  • I concur on the work-queue model; however, this requires having a "pipeline" model, which may not be applicable to every use of concurrent access. – Matthieu M. Jan 16 '13 at 07:36
  • @MatthieuM.: I've been working on a library to sort of combine them. You can put something on a work-queue for another thread that results in something being queued up for the original thread's work-queue when it's finished. That sort of looks like a future. – Omnifarious Jan 16 '13 at 07:55
  • I see this more as an asynchronous acknowledgement. Looks nice indeed, hope you'll publicize it. – Matthieu M. Jan 16 '13 at 08:07
  • @MatthieuM.: I want to implement auto-wrapping for functions so you can use it to write whole expressions whose evaluation is delayed until all the results they need are available. That's what's been blocking me from having something I consider release ready, and why I've been asking all those weird complicated template questions. – Omnifarious Jan 16 '13 at 08:13
  • Looks to me like expression templates (for operators) could be used to compose the results; for function calls you would need a *call* method, I guess, but because of overloads it might be slightly more difficult. – Matthieu M. Jan 16 '13 at 08:16
  • @MatthieuM.: http://codereview.stackexchange.com/questions/20665/could-this-deferred-execution-scheme-be-any-simpler – Omnifarious Jan 18 '13 at 05:32
  • "very cheap" is relative to your experience. I find Linux thread creation overhead to be _substantial_ for my usage. – Jeff Hammond Dec 23 '14 at 01:45
  • I have a public domain one that uses state-of-the-art C++11, and it adds about 50 microseconds of overhead from pool.enqueue to the actual work being done, which for me is waaay too much. Is there anything in the 5-10 microsecond range? – Ivan Mar 28 '15 at 15:49
  • "I know that in Linux thread creation has been made very cheap" - can I get some resource to read about this? – Abhinav Gauniyal Aug 20 '16 at 09:35
  • @AbhinavGauniyal - My information on that was rather old and came from around 2003, when Ulrich Drepper got a bug up his butt and implemented the `futex` system call and generally improved Linux threading significantly. Also: http://www.cs.utexas.edu/~witchel/372/lectures/POSIX_Linux_Threading.pdf – Omnifarious Mar 09 '17 at 07:33
  • @Jeff - I thought it was a lot cheaper than it is. I updated my answer a while ago to reflect a test I did to discover the actual cost. – Omnifarious Mar 09 '18 at 14:44
  • In the first part, you are somewhat underestimating how much has to be done to create a thread, and how little has to be done to call a function. A function call and return is a few CPU instructions that manipulate a few bytes on the top of the stack. Thread creation means: 1. allocating a stack, 2. performing a syscall, 3. creating data structures in the kernel and linking them up, grabbing locks along the way, 4. waiting for the scheduler to execute the thread, 5. switching context to the thread. Each of these steps in itself takes *much* longer than the most complex function calls. – cmaster - reinstate monica Mar 09 '18 at 15:10
  • As such, I find it a bit thick to say "the Linux kernel people should work on making thread creation cheaper than it currently is". Those guys are working hard to deliver a performance that other OSes can only dream of, but they have to stay within the specs too. Thread creation *is* a costly operation by definition. – cmaster - reinstate monica Mar 09 '18 at 15:12
  • @cmaster - Why does a lot have to be done to create a thread? There are architectures in which creating and destroying threads of execution by the hundreds or thousands is the norm. Maybe it's the concept of what a thread is that needs revisiting. Maybe there should be a CPU change that lets you use one instruction to recruit another core so you can split your medium-sized (1000-50000 iteration) loop in half. All of these things are possible. – Omnifarious Mar 09 '18 at 15:24
  • @Omnifarious A function call takes about 20 CPU cycles (you measured less because some of the overhead was hidden behind your test harness). A memory allocation can easily take 200 CPU cycles. A syscall is no less than about 200ns. Grabbing a lock is inter-thread communication, which needs to be performed within the kernel; expect something around a microsecond. And I have not yet started on the overhead of setting page tables or flushing the TLB. If specialized hardware allows for faster thread creation, it's because the hardware is optimized for that; X86 CPUs are not. – cmaster - reinstate monica Mar 09 '18 at 15:32
  • @cmaster - Grabbing a lock, on Linux, requires no system call at all. There is only a system call if there's contention. I've tested this with a highly contended int counter wrapped in a `::std::mutex`. In counting to 10000 with two threads there were something like 10 calls to the `futex` system call. The Intel people listen to the Linux people. If a CPU change would help, it can happen. Why do you need to do anything with the TLB? – Omnifarious Mar 09 '18 at 15:37
  • I was talking about the locks that need to be grabbed **within** the kernel. Remember, the kernel itself is a parallel application. There are some overheads in the hardware itself that make even atomic operations for inter-thread communication significantly slower than normal memory accesses. The TLB comes into play if a thread is started on a different core, where the preceding time slot was used by something other than your process. We are talking about parallel here, after all. I don't know if linux can optimize the TLB flush away when switching between threads of the same process, though. – cmaster - reinstate monica Mar 09 '18 at 15:54
  • @cmaster - I think there are a lot of 'out-of-the-box' possibilities. I'm not trying to imply that the kernel people are incompetent. I suspect thread creation on Linux is cheaper than NT, for example. No, I just think some really creative thinking applied to this problem might yield some very interesting and impressive results. I suppose saying "they need to make thread creation faster" doesn't really get that across. – Omnifarious Mar 09 '18 at 18:36
  • Now I think we are really starting to agree :-) Anyway, I think that current X86 CPUs are not really built for either fast thread switching or fast thread creation. They are built for optimal benchmark performance, which means that anything as costly as thread creation won't really be optimized for, because it will not show up in the benchmarks in the first place. It would be interesting to know what could be achieved if we were to start over with CPU instruction set design - I have a feeling that we would be able to produce much better results in multi-threading and virtualization. – cmaster - reinstate monica Mar 09 '18 at 23:55
  • @cmaster - One idea would be an instruction to start a thread that took advantage of some sort of 'free core' state maintained in shared memory that could only be accessed or modified via this instruction if the CPU was in user mode. If there weren't any free cores, it would leave a fast path and revert to being a system call. That could make auto-parallelizing loops with only a thousand iterations a performance win. – Omnifarious Sep 10 '18 at 03:42
  • @Omnifarious your bitbucket link is not available. Can you add your code to GitHub and update the link? – moi Jan 06 '22 at 12:58
  • @moi - It is on Github. But, as a matter of personal choice, I don't generally link to Github since they don't support Mercurial, and I use Mercurial for all of my personal version control. I'll link to it on OSDN though. – Omnifarious Jan 10 '22 at 21:47