
I feel like I'm missing something here...

I slightly altered some code to change from using std::thread to std::async and noticed a substantial performance increase. I wrote up a simple test which I assumed would run nearly identically using std::thread as it does using std::async.

std::atomic<int> someCount{0};
const int THREADS = 200;
std::vector<std::thread> threadVec(THREADS);
std::vector<std::future<void>> futureVec(THREADS);
auto lam = [&]()
{
    for (int i = 0; i < 100; ++i)
        someCount++;
};

for (int i = 0; i < THREADS; ++i)
    threadVec[i] = std::thread(lam);
for (int i = 0; i < THREADS; ++i)
    threadVec[i].join();

for (int i = 0; i < THREADS; ++i)
    futureVec[i] = std::async(std::launch::async, lam);
for (int i = 0; i < THREADS; ++i)
    futureVec[i].get();

I didn't get too deep into the analysis, but some preliminary results made it seem that the std::async code ran around 10x faster. Results varied slightly with optimizations off; I also tried switching the execution order.

Is this some Visual Studio compiler issue? Or is there some deeper implementation issue I'm overlooking that would account for this performance difference? I thought that std::async was a wrapper around std::thread?


Also, considering these differences, what would be the way to get the best performance here? (std::thread and std::async are not the only ways to create threads.)

And what if I wanted detached threads? (std::async can't do that, as far as I'm aware.)
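By detached threads I mean something along the lines of this minimal sketch using std::thread::detach:

#include <chrono>
#include <thread>

int main()
{
    std::thread worker([] {
        // ... background work that nobody will join() ...
    });
    worker.detach();   // the thread now runs independently of this handle

    // Caveat: the process must outlive the detached work (or synchronize
    // some other way); sleeping here is just a crude placeholder.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}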

Ace24713
  • If you have more than thread::hardware_concurrency() threads, you no longer get true concurrency, and your OS has to manage the overhead of context switching. By the way, did you try adding yield() in the threaded loop? – Christophe Nov 04 '14 at 08:23
  • Yes, the example is exaggerated - I did that to see how 'equivalent' the two calls were. I've still noticed a difference with < 10 threads running at a time. And no, I have not put any yield()s in... Where do you propose I add it, and what might it do here? – Ace24713 Nov 04 '14 at 17:21
  • In the loop of your lambda function. The goal is to ease context switching. It will not magically get rid of your software-thread overhead, but it could perhaps smooth out some bottleneck effects. – Christophe Nov 04 '14 at 18:08

2 Answers


When you use std::async you are not creating new threads; instead, you reuse the ones available in a thread pool. Creating and destroying threads is a very expensive operation, requiring about 200,000 CPU cycles on Windows. On top of that, remember that having a number of threads much bigger than the number of CPU cores means that the operating system has to spend more time creating and scheduling them to share the available CPU time between the cores.
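To illustrate the idea, here is a minimal sketch of a fixed-size pool that reuses a set of worker threads (this is not MSVC's actual std::async implementation, just the general technique):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size pool: workers are created once and then reused, so
// enqueueing a task costs a lock plus a notify instead of a thread creation.
class SimplePool
{
public:
    explicit SimplePool(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }

    ~SimplePool()
    {
        {
            std::lock_guard<std::mutex> lg(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers)
            w.join();
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lg(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty())
                    return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // run outside the lock
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

With a pool like this, running 200 short tasks costs one lock-plus-notify per task instead of 200 thread creations, which is consistent with the small unique-thread-id counts measured below.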

UPDATE: To show that the number of threads used with std::async is a lot smaller than with std::thread, I modified the testing code to count the number of unique thread ids used when run either way, as below. On my PC the results are:

Number of threads used running std::threads = 200
Number of threads used to run std::async = 4

The number of threads used with std::async varies from 2 to 4 on my PC. It basically means that std::async reuses threads instead of creating new ones every time. Curiously, if I increase the computing time of the lambda by replacing 100 with 1000000 iterations in the for loop, the number of async threads increases to 9, but using raw threads it always gives 200. It's worth keeping in mind that "Once a thread has finished, the value of std::thread::id may be reused by another thread".
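That caveat is easy to reproduce in isolation: even with plain std::thread, sequentially created-and-joined threads can report fewer unique ids than threads created, because ids of joined threads get recycled (a small sketch; the exact count is implementation dependent):

#include <iostream>
#include <thread>
#include <unordered_set>

int main()
{
    std::unordered_set<std::thread::id> ids;
    for (int i = 0; i < 200; ++i)
    {
        std::thread t([] {});
        ids.insert(t.get_id());
        t.join();   // after join, the id may be handed to a later thread
    }
    // May print fewer than 200 because ids can be recycled.
    std::cout << ids.size() << " unique ids for 200 sequential threads\n";
}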

Here is the testing code:

#include <atomic>
#include <vector>
#include <future>
#include <thread>
#include <mutex>
#include <unordered_set>
#include <iostream>

int main()
{
    std::atomic<int> someCount{0};
    const int THREADS = 200;
    std::vector<std::thread> threadVec(THREADS);
    std::vector<std::future<void>> futureVec(THREADS);

    // Unique thread ids observed by each mechanism; the mutex protects
    // the sets against concurrent insertion.
    std::unordered_set<std::thread::id> uniqueThreadIdsAsync;
    std::unordered_set<std::thread::id> uniqueThreadsIdsThreads;
    std::mutex mutex;

    auto lam = [&](bool isAsync)
    {
        for (int i = 0; i < 100; ++i)
            someCount++;

        auto threadId = std::this_thread::get_id();
        if (isAsync)
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadIdsAsync.insert(threadId);
        }
        else
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadsIdsThreads.insert(threadId);
        }
    };

    for (int i = 0; i < THREADS; ++i)
        threadVec[i] = std::thread(lam, false);
    for (int i = 0; i < THREADS; ++i)
        threadVec[i].join();
    std::cout << "Number of threads used running std::threads = " << uniqueThreadsIdsThreads.size() << std::endl;

    // std::launch::async matches the question's code; without an explicit
    // policy the implementation may choose std::launch::deferred and run
    // the task on the calling thread at get(), skewing the count.
    for (int i = 0; i < THREADS; ++i)
        futureVec[i] = std::async(std::launch::async, lam, true);
    for (int i = 0; i < THREADS; ++i)
        futureVec[i].get();
    std::cout << "Number of threads used to run std::async = " << uniqueThreadIdsAsync.size() << std::endl;
}
Darien Pardinas
  • @Christophe, not much evidence that the internal implementation is a thread pool, I admit, but it at least proves thread reuse when using `std::async` – Darien Pardinas Aug 18 '17 at 14:22

As all your threads try to update the same atomic<int> someCount, the performance degradation could just as well be linked to contention (the atomic makes sure that all concurrent accesses are sequentially ordered). The consequence could be that:

  • the threads spend their time waiting,
  • but they consume CPU cycles anyhow,
  • so your system throughput is wasted.

With async() it would then be sufficient that some variations in the scheduling occur, which could result in a significant reduction of contention and an increase in throughput. For example, the standard says that with launch::async the function object would be executed "as if in a new thread of execution represented by a thread object ...". It doesn't say that it must be a dedicated thread (so it can be -- but doesn't have to be -- a thread pool). Another hypothesis could be that the implementation takes a more relaxed approach to scheduling, because nothing says that the thread needs to be executed immediately (the constraint, however, is that it's executed before the get()).
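As a side note, the "not necessarily immediate" behaviour is easy to see with the std::launch::deferred policy, where the callable only runs when get() is called, on the calling thread (a small sketch, unrelated to the benchmark itself):

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::cout << "caller thread: " << std::this_thread::get_id() << "\n";

    // With launch::deferred, nothing runs until get()/wait() is called,
    // and the callable then runs synchronously on the calling thread.
    auto fut = std::async(std::launch::deferred, [] {
        std::cout << "task thread:   " << std::this_thread::get_id() << "\n";
    });

    fut.get();  // prints the same id as the caller
}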

Recommendation

Benchmarking should be done with separation of concerns in mind. So, for multithreading performance, inter-thread synchronisation should be avoided as much as possible.
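For example, a contention-free version of the benchmarked work would give each thread its own counter and combine them only after joining (a sketch of the idea, not the code measured here):

#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const int THREADS = 200;
    std::vector<std::thread> threads;
    std::vector<long long> counts(THREADS, 0);  // one slot per thread, no sharing

    for (int t = 0; t < THREADS; ++t)
        threads.emplace_back([&counts, t] {
            // Each thread writes only its own slot: no atomic, no contention.
            for (int i = 0; i < 100; ++i)
                ++counts[t];
        });

    for (auto& th : threads)
        th.join();

    long long total = 0;
    for (long long c : counts)
        total += c;   // combine once, after all threads have joined
    std::cout << "total = " << total << "\n";
}

(A real benchmark would additionally pad each slot to its own cache line to avoid false sharing, but that is beside the point being made here.)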

Keep in mind that if you have more than thread::hardware_concurrency() threads active, there is no true concurrency anymore, and the OS has to manage the overhead of context switching.

Edit: Some experimental feedback (2)

With a lam loop of 100, the benchmark results I measure are not usable because of the magnitude of error linked to the Windows clock resolution of 15 ms.

Test case            Thread      Async
   10 000 loop          78          31
1 000 000 loop        2743        2670    (the longer the work, the smaller the difference)
   10 000 + yield()    500        1296    (many more context switches)

When increasing the number of THREADS, the timing evolves proportionally, but only for the test cases with short work. This suggests that the observed difference is in fact related to an overhead at thread creation rather than to the execution of the threads themselves.
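For reference, here is a possible shape for such a timing harness; on recent implementations std::chrono::steady_clock is backed by QueryPerformanceCounter, while older Visual Studio versions may need the Win32 API directly (a sketch, not the exact code used for the table above):

#include <chrono>
#include <iostream>

// Measure one scenario (e.g. "launch and join 200 threads") in milliseconds.
template <class F>
long long timeMs(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();   // the scenario being measured
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main()
{
    std::cout << timeMs([] { /* launch + join threads here */ }) << " ms\n";
}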

In a second experiment, I added code to count the number of threads that are really involved, based on a vector storing this_thread::get_id() for each execution:

  • For the thread version, no surprise: there are always 200 created (here).
  • Very interestingly, the async() version uses between 8 and 15 threads in the case of shorter work, but shows an increasing number of threads (up to 131 in my tests) when the work becomes longer.

This suggests that async does not use a traditional thread pool (i.e. one with a limited number of threads) but rather reuses threads if they have already finished their work. This of course reduces the overhead, especially for smaller tasks. (I updated my initial answer accordingly.)

Christophe
  • I mostly threw in the atomic to prevent optimization from discarding the whole thing, but I changed it to increment with relaxed ordering and got some improved results on both ends - so thanks for that! - but still async beats thread by miles. The thread pool idea sounds correct given the timing, and your yield results are interesting. (And on benchmarking with Windows: use QueryPerformanceCounter and you will get much better resolution.) – Ace24713 Nov 04 '14 at 23:04
  • Yes! It puzzled me as well, and I just edited the answer with some additional observations. – Christophe Nov 04 '14 at 23:09
  • A thread pool will beat std::async by miles. Most tasks in a thread pool will be executed as fast as a synchronous function call in the main thread, whereas std::async, although faster than std::thread, is costlier than a plain function call. If inter-thread synchronization is to be used, it is better to use a single thread instead and launch the tasks as serialized packages. – ark1974 May 12 '17 at 17:25