
I have a large number (>>100K) of tasks with very high latency (minutes) and very little resource consumption. Potentially they could all be executed in parallel, so I was considering using std::async to generate one future for each task.

My question is: what is the maximum number of threads that std::async will create and execute asynchronously? (using g++ 6.x on Ubuntu 16.x or CentOS 7.x, x86_64)

It is important for me to get that number right, because if I do not have enough tasks actually running (waiting) in parallel, the cumulative cost of latency will be very high: 100K two-minute tasks take about 20 minutes at 10K-way concurrency, but roughly 33 hours at 100-way concurrency.

To get to an answer, I started by checking the capabilities of the system:

bob@vb:~/programming/cxx/async$ ulimit -u
43735
bob@vb:~/programming/cxx/async$ cat /proc/sys/kernel/threads-max 
87470
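
For completeness, the same limits can be read from inside a program. A minimal sketch, assuming Linux (getrlimit(RLIMIT_NPROC) reports the same value as ulimit -u):

#include <sys/resource.h>
#include <fstream>
#include <iostream>

int main()
{
    // The soft RLIMIT_NPROC limit is what `ulimit -u` prints.
    rlimit rl{};
    if (0 == getrlimit(RLIMIT_NPROC, &rl))
        std::cout << "RLIMIT_NPROC (soft): " << rl.rlim_cur << std::endl;
    // threads-max is a Linux-specific /proc entry.
    std::ifstream in("/proc/sys/kernel/threads-max");
    unsigned long threadsMax = 0;
    if (in >> threadsMax)
        std::cout << "threads-max: " << threadsMax << std::endl;
}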

From these numbers, I was expecting to be able to get on the order of 43K threads running (mostly waiting) in parallel. To verify that, I wrote the program below, which checks the number of distinct thread ids and measures the time required to make 100K std::async calls with an empty task:

#include <thread>
#include <future>
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <string>
#include <cstdlib>

// Each task simply reports the id of the thread it ran on.
std::thread::id foo()
{
    using namespace std::chrono_literals;
    //std::this_thread::sleep_for(2s);
    return std::this_thread::get_id();
}

int main(int argc, char **argv)
{
    if (2 != argc) exit(1);
    const size_t COUNT = std::stoi(argv[1]);
    // Launch COUNT tasks with the default launch policy.
    std::vector<decltype(std::async(foo))> futures;
    futures.reserve(COUNT);
    while (futures.size() < COUNT) // reserve() may over-allocate: test size(), not capacity()
    {
        futures.push_back(std::async(foo));
    }
    // Collect the thread id each task ran on.
    std::vector<std::thread::id> ids;
    ids.reserve(futures.size());
    for (auto &f: futures)
    {
        ids.push_back(f.get());
    }
    // Count the distinct thread ids.
    std::sort(ids.begin(), ids.end());
    const auto end = std::unique(ids.begin(), ids.end());
    ids.erase(end, ids.end());
    std::cerr << "COUNT: " << COUNT << ": ids.size(): " << ids.size() << std::endl;
}
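
A caveat on this test: std::async(foo) uses the default launch policy, std::launch::async | std::launch::deferred, which allows the implementation to defer tasks instead of spawning threads. The distinct-id counts below show that this libstdc++ did spawn a thread per call, but code that depends on real concurrency should request it explicitly, e.g.:

// Must run on a new thread; throws std::system_error
// if the thread cannot be created.
auto f = std::async(std::launch::async, foo);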

The time was fine, but the number of distinct thread ids was much lower than expected (32748 instead of 43735):

bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 100000
COUNT: 100000: ids.size(): 32748
0:03.29

Then I un-commented the sleep line in foo to add a 2s sleep per task. The resulting timings are consistent with 2s up to 10K tasks or so, but beyond some point, tasks end up sharing the same thread id and the elapsed time increases by 2s for each additional task:

bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10056
COUNT: 10056: ids.size(): 10056
0:02.24
bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10057
COUNT: 10057: ids.size(): 10057
0:04.27
bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10058
COUNT: 10058: ids.size(): 10057
0:06.28
bob@vb:~/programming/cxx/async$ ps -eT | wc -l
277

So it looks like, for my problem on this system, the limit is on the order of 10K. I checked on another system where the limit was on the order of 4K.

I can't figure out:

  • why these values are so small
  • how to predict these values from the specs of the system
Come Raczy
  • Well, each of these threads needs to get some resources from the OS. The typical default size for the thread stack is 8MB, so a total of _thread-count*8MB_ of DRAM is needed just for that. It's not only about firing up more threads; you need to have the resources... Read [here](http://stackoverflow.com/questions/25814365/when-to-use-stdasync-vs-stdthreads) too. – Arash Mar 01 '17 at 02:57
  • A ridiculous number of tasks without an almost equally ridiculous number of processing cores isn't all that useful. The threads will spend most of their time fighting it out for access to a processor. Consider using a thread pool. – user4581301 Mar 01 '17 at 03:19
  • @Arash thanks for the link. Regarding DRAM use, even though the default thread stack size is 8MB, I would have expected this to be virtual memory, and that the actual amount of DRAM required would be what the thread is actually using (rounded up to the page size). Am I wrong? – Come Raczy Mar 01 '17 at 17:48
  • @user4581301 I can run way more than 10K of these tasks in parallel as concurrent processes. I was hoping that with std::async I would have at least the same capability. How would thread pools help? – Come Raczy Mar 01 '17 at 18:43
  • @Come Raczy, Well, 8MB is the maximum, and in reality you need less (the actual amount depends on the off-stack allocation), and true, it is virtual memory. But keep in mind that pushing the thread count that high means relying on swap in a program that actually does something (not this code). From a performance perspective, a huge number of threads means more work on the OS side for scheduling, more swapping, competing threads --> overhead. High-performance applications have a fixed thread pool equal to the number of hardware threads and push work into a queue, where threads grab work from it (a minimal sketch of that pattern follows these comments). – Arash Mar 01 '17 at 19:33
  • 10,000 threads, 100 processors: how many concurrent tasks are actually running? 100. And each of those 100 processors has 100 threads jockeying for position, probably slowing each other down due to all of the extra task-switching overhead as those threads get swapped in and out. But if you have 100 processors with 100 threads, and 100 jobs queued to each thread, you have a nice orderly progression and comparatively little overhead. Yes, some jobs will complete much earlier than others because they got into the queue earlier, but overall the problem will be solved faster. – user4581301 Mar 01 '17 at 20:15
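
For reference, a minimal sketch of the fixed-pool-plus-queue pattern described in the comments above (ThreadPool and submit are illustrative names, not from any particular library). For tasks that mostly wait, the pool would typically be sized well above hardware_concurrency(), or the blocking waits replaced with non-blocking I/O:

#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool()   // drains the queue, then joins the workers
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto &worker : workers_) worker.join();
    }
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;   // done_ is set and no work is left
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main()
{
    ThreadPool pool(std::max(1u, std::thread::hardware_concurrency()));
    for (int i = 0; i < 100; ++i)
        pool.submit([i] { /* high-latency task body goes here */ });
}   // ~ThreadPool finishes the queued work before returning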

1 Answer


With g++ on Linux, the straightforward answer seems to be "the maximum number of threads that can be created before pthread_create fails and returns EAGAIN". That number can be limited by several different values, and man pthread_create lists three of them (a probe sketch follows the list below):

  • the RLIMIT_NPROC soft resource limit (4096 on my CentOS 7 server and 43735 on my Ubuntu/VirtualBox laptop)
  • the value of /proc/sys/kernel/threads-max (2061857 and 87470 resp.)
  • the value of /proc/sys/kernel/pid_max (40960 and 32768 resp.)
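
Because several limits interact, the simplest way to find the effective cap on a given machine is to probe it: keep creating threads until creation fails. A minimal sketch of such a probe, assuming a 64-bit system (std::thread surfaces the pthread_create failure as std::system_error):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <system_error>
#include <thread>
#include <vector>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool release = false;
    std::vector<std::thread> threads;
    try
    {
        for (;;)   // park threads until pthread_create fails with EAGAIN
        {
            threads.emplace_back([&] {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return release; });
            });
        }
    }
    catch (const std::system_error &e)
    {
        std::cerr << "creation failed after " << threads.size()
                  << " threads: " << e.what() << std::endl;
    }
    {   // unblock and reap all the parked threads
        std::lock_guard<std::mutex> lock(m);
        release = true;
    }
    cv.notify_all();
    for (auto &t : threads) t.join();
}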

There is at least one other possible limit imposed by systemd, as man logind.conf indicates:

UserTasksMax=
    Sets the maximum number of OS tasks each user may run concurrently. This controls the TasksMax= setting of the per-user slice unit, see systemd.resource-control(5) for details. Defaults to 33%, which equals 10813 with the kernel's defaults on the host, but might be smaller in OS containers.
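
On systems where this applies, the effective value can be inspected directly; assuming a reasonably recent systemd (v227 or later, where per-unit task accounting exists), the TasksMax property of the per-user slice shows the ceiling:

systemctl show -p TasksMax user-$(id -u).slice

Raising UserTasksMax= in /etc/systemd/logind.conf lifts this particular limit.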

Come Raczy