
I want to know the time overhead of executing a method in a C++11 std::thread (or std::async) compared to direct execution. I know that thread pools can significantly reduce or even completely avoid this overhead, but I'd still like to get a better feeling for the numbers. I'd like to know roughly how much computation a task needs before thread creation pays off, and how much before pooling pays off.

I implemented a simple benchmark myself that boils down to:

#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>

// Illustrative values: cNumPayloadRounds is varied between runs to control the
// time spent in the payload; cNumRobustnessIterations repeats the measurement.
constexpr size_t cNumPayloadRounds = 100;
constexpr size_t cNumRobustnessIterations = 10000;

void PayloadFunction(double* aInnerRuntime, const size_t aNumPayloadRounds) {
    double vComputeValue = 3.14159;

    auto vInnerStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < aNumPayloadRounds; ++vIdx) {
        vComputeValue = std::exp2(std::log1p(std::cbrt(std::sqrt(std::pow(vComputeValue, 3.14152)))));
    }
    auto vInnerEnd = std::chrono::high_resolution_clock::now();
    *aInnerRuntime += static_cast<std::chrono::duration<double, std::micro>>(vInnerEnd - vInnerStart).count();

    // keep the result observable so the compiler cannot drop the payload loop
    volatile double vResult = vComputeValue;
}

int main() {
    double vInnerRuntime = 0.0;
    double vOuterRuntime = 0.0;

    auto vStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < cNumRobustnessIterations; ++vIdx) {
        std::thread vThread(PayloadFunction, &vInnerRuntime, cNumPayloadRounds);
        vThread.join();
    }
    auto vEnd = std::chrono::high_resolution_clock::now();
    vOuterRuntime = static_cast<std::chrono::duration<double, std::micro>>(vEnd - vStart).count();

    // normalize away the robustness iterations:
    vInnerRuntime /= static_cast<double>(cNumRobustnessIterations);
    vOuterRuntime /= static_cast<double>(cNumRobustnessIterations);

    const double vThreadCreationCost = vOuterRuntime - vInnerRuntime;
}

This works quite well and I can get typical thread creation costs of ~20-80 microseconds (us) on Ubuntu 18.04 with a modern Core i7-6700K. For one, this is cheaper than I expected!

But now comes the curious part: the thread overhead seems to depend (very reproducibly) on the actual time spent in the payload method! This makes no sense to me, but it happens reproducibly on six different hardware machines with various flavors of Ubuntu and CentOS!

  1. If I spend between 1 and 100us inside PayloadFunction, the typical thread creation cost is around 20us.
  2. When I increase the time spent in PayloadFunction to 100-1000us, the thread creation cost increases to around 40us.
  3. A further increase to more than 10000us in PayloadFunction again increases the thread creation cost to around 80us.

I did not go to larger ranges, but I can clearly see a relation between payload time and thread overhead (as computed above). Since I cannot explain this behavior, I assume there must be a pitfall. Is it possible that my time measurement is this inaccurate? Or could CPU turbo cause different timings depending on the higher or lower load? Can somebody shed some light?

Here is a random example of the timings I get. The numbers are representative of the pattern described above. The same pattern can be observed on many different machines (various Intel and AMD processors) and Linux flavors (Ubuntu 14.04, 16.04, 18.04, CentOS 6.9 and CentOS 7.4):

payload runtime      0.3 us., thread overhead  31.3 us.
payload runtime      0.6 us., thread overhead  32.3 us.
payload runtime      2.5 us., thread overhead  18.0 us.
payload runtime      1.9 us., thread overhead  21.2 us.
payload runtime      2.5 us., thread overhead  25.6 us.
payload runtime      5.2 us., thread overhead  21.4 us.
payload runtime      8.7 us., thread overhead  16.6 us.
payload runtime     18.5 us., thread overhead  17.6 us.
payload runtime     36.1 us., thread overhead  17.7 us.
payload runtime     73.4 us., thread overhead  22.2 us.
payload runtime    134.9 us., thread overhead  19.6 us.
payload runtime    272.6 us., thread overhead  44.8 us.
payload runtime    543.4 us., thread overhead  65.9 us.
payload runtime   1045.0 us., thread overhead  70.3 us.
payload runtime   2082.2 us., thread overhead  69.9 us.
payload runtime   4160.9 us., thread overhead  76.0 us.
payload runtime   8292.5 us., thread overhead  79.2 us.
payload runtime  16523.0 us., thread overhead  86.9 us.
payload runtime  33017.6 us., thread overhead  85.3 us.
payload runtime  66242.0 us., thread overhead  76.4 us.
payload runtime 132382.4 us., thread overhead  69.1 us.
emmenlau
  • `vThread.join();` waits for the thread to terminate, so you're serializing execution by not starting a new thread until the previous thread has notified the main thread. Either detach it or make an array of threads. – Peter Cordes Oct 02 '18 at 13:04
  • Possible duplicate of [What is different between join() and detach() for multi threading in C++?](https://stackoverflow.com/questions/37015775/what-is-different-between-join-and-detach-for-multi-threading-in-c) – Peter Cordes Oct 02 '18 at 13:04
  • Oh, I think you're doing that intentionally? But that means you're benchmarking the time to notify the main thread as well, not just the time to *create* new threads. – Peter Cordes Oct 02 '18 at 13:31
  • @PeterCordes yes, it is intentional to join the threads immediately. This implementation creates a serial execution of threads, and I can compare the net time spent in the payload to the outside time, to see the overhead from thread creation. It is also intentional to add the time to notify the main thread, because I want the total "cost overhead" of threaded payload execution. The question is not a duplicate and I do not want to detach threads. – emmenlau Oct 02 '18 at 15:35
  • Your title says you're only trying to microbenchmark thread *creation* time, but you're actually testing the whole round-trip for spawning a single worker thread and waiting for its result, with no parallelism. A TL:DR summary at the top of that might be a good idea. – Peter Cordes Oct 02 '18 at 15:47
  • Is your CPU staying pegged at 4GHz the whole time? Maybe check with `perf stat`. Maybe disable turbo and set your EPP to performance with `sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo performance > "$i";done'`. That's probably not it; if anything you'd expect a longer workload to keep the CPU running at max and reduce overhead. Maybe caches get cold, but an extra 60us seems like too much. – Peter Cordes Oct 02 '18 at 15:49
  • And BTW, overhead that increases as the payload grows by powers of 10 is not "linear". – Peter Cordes Oct 02 '18 at 15:51
  • Thanks @PeterCordes I've improved the post with your suggestions and comments! – emmenlau Oct 02 '18 at 15:58
  • Maybe there is a fast path for join over very short intervals where it spins before blocking? Also, the longer you run the payload thread the more stuff can get scheduled on the CPU that will run the main thread and so you'll have more cache misses when you come back. – BeeOnRope Oct 02 '18 at 18:01
  • @BeeOnRope I agree that these are possible explanations. But with respect to the observed timings I would find them surprising. The typical timings are four times more expensive for longer-running payloads, with an almost continuous growth from short to long overhead. From a fast path for join, I would rather expect a single steep increase in overhead? And the main thread (as well as the whole machine) is mostly idle during my benchmarks, so there is no large pressure on the cache or other components. – emmenlau Oct 08 '18 at 07:57

2 Answers


I have some hypotheses. The first:

You are running this benchmark on an unloaded system, but presumably there's still a low level of background activity happening. Then:

  • The main thread gets started on a CPU core, which I will call core 1.
  • The main thread fires off a child thread to run PayloadFunction.
  • At that moment the main thread is still running inside the vThread constructor/syscall, so the child thread gets scheduled to run on core 2, which is free.
  • A moment later the main thread calls join and gets suspended.
  • The child thread keeps running on core 2, core 1 remains unoccupied.

Now, if the payload runtime is not too high, most of the time when the child exits, core 1 is still free. The main thread wakes up and gets scheduled again on core 1 by a smart system scheduler*.

But sometimes, when core 1 is free, a random background task wakes up and is scheduled to that core. Then the main thread wakes up again, but core 1 is still occupied. The scheduler notices that core 2 or some other core in the system is free, and migrates the main thread to that core. Thread migration is a relatively expensive operation. If the new core is asleep, it needs to be sent an inter-processor interrupt to wake it up. Even if that is not necessary, the main thread will at least incur a slowdown as the caches on the new core need to load its data. I expect the new core to be core 2 most of the time, as that core just finished its child thread and is now running the scheduler that just found out that the main thread is runnable again.

1a: If the scheduler remembers, for every thread, which core it last ran on, and tries to schedule threads on that same core again when they become runnable, then this scenario only depends on the probability that core 1 is occupied when the main thread wakes up. That probability should not depend strongly on how long the core was idle. But there may be some reason that the system does not get a chance to schedule a different task to core 1 if the main thread is only suspended for a very short time. This corresponds somewhat to the data you got, as there seems to be a discontinuity around a payload runtime of 270 µs.

1b: If the scheduler only remembers the last thread each core ran, and only tries to run a thread again on the same core if nothing else has been running on that core in between, then we can expect the probability of the main thread being woken up on core 1 to depend linearly on how long the thread was asleep. The averaged cost per loop iteration would then asymptotically approach the delay for migrating a thread to another core.

In your measurements I think there is too much jitter to strongly favor one of the above options over the other.

* I'm not entirely sure how smart Windows and Linux are in scheduling threads on the same core that they ran on last time, but a quick google showed that at least some schedulers try to do that. Here's an article describing some of the things the Linux scheduler does, which I have only quickly skimmed but seems interesting.

Hypothesis two:

When a CPU core goes to sleep because there is no work for it to do, it presumably leaves the process context of the last process that ran on it intact. I expect that it only switches the process context when it finds an actual new process to run. If a core wakes up and finds that it can continue running the same task it was running before, it presumably notices that it does not need to change context.

If the above assumption holds, that would also mean that the time it takes for a thread to be woken up and continue running depends on whether it was the last thread to run on a core, as in that case its context (including e.g. memory mapping, TLB cache, etc) does not need to be reloaded. The probability of something else being scheduled while the main thread is asleep is linearly proportional to how long the thread was asleep, so this would show behavior similar to hypothesis 1b.

Ways to test

Under all of the above hypotheses we can expect certain behavior:

  • If you were to measure the time of each individual thread create/join cycle, you would see that there is a fast and a slow time, or maybe more than two levels, corresponding to whether the thread got migrated. There probably exists a way to find out which physical core you are currently running on, so you could try to verify this directly (see the sketch after this list).
  • The slowdown you measure does not happen because individual thread create/joins become more expensive, but because you are taking the slow mode more often.
  • You could also try to run your program with a CPU affinity set so it will only run on one core. Then that one core will always be occupied and other processes will run on other cores. That should eliminate the difference between fast and slow operations. You might even get a speedup compared to the current fast case, as there won't need to be any inter-core communication to start the child thread. There will be a little contention during the time window where the child thread has started but the main thread has not yet suspended itself in join, but I expect that to be less than the inter-processor communication delay.
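
Here is a minimal sketch of the first point, assuming Linux with glibc (`sched_getcpu` is a GNU extension; the payload, iteration count and output format are only illustrative, not part of your benchmark):

// Time each create/join cycle individually and record the core the main
// thread resumes on (g++ defines _GNU_SOURCE by default, which declares
// sched_getcpu in <sched.h>).
#include <chrono>
#include <cstdio>
#include <sched.h>
#include <thread>

static void Payload() {
    volatile double v = 3.14159;
    for (int i = 0; i < 1000; ++i) v = v * 1.0000001;  // stand-in payload
}

int main() {
    for (int i = 0; i < 100; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        std::thread worker(Payload);
        worker.join();
        auto t1 = std::chrono::high_resolution_clock::now();
        const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        // If the hypotheses hold, the slow cycles should coincide with the main
        // thread waking up on a different core than before.
        std::printf("cycle %3d: %8.1f us, main thread now on core %d\n",
                    i, us, sched_getcpu());
    }
}

For the last point you can pin the whole process to one core without changing the code, e.g. with `taskset -c 0 ./a.out`.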

You could try to distinguish between hypotheses 1a and 1b by taking measurements with less jitter and trying to find out if the increase in overhead matches the expected overhead for both scenarios. I'm not sure you can distinguish 1b and 2, but you could also try to read up on the scheduler.

JanKanis
  • If a core is idle for longer, it can go into a deeper sleep state and e.g. power down its private caches, losing the info in them, making execution slower when it wakes. [Does Cache empty itself if idle for a long time?](https://stackoverflow.com/q/59659702). This includes branch-predictor state: [Branch Predictor Entries Invalidation upon program finishes?](https://stackoverflow.com/a/59470308). – Peter Cordes Aug 01 '20 at 21:52
  • Also, the deeper the sleep, the longer it takes to wake up (e.g. for voltage to stabilize on parts of the core that fully powered down). [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147) shows that a Skylake loses about 8.5us of execution time when just changing CPU frequency/voltage between two P-states (like max turbo and some other speed). I think this kind of effect may be the explanation. Good idea to think about what happens to the main thread and its core while the new thread runs. – Peter Cordes Aug 01 '20 at 21:55

You might be executing some of the code on the "wrong" side of the timing instructions. A simple way to avoid that is to call the special x86 instruction CPUID. On GCC, you can do it this way:

#include <cpuid.h>

unsigned out[4];
__get_cpuid(1, &out[0], &out[1], &out[2], &out[3]);

Put such a call before you start timing and after you stop timing. It will act as a "fence" preventing the reordering of operations across your timing boundaries.
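
Applied to the inner timing in PayloadFunction from the question, the placement could look like this sketch (variable names are taken from the question; the fences go before the first timestamp and after the second one, as described above):

#include <chrono>
#include <cmath>
#include <cpuid.h>
#include <cstddef>

void PayloadFunction(double* aInnerRuntime, const size_t aNumPayloadRounds) {
    double vComputeValue = 3.14159;
    unsigned vCpuInfo[4];

    // serializing CPUID before the first timestamp ...
    __get_cpuid(1, &vCpuInfo[0], &vCpuInfo[1], &vCpuInfo[2], &vCpuInfo[3]);
    auto vInnerStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < aNumPayloadRounds; ++vIdx) {
        vComputeValue = std::exp2(std::log1p(std::cbrt(std::sqrt(std::pow(vComputeValue, 3.14152)))));
    }
    auto vInnerEnd = std::chrono::high_resolution_clock::now();
    // ... and after the second one, so no payload work can drift across the
    // timing boundaries.
    __get_cpuid(1, &vCpuInfo[0], &vCpuInfo[1], &vCpuInfo[2], &vCpuInfo[3]);

    *aInnerRuntime += static_cast<std::chrono::duration<double, std::micro>>(vInnerEnd - vInnerStart).count();
    volatile double vResult = vComputeValue;
}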

John Zwinck
  • That could account for maybe a full out-of-order-execution window of 224 uops (Skylake ROB size https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Execution_engine). On a ~4GHz CPU at maybe 0.1 instructions per clock worst case if there were slow cache misses, that might account for `224 uops / 0.1 IPC / (4 GHz) = 560 ns` = 0.5 microseconds, as a rough back-of-the-envelope calculation. The OP is seeing effects in the 20 us range. – Peter Cordes Oct 02 '18 at 13:29