I want to know the time overhead of executing a method in a C++11 std::thread (or std::async) compared to direct execution. I know that thread pools can significantly reduce or even completely avoid this overhead, but I'd still like to get a better feeling for the numbers: roughly at what payload cost does thread creation pay off, and at what cost does pooling pay off?
I implemented a simple benchmark myself that boils down to:
```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>

constexpr size_t cNumRobustnessIterations = 10000;
constexpr size_t cNumPayloadRounds = 1000;  // example value; varied between runs

void PayloadFunction(double* aInnerRuntime, const size_t aNumPayloadRounds) {
    double vComputeValue = 3.14159;
    auto vInnerStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < aNumPayloadRounds; ++vIdx) {
        vComputeValue = std::exp2(std::log1p(std::cbrt(std::sqrt(std::pow(vComputeValue, 3.14152)))));
    }
    auto vInnerEnd = std::chrono::high_resolution_clock::now();
    *aInnerRuntime += std::chrono::duration<double, std::micro>(vInnerEnd - vInnerStart).count();
    volatile double vResult = vComputeValue;  // keep the compiler from optimizing the loop away
    (void)vResult;
}

int main() {
    double vInnerRuntime = 0.0;
    double vOuterRuntime = 0.0;
    auto vStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < cNumRobustnessIterations; ++vIdx) {
        std::thread vThread(PayloadFunction, &vInnerRuntime, cNumPayloadRounds);
        vThread.join();
    }
    auto vEnd = std::chrono::high_resolution_clock::now();
    vOuterRuntime = std::chrono::duration<double, std::micro>(vEnd - vStart).count();
    // normalize away the robustness iterations:
    vInnerRuntime /= static_cast<double>(cNumRobustnessIterations);
    vOuterRuntime /= static_cast<double>(cNumRobustnessIterations);
    const double vThreadCreationCost = vOuterRuntime - vInnerRuntime;
}
```
This works quite well, and I get typical thread creation costs of ~20-80 microseconds (us) on Ubuntu 18.04 with a modern Core i7-6700K. Frankly, this is cheaper than I expected!
But now comes the curious part: the thread overhead seems to depend (very reproducibly) on the actual time spent in the payload method! This makes no sense to me, but it happens reproducibly on six different hardware machines with various flavors of Ubuntu and CentOS!
- If I spend between 1 and 100 us inside `PayloadFunction`, the typical thread creation cost is around 20 us.
- When I increase the time spent in `PayloadFunction` to 100-1000 us, the thread creation cost rises to around 40 us.
- A further increase to more than 10000 us in `PayloadFunction` pushes the thread creation cost up to around 80 us.
I did not go to larger ranges, but I can clearly see a relation between payload time and thread overhead (as computed above). Since I cannot explain this behavior, I assume there must be a pitfall. Is it possible that my time measurement is that inaccurate? Or could CPU turbo cause different timings depending on the load? Can somebody shed some light?
Here is a random example of the timings I get. The numbers are representative of the pattern described above. The same pattern can be observed on many different machines (various Intel and AMD processors) and Linux flavors (Ubuntu 14.04, 16.04, 18.04, CentOS 6.9 and CentOS 7.4):
```
payload runtime 0.3 us., thread overhead 31.3 us.
payload runtime 0.6 us., thread overhead 32.3 us.
payload runtime 2.5 us., thread overhead 18.0 us.
payload runtime 1.9 us., thread overhead 21.2 us.
payload runtime 2.5 us., thread overhead 25.6 us.
payload runtime 5.2 us., thread overhead 21.4 us.
payload runtime 8.7 us., thread overhead 16.6 us.
payload runtime 18.5 us., thread overhead 17.6 us.
payload runtime 36.1 us., thread overhead 17.7 us.
payload runtime 73.4 us., thread overhead 22.2 us.
payload runtime 134.9 us., thread overhead 19.6 us.
payload runtime 272.6 us., thread overhead 44.8 us.
payload runtime 543.4 us., thread overhead 65.9 us.
payload runtime 1045.0 us., thread overhead 70.3 us.
payload runtime 2082.2 us., thread overhead 69.9 us.
payload runtime 4160.9 us., thread overhead 76.0 us.
payload runtime 8292.5 us., thread overhead 79.2 us.
payload runtime 16523.0 us., thread overhead 86.9 us.
payload runtime 33017.6 us., thread overhead 85.3 us.
payload runtime 66242.0 us., thread overhead 76.4 us.
payload runtime 132382.4 us., thread overhead 69.1 us.
```