Consider this case:
for (...)
{
    const size_t count = ...

    for (size_t i = 0; i < count; ++i)
    {
        calculate(i); // thread-safe function
    }
}
What is the most elegant way to maximize performance here using C++17 and/or Boost?
Cyclic "create + join" threads makes no sense because of huge overhead (which in my case exactly equals possible gain).
So I have to create N threads only once and keep them synchronized with the main one (using: mutex, shared_mutex, condition_variable, atomic, etc.). It appeared to be quite difficult task for such common and clear situation (in order to make everything really safe and fast). Sticking with it during days I have a feeling of "inventing a bicycle"...
- Update 1: calculate(x) and calculate(y) can (and should) run in parallel
- Update 2: std::atomic::fetch_add (or something similar) is preferable to a queue (or similar)
- Update 3: extreme amounts of computation (i.e. millions of "outer" calls and hundreds of "inner" ones)
- Update 4: calculate() changes the internal state of an object and does not return a value
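To make the intent concrete, this is roughly the kind of reusable worker pool I have been trying to build (only a sketch, not my actual code; SimplePool and run are placeholder names): the workers are created once, woken per batch through a condition_variable, and hand out indices with std::atomic::fetch_add.

#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SimplePool
{
public:
    explicit SimplePool(const size_t workers)
    {
        for (size_t w = 0; w < workers; ++w)
        {
            threads_.emplace_back([this] { worker_loop(); });
        }
    }

    ~SimplePool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_workers_.notify_all();
        for (auto& t : threads_) { t.join(); }
    }

    // run job(0) .. job(count - 1) on the pool and block until all are done
    void run(const size_t count, std::function<void(size_t)> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            job_ = std::move(job);
            count_ = count;
            next_index_ = 0;
            busy_workers_ = threads_.size();
            ++generation_;          // tells the workers a new batch is ready
        }
        wake_workers_.notify_all();

        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return busy_workers_ == 0; });
    }

private:
    void worker_loop()
    {
        size_t seen_generation = 0;
        for (;;)
        {
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_workers_.wait(lock, [&] { return stop_ || generation_ != seen_generation; });
                if (stop_) { return; }
                seen_generation = generation_;
            }

            // claim indices until the batch is exhausted
            for (size_t i = next_index_.fetch_add(1); i < count_; i = next_index_.fetch_add(1))
            {
                job_(i);
            }

            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--busy_workers_ == 0) { all_done_.notify_one(); }
            }
        }
    }

    std::vector<std::thread> threads_;
    std::mutex mutex_;
    std::condition_variable wake_workers_;
    std::condition_variable all_done_;
    std::function<void(size_t)> job_;
    std::atomic<size_t> next_index_{0};
    size_t count_ = 0;
    size_t busy_workers_ = 0;
    size_t generation_ = 0;
    bool stop_ = false;
};

The pool would be constructed once, e.g. SimplePool pool(std::thread::hardware_concurrency());, and each outer iteration would become a single pool.run(count, [&](size_t i) { calculate(i); });.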
Intermediate solution
For some reason "async + wait" is much faster than "create + join" threads. The following two examples both give a 100% speed increase:
Example 1
for (...)
{
    const size_t count = ...

    // uses std::async / std::future from <future> and std::ref from <functional>
    future<void> execution[cpu_cores];

    for (size_t x = 0; x < cpu_cores; ++x)
    {
        execution[x] = async(launch::async, ref(*this), x, count);
    }
    for (size_t x = 0; x < cpu_cores; ++x)
    {
        execution[x].wait();
    }
}

// each task handles every cpu_cores-th index, starting at its own offset x
void operator()(const size_t x, const size_t count)
{
    for (size_t i = x; i < count; i += cpu_cores)
    {
        calculate(i);
    }
}
Example 2
for (...)
{
    index = 0;

    const size_t count = ...

    future<void> execution[cpu_cores];

    for (size_t x = 0; x < cpu_cores; ++x)
    {
        execution[x] = async(launch::async, ref(*this), count);
    }
    for (size_t x = 0; x < cpu_cores; ++x)
    {
        execution[x].wait();
    }
}

atomic<size_t> index;

// each task keeps claiming the next unprocessed index until none are left
void operator()(const size_t count)
{
    for (size_t i = index.fetch_add(1); i < count; i = index.fetch_add(1))
    {
        calculate(i);
    }
}
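Compared to Example 1, Example 2 pays for a shared atomic counter but gets automatic load balancing: each task claims the next unprocessed index with fetch_add, so a few slow calculate(i) calls cannot leave a whole fixed stripe of work stuck on one core.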
Is it possible to make it even faster by creating the threads only once and then synchronizing them with small overhead?
Final solution
An additional +20% speed increase compared to std::async!
// index is a pre-filled size_t array (_countof is MSVC-specific)
for (size_t i = 0; i < _countof(index); ++i) { index[i] = i; }

// for_each_n is in <algorithm>, par_unseq in <execution> (namespace std::execution)
for_each_n(par_unseq, index, count, [&](const size_t i) { calculate(i); });
Is it possible to avoid the redundant "index" array?
Yes:
// counting_iterator is e.g. boost::counting_iterator from <boost/iterator/counting_iterator.hpp>
for_each_n(par_unseq, counting_iterator<size_t>(0), count,
    [&](const size_t i)
    {
        calculate(i);
    });
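For completeness, a self-contained version of this final variant might look as follows (the Worker, process and results names and the body of calculate are placeholders for my real code; it assumes a standard library with parallel algorithms, e.g. MSVC or GCC with TBB, plus Boost.Iterator for counting_iterator):

#include <algorithm>                             // std::for_each_n
#include <cstddef>
#include <execution>                             // std::execution::par_unseq
#include <vector>
#include <boost/iterator/counting_iterator.hpp>  // boost::counting_iterator

struct Worker
{
    std::vector<double> results;   // stand-in for the real internal data

    // stand-in for the real thread-safe calculate(): each call writes only
    // results[i], so concurrent calls with different i do not conflict
    void calculate(const size_t i) { results[i] = static_cast<double>(i) * i; }

    void process(const size_t count)
    {
        results.resize(count);
        std::for_each_n(std::execution::par_unseq,
                        boost::counting_iterator<size_t>(0), count,
                        [&](const size_t i) { calculate(i); });
    }
};

int main()
{
    Worker w;
    w.process(1000);   // runs calculate(0) .. calculate(999) in parallel
}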