In C++17 the standard algorithms are available in parallel versions. You pass an execution policy: sequenced (std::execution::seq), parallel (std::execution::par), or parallel and vectorised (std::execution::par_unseq), and the library does the multithreading for you in the background.

So for what you want to do, you can use std::transform with a lambda that performs the operation you want on every element of your input vector; the results are written into the results vector, which must be the same size:
#include <execution>
#include <algorithm>
#include <vector>

int compute_something(int i, int j) {
    return i * j;
}

int main()
{
    auto params = std::vector<int>(1000, 5);   // Input vector
    std::vector<int> results(1000, 0);         // Output vector, same size

    // Apply compute_something to every element in parallel
    std::transform(std::execution::par_unseq, params.begin(), params.end(),
                   results.begin(), [](int i) { return compute_something(i, 4); });
}
Of course, for a calculation as simple as the one in compute_something, it is possible to embed the computation directly in the lambda. Then the code becomes:
std::transform(std::execution::par_unseq, params.begin(), params.end(),
               results.begin(), [](int i) { return i * 4; });
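(With GCC, note that the parallel algorithms are implemented on top of Intel's TBB library, so you typically need to link with -ltbb for them to actually run in parallel.)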
Not all compilers have implemented the execution policies yet. So if your compiler doesn't support them, you can do it another way: use std::async and process the input vector in chunks. To do this, define a function that takes a pair of iterators and returns a vector of results; you can then combine the partial results at the end.
Example:
#include <future>
#include <vector>

using Iter = std::vector<int>::iterator;

std::vector<int> parallel_compute(Iter beg, Iter end)
{
    std::vector<int> results;
    // Reserve memory to avoid reallocations
    auto size = std::distance(beg, end);
    results.reserve(size);
    for (Iter it = beg; it != end; ++it)
    {
        results.push_back(*it * 4); // Add result to vector
    }
    return results;
}
int main()
{
    const int Size = 1000;
    // Chunk size
    const int Half = Size / 2;
    // Input vector
    auto params = std::vector<int>(Size, 5);

    // Create futures; each worker processes one half of the input
    auto fut1 = std::async(std::launch::async, parallel_compute, params.begin(), params.begin() + Half);
    auto fut2 = std::async(std::launch::async, parallel_compute, params.begin() + Half, params.end());

    // Get results
    auto res1 = fut1.get();
    auto res2 = fut2.get();

    // Combine results into one vector
    std::vector<int> results;
    results.insert(results.end(), res1.begin(), res1.end());
    results.insert(results.end(), res2.begin(), res2.end());
}
The std::launch::async policy ensures that each task runs on its own thread. However, I wouldn't create too many threads; one per core is a reasonable strategy. You can use std::thread::hardware_concurrency() to get the number of concurrent threads the system supports. Creating and managing threads introduces overhead and becomes counterproductive if you create too many.
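As a rough sketch of that strategy, here is the chunked std::async approach generalised to one chunk per hardware thread, reusing the parallel_compute worker from above (the chunking arithmetic is my own illustration; adapt it to your workload):

#include <algorithm>
#include <cstddef>
#include <future>
#include <iterator>
#include <thread>
#include <vector>

using Iter = std::vector<int>::iterator;

// Same worker as above: multiply every element by 4
std::vector<int> parallel_compute(Iter beg, Iter end)
{
    std::vector<int> results;
    results.reserve(std::distance(beg, end));
    for (Iter it = beg; it != end; ++it)
        results.push_back(*it * 4);
    return results;
}

int main()
{
    const std::size_t Size = 1000;
    auto params = std::vector<int>(Size, 5);

    // One chunk per supported hardware thread; hardware_concurrency()
    // may return 0, so fall back to a single chunk in that case
    const std::size_t numChunks = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunkSize = (Size + numChunks - 1) / numChunks; // Round up

    std::vector<std::future<std::vector<int>>> futures;
    for (std::size_t i = 0; i < numChunks; ++i)
    {
        auto beg = params.begin() + std::min(i * chunkSize, Size);
        auto end = params.begin() + std::min((i + 1) * chunkSize, Size);
        if (beg != end)
            futures.push_back(std::async(std::launch::async, parallel_compute, beg, end));
    }

    // Splice the partial results together in order
    std::vector<int> results;
    results.reserve(Size);
    for (auto& fut : futures)
    {
        auto chunk = fut.get();
        results.insert(results.end(), chunk.begin(), chunk.end());
    }
}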
Edit:
To avoid the allocations for the individual intermediate vectors, we can create one result vector up front and pass iterators into the result range to each parallel invocation of parallel_compute. Since each thread writes to a different part of the result vector, no synchronisation is needed:
#include <future>
#include <vector>

using Iter = std::vector<int>::iterator;

void parallel_compute(Iter beg, Iter end, Iter outBeg)
{
    for (Iter it = beg; it != end; ++it)
    {
        *outBeg++ = (*it * 4); // Write result into the output range
    }
}

int main()
{
    const int Size = 1000;
    // Chunk size
    const int Half = Size / 2;
    // Input vector
    auto params = std::vector<int>(Size, 5);
    // Output vector
    std::vector<int> results(Size, 0);

    // Create futures; each worker writes to its own half of the output
    auto fut1 = std::async(std::launch::async, parallel_compute, params.begin(), params.begin() + Half, results.begin());
    auto fut2 = std::async(std::launch::async, parallel_compute, params.begin() + Half, params.end(), results.begin() + Half);

    // Wait for both workers to finish
    fut1.wait();
    fut2.wait();
}
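One caveat: wait() only blocks until the task finishes. If parallel_compute could throw, prefer get() even though the result is void, because get() rethrows any exception stored in the future while wait() does not.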