I want to make this function compute in parallel:

#include <iostream>
#include <vector>

int compute_something(int i, int j) {
    return i*j;
}

int main() {

    auto params = std::vector<int>(1000,5);
    std::vector<int> results;
    for (auto i : params)
        results.push_back(compute_something(i,4));

    return 0;
}

I pass arguments from a list to a function and want the results returned in order. I want to use asynchronous parallelization, since compute_something() will take different times, depending on the input. Also the input vector is much larger than the number of cores.

1 Answer

In C++17 the standard algorithms are available in parallel versions. You specify an execution policy: sequenced (std::execution::seq), parallel (std::execution::par), or parallel and vectorised (std::execution::par_unseq), and the library does the multithreading for you in the background.

So for what you want to do, you can use std::transform with a lambda that captures the operation you want to perform on every element of your input vector; the results are written into the results vector, which must already be the same size as the input:

#include <execution>
#include <algorithm>
#include <vector>

int compute_something(int i, int j) {
    return i * j;
}

int main()
{
    auto params = std::vector<int>(1000, 5);
    std::vector<int> results(1000, 0);
    std::transform(std::execution::par_unseq, params.begin(), params.end(),
        results.begin(), [](int i) { return compute_something(i, 4); }
    );
}

Of course, for a calculation as simple as the one in compute_something, it is possible to embed the computation directly in the lambda. The code then becomes:

std::transform(std::execution::par_unseq, params.begin(), params.end(),
        results.begin(), [](int i) { return i * 4; }
);

Not all compilers have implemented the execution policies yet. If your compiler doesn't support them, you can do it another way: use std::async and process the input vector in chunks. For that, define a new function that takes a pair of iterators and returns a result vector, then combine the partial results at the end.

Example:

#include <future>
#include <vector>

using Iter = std::vector<int>::iterator;

std::vector<int> parallel_compute(Iter beg, Iter end)
{
    std::vector<int> results;
    //Reserve memory to avoid reallocations
    auto size = std::distance(beg, end);
    results.reserve(size);

    for (Iter it = beg; it != end; ++it)
    {
        results.push_back(*it * 4); //Add result to vector
    }

    return results;
}

int main()
{
    const int Size = 1000;
    //Chunk size
    const int Half = Size / 2;
    //Input vector
    auto params = std::vector<int>(Size, 5);
    //Create futures
    auto fut1 = std::async(std::launch::async, parallel_compute, params.begin(), params.begin() + Half);
    auto fut2 = std::async(std::launch::async, parallel_compute, params.begin() + Half, params.end());
    //Get results
    auto res1 = fut1.get();
    auto res2 = fut2.get();
    //Combine results into one vector
    std::vector<int> results;
    results.insert(results.end(), res1.begin(), res1.end());
    results.insert(results.end(), res2.begin(), res2.end());
}

The std::launch::async policy ensures that two threads are actually created. However, I wouldn't create too many threads; one per core is a reasonable strategy. You can use std::thread::hardware_concurrency() to get the number of concurrent threads supported by the system. Creating and managing threads introduces overhead and can be counterproductive if you spawn too many.
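
For illustration, the two-chunk example above generalises to one chunk per hardware thread along these lines (a sketch only; the chunk arithmetic and the fallback value of 2 are my own choices, not anything mandated by the standard):

#include <future>
#include <thread>
#include <vector>

using Iter = std::vector<int>::iterator;

std::vector<int> parallel_compute(Iter beg, Iter end)
{
    std::vector<int> results;
    results.reserve(std::distance(beg, end));
    for (Iter it = beg; it != end; ++it)
        results.push_back(*it * 4);
    return results;
}

int main()
{
    auto params = std::vector<int>(1000, 5);

    //One chunk per hardware thread; fall back to 2 if the value is unknown (0)
    std::size_t n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;

    const std::size_t chunk = params.size() / n;
    std::vector<std::future<std::vector<int>>> futures;

    for (std::size_t t = 0; t < n; ++t)
    {
        Iter beg = params.begin() + t * chunk;
        //The last chunk takes any remainder
        Iter end = (t == n - 1) ? params.end() : beg + chunk;
        futures.push_back(std::async(std::launch::async, parallel_compute, beg, end));
    }

    //Collect in chunk order so the combined results keep the input order
    std::vector<int> results;
    results.reserve(params.size());
    for (auto& fut : futures)
    {
        auto part = fut.get();
        results.insert(results.end(), part.begin(), part.end());
    }
}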


Edit:

To avoid expensive allocations for individual small vectors, we can create a result vector at the start and pass iterators to the result range for each parallel invocation of parallel_compute. Since each thread will be accessing a different part of the result vector, we don't need synchronisation:

#include <future>
#include <vector>

using Iter = std::vector<int>::iterator;

void parallel_compute(Iter beg, Iter end, Iter outBeg)
{
    for (Iter it = beg; it != end; ++it)
    {
        *outBeg++ = (*it * 4); //Write the result through the output iterator
    }
}

int main()
{
    const int Size = 1000;
    //Chunk size
    const int Half = Size / 2;
    //Input vector
    auto params = std::vector<int>(Size, 5);
    //Output vector
    std::vector<int> results(Size, 0);
    //Create futures
    auto fut1 = std::async(std::launch::async, parallel_compute, params.begin(), params.begin() + Half, results.begin());
    auto fut2 = std::async(std::launch::async, parallel_compute, params.begin() + Half, params.end(), results.begin() + Half);
    //Get results
    fut1.wait();
    fut2.wait();
}
  • Lambda could directly use the computation instead of calling that free function and with `auto` i.e. `[](const auto i) { return i * 4; }`. – Azeem Apr 04 '20 at 03:49
  • 1
    @Azeem Yes, for a simple calculation as we have you can embed the logic within the lambda. I leave that to OP. – jignatius Apr 04 '20 at 04:45
  • I am having problems compiling the code. I'm using MinGW-w64 on Win10. Does it support the "execution library"? – Jake B. Apr 04 '20 at 06:39
  • @JakeB. Actually not all compilers support execution policies yet as it's a relatively new feature. Try adding the compiler option `-std=c++17`. I've got MS VS2019 and it does support it. You might find this useful: https://stackoverflow.com/questions/51031060/are-c17-parallel-algorithms-implemented-already – jignatius Apr 04 '20 at 06:44
  • @JakeB. I've added an alternative solution if you can't use execution policy with your compiler. – jignatius Apr 04 '20 at 08:02
  • I cannot get `<execution>` to load under Linux or Windows. Thank you for your alternative solution, but the problem is that if some tasks finish way before others, then the core has to wait for the slow one to finish. I would like to feed the waiting core as soon as it finishes – Jake B. Apr 04 '20 at 15:49
  • @JakeB. Actually for the second solution you don't need `#include <execution>`, so I have removed that, but I'm very surprised you don't have that header. That's part of the standard library. You have to check your installation. Sorry I can't help there. As for your other question about some task finishing before another: yes, that can happen. But they are running in parallel, so overall it should be faster (not accounting for thread creation, etc.). You're only merging the results at the end. – jignatius Apr 04 '20 at 16:46
  • @JakeB. hint: look for dynamic load balancing and work stealing – prog-fh Apr 04 '20 at 20:47
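
Picking up the hint in the last comment, here is a minimal sketch of dynamic load balancing (my own illustration, not part of the answer above): worker threads claim the next input index from a shared std::atomic counter the moment they finish their current item, so a fast item never leaves a core idle.

#include <atomic>
#include <thread>
#include <vector>

int compute_something(int i, int j) {
    return i * j;
}

int main()
{
    auto params = std::vector<int>(1000, 5);
    std::vector<int> results(params.size(), 0);

    //Shared counter handing out the next unprocessed index
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        //Each thread claims a fresh index as soon as it finishes the previous
        //one. Every index is written by exactly one thread, so no further
        //synchronisation is needed and the results stay in input order.
        for (std::size_t i = next.fetch_add(1); i < params.size(); i = next.fetch_add(1))
            results[i] = compute_something(params[i], 4);
    };

    std::size_t n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;

    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool)
        th.join();
}

Real code would typically claim a small batch of indices per fetch_add to reduce contention, or reach for a ready-made thread pool or work-stealing scheduler instead of raw threads.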