
Problem background

I have a program that currently takes way too long to sum up large std::vectors of ~100 million elements using std::accumulate, and this is a bottleneck.

I want the calculation to be both faster and asynchronous, so the GUI/server doesn't block. It should use multithreading to reduce the time it takes to sum a vector.

I want to split up the summation so that each thread sums a part of the vector and then when all partial sums are calculated, each thread's partial sum should be added together to get the total summation.

Boost.Asio?

I was wondering how I could go about this with Boost.Asio? My program ideally needs to reuse the threads (like a thread group), and I'm not sure how to store and retrieve the partial sums and finally retrieve the sum of the partial sums.

I was thinking of creating a thread group whose threads call boost::asio::io_service::run, passing a handler to compute the partial sums, but I'm not sure how to pass the partial sums to another handler and add all the partial sums together.

It would be great if someone showed some skeleton code of how I can go about this.

Felix Glas
pezpezpez
  • Have a look at `future` and `packaged_task` in Boost.Thread instead. – Johannes S. Jan 20 '15 at 15:17
  • Interestingly, GNU has a parallel-mode STL (https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html) which likely does everything you wish automatically. – sehe Jan 20 '15 at 15:41
  • It is also worth mentioning the open-source [parallel_stl](https://parallelstl.codeplex.com/) by Artur Laksberg (C++ standardization committee member from Microsoft), which is an implementation of the proposal [N3960: Working Draft, Technical Specification for C++ Extensions for Parallelism](http://isocpp.org/blog/2014/03/n3960) (by Intel, Microsoft and NVidia). Can't wait to see what they come up with for C++17. – Ivan Aksamentov - Drop Jan 20 '15 at 23:40

2 Answers


Is Boost.Asio suitable for this problem?

The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.

I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.

A parallel algorithm

Here's an example of a parallel version of accumulate created by only using the standard library.

/* Minimum number of elements for the multithreaded algorithm.
   With fewer elements the accumulation runs on a single thread. */
static const int MT_MIN_SIZE = 10000;

template <typename InputIt, typename T>
auto parallel_accumulate(InputIt first, InputIt last, T init) {
    // Determine total size.
    const auto size = std::distance(first, last);
    // Determine how many parts the work shall be split into.
    // Note: hardware_concurrency() may return 0, so fall back to 1.
    const auto parts = (size < MT_MIN_SIZE)
        ? 1 : std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::future<T>> futures;

    // For each part, calculate its size and run std::accumulate on a separate thread.
    for (std::size_t i = 0; i != parts; ++i) {
        const auto part_size = (size * i + size) / parts - (size * i) / parts;
        futures.emplace_back(std::async(std::launch::async,
            [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
        std::advance(first, part_size);
    }

    // Wait for all threads to finish execution and accumulate the partial results.
    return std::accumulate(std::begin(futures), std::end(futures), init,
        [] (const T prev, auto& future) { return prev + future.get(); });
}

Live example (the parallel version performs about the same as the sequential one on Coliru, where probably only one core is available)

Timings

On my machine (using 8 threads) the parallel version gave, on average, a ~120 % boost in performance.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
5000000050000000

However, the absolute gain for 100,000,000 elements is only marginal (25 ms), although the gain might be greater when accumulating an element type that is more expensive to add than int.

OpenMP

As pointed out by @sehe in the comments, OpenMP might provide a simple solution to this problem (compile with `-fopenmp` on GCC/Clang), e.g.

template <typename T, typename U>
auto omp_accumulate(const std::vector<T>& v, U init) {
    U sum = init;

    // Each thread accumulates into a private sum; reduction(+:sum) combines them.
    // Note: an unsigned loop index requires OpenMP 3.0 or later.
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < v.size(); i++) {
        sum += v[i];
    }

    return sum;
}

On my machine this method performed the same as the parallel method using standard thread primitives.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
Sum: 5000000050000000
--------------------------------
OpenMP sum:
Time taken: 21 ms
Sum: 5000000050000000

  • interestingly, here's the same idea using OpenMP: http://stackoverflow.com/questions/28050669/can-i-report-progress-for-openmp-tasks - would be interesting to see how it compares – sehe Jan 21 '15 at 02:00
  • @sehe Added edit section with OpenMP alternative. Got somewhat different timings when measuring this time, don't know exactly why (didn't change anything). Anyway, I updated the timings for all 3 methods. – Felix Glas Jan 21 '15 at 03:21
  • Interesting results as I thought it would reduce the time by more than 1/2 given you have 8-cores. – pezpezpez Jan 21 '15 at 18:10
  • @pezpezpez Indeed, I guess summing 100 million `int`s is not a big enough workload for multithreading to be more than marginally beneficial. Multithreading always comes with heavy overhead for thread creation and synchronization etc. However, if the sequential calculation represents a heavy enough workload and the work can be mostly parallelized, then you will see a greater performance boost when using multithreading. Also see [Amdahl's law](http://en.wikipedia.org/wiki/Amdahl%27s_law). – Felix Glas Jan 21 '15 at 18:24

You can use Boost.Asio as a thread pool. But there's not a lot of sense in it unless you have... asynchronous I/O operations to coordinate.

In this answer to "c++ work queues with blocking" I show two thread_pool implementations:

  • Solution #1: one based on boost::asio::io_service
  • Solution #2: the other based on boost::thread primitives

Both accept any task compatible with the void() signature. This means you could wrap your function-that-returns-the-important-results in a packaged_task<...> and get the future<RetVal> from it.

sehe