I'd like to write a multi-threaded version of the following code

template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect(std::vector<In> const&values,
                         Cond const&cond, Func const&func)
{
    std::vector<Out> result;
    for(auto const&val : values)
        if(cond(val))
            result.emplace_back(func(val));
    return result;
}

where the order of selected elements does not matter.

One simple approach is

#include <atomic> // for std::atomic<size_t>

template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect(std::vector<In> const&values,
                         Cond const&cond, Func const&func)
{
    std::vector<Out> result(values.size());
    std::atomic<size_t> index(0);   // copy-init from 0 is ill-formed in C++11
    // some multithreaded for loop implementation
    parallel_for(size_t(0),values.size(),[&](size_t i) {
        if(cond(values[i]))
            result[index++] = func(values[i]);
    });
    result.resize(index);
    return result;
}

(Of course, initializing result is serial, but let's ignore that here.) This appears to work, but may not be lock-free. Is there a better way? In particular, can I avoid allocating too much memory when only a few input elements are selected?

The problem is very similar to std::copy_if (it would be std::transform_if, if that existed) -- how is the parallel version std::copy_if(std::execution::par, ...) implemented? (That is C++17, but I'm restricted to C++11.)

Walter
  • I wouldn't make all threads write to the same vector, but each to its own, and merge them at the end; this way you need synchronization only once, not "all the time". – 463035818_is_not_an_ai Nov 22 '18 at 12:40
  • Why do you use `std::vector` for concurrent `push_back` if you already use `tbb` (implied from your `parallel_for`)? `tbb::concurrent_vector` is designed for concurrent access. – llllllllll Nov 22 '18 at 12:45
  • Seems highly dependent on your exact details, but copying to separate vectors and merging sounds fine. – Passer By Nov 22 '18 at 13:30
  • @PasserBy +1, in addition the using of separate vectors or even writing in the separated spans of the same vector will really [increase the performance](https://www.arbinada.com/en/node/1626). – serge Nov 22 '18 at 14:35
  • @liliscent I tried your advice of using `tbb::concurrent_vector`, but found this resulted in **poorer performance**. I suspect its implementation does not avoid false sharing and hence runs into the same problems as my simple approach. – Walter Nov 26 '18 at 11:18
  • @Walter I haven't used `tbb` for a while. But if I remember correctly, you should make sure to `push_back` a block bigger than a cache line (64 bytes). Did you consider this in your implementation? – llllllllll Nov 26 '18 at 11:25

1 Answer

A decent way to do this is to use thread-local vectors that are merged at the end. This answer (in particular the last code sample) would be a good duplicate target, were it not for the fact that it uses OpenMP:

https://stackoverflow.com/a/18671256/9528746
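
Sketched in plain C++11 with `std::thread` instead of OpenMP (the name `collect_parallel` and the static chunking scheme are illustrative, not taken from the linked answer), the idea looks like this: each thread filters and transforms its own contiguous slice into a private vector, so the only synchronization is the final join, and the result holds only as many elements as were actually selected.

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <thread>
#include <utility>    // std::move
#include <vector>

// Hypothetical sketch: one private output buffer per thread,
// merged serially after all workers have joined.
template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect_parallel(std::vector<In> const& values,
                                  Cond const& cond, Func const& func,
                                  unsigned num_threads =
                                      std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 1;
    std::vector<std::vector<Out>> partial(num_threads);
    std::vector<std::thread> workers;
    std::size_t const chunk = (values.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            // Each thread owns partial[t]: no locks, no false sharing
            // on the element writes themselves.
            std::size_t const begin = t * chunk;
            std::size_t const end   = std::min(begin + chunk, values.size());
            for (std::size_t i = begin; i < end && begin < values.size(); ++i)
                if (cond(values[i]))
                    partial[t].emplace_back(func(values[i]));
        });
    for (auto& w : workers) w.join();

    // Serial merge: reserve exactly the number of selected elements.
    std::size_t total = 0;
    for (auto const& p : partial) total += p.size();
    std::vector<Out> result;
    result.reserve(total);
    for (auto& p : partial)
        for (auto& x : p)
            result.emplace_back(std::move(x));
    return result;
}
```

The merge is serial, but it touches each selected element exactly once and moves rather than copies, so for cheap-to-move `Out` it is usually negligible next to the parallel pass.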

Edit: Changed to a better answer.

Max Langhof