I'd like to write a multi-threaded version of the following code

template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect(std::vector<In> const&values,
                         Cond const&cond, Func const&func)
{
    std::vector<Out> result;
    for(auto const&val : values)
        if(cond(val))
            result.emplace_back(func(val));
    return result;
}

where the order of selected elements does not matter.

One simple approach is

#include <atomic> // for std::atomic<size_t>

template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect(std::vector<In> const&values,
                         Cond const&cond, Func const&func)
{
    std::vector<Out> result(values.size());
    std::atomic<size_t> index(0);   // copy-init from 0 is ill-formed in C++11
    // some multithreaded for loop implementation
    parallel_for(size_t(0),values.size(),[&](size_t i) {
        if(cond(values[i]))
            result[index++] = func(values[i]);
    });
    result.resize(index);
    return result;
}

(Of course, initializing result is serial, but let's ignore that here.) This appears to work, but may not be lock-free. Is there a better way? In particular, can I avoid allocating too much memory when only a few input elements are selected?

The problem is very similar to std::copy_if (it would be std::transform_if, if that existed) -- how is the parallel version std::copy_if(std::execution::par, ...) implemented? (That is C++17, but I'm restricted to C++11.)

Walter
  • I wouldn't make all threads write to the same vector, but each to its own, and merge them at the end; this way you need synchronization only once, not "all the time". – 463035818_is_not_an_ai Nov 22 '18 at 12:40
  • Why do you use `std::vector` for concurrent `push_back` if you already use `tbb` (implied from your `parallel_for`)? `tbb::concurrent_vector` is designed for concurrent access. – llllllllll Nov 22 '18 at 12:45
  • Seems highly dependent on your exact details, but copying to separate vectors and merging sounds fine. – Passer By Nov 22 '18 at 13:30
  • @PasserBy +1, in addition the using of separate vectors or even writing in the separated spans of the same vector will really [increase the performance](https://www.arbinada.com/en/node/1626). – serge Nov 22 '18 at 14:35
  • @liliscent I tried your advice of using `tbb::concurrent_vector`, but found this resulted in **poorer performance**. I suspect its implementation does not avoid false sharing and hence runs into the same problems as my simple approach. – Walter Nov 26 '18 at 11:18
  • @Walter I haven't used `tbb` for a while. But if I remember correctly, you should make sure to `push_back` a block bigger than a cache line (64 bytes). Did you consider this in your implementation? – llllllllll Nov 26 '18 at 11:25

1 Answer

A decent way to do this is to use thread-local vectors that are merged at the end. This answer (in particular the last code sample) would be a good duplicate target, were it not for the fact that it uses OpenMP:

https://stackoverflow.com/a/18671256/9528746
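
Sketched in plain C++11 with `std::thread` instead of OpenMP (the name `collect_parallel` and the static chunking scheme are illustrative, not taken from the linked answer), the idea looks like this: each thread filters and transforms its own contiguous slice into a private vector, so the only synchronization is the final join, and the result holds only as many elements as were actually selected.

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <thread>
#include <utility>    // std::move
#include <vector>

// Hypothetical sketch: one private output buffer per thread,
// merged serially after all workers have joined.
template<typename Out, typename In, typename Cond, typename Func>
std::vector<Out> collect_parallel(std::vector<In> const& values,
                                  Cond const& cond, Func const& func,
                                  unsigned num_threads =
                                      std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 1;
    std::vector<std::vector<Out>> partial(num_threads);
    std::vector<std::thread> workers;
    std::size_t const chunk = (values.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            // Each thread owns partial[t]: no locks, no false sharing
            // on the element writes themselves.
            std::size_t const begin = t * chunk;
            std::size_t const end   = std::min(begin + chunk, values.size());
            for (std::size_t i = begin; i < end && begin < values.size(); ++i)
                if (cond(values[i]))
                    partial[t].emplace_back(func(values[i]));
        });
    for (auto& w : workers) w.join();

    // Serial merge: reserve exactly the number of selected elements.
    std::size_t total = 0;
    for (auto const& p : partial) total += p.size();
    std::vector<Out> result;
    result.reserve(total);
    for (auto& p : partial)
        for (auto& x : p)
            result.emplace_back(std::move(x));
    return result;
}
```

The merge is serial, but it touches each selected element exactly once and moves rather than copies, so for cheap-to-move `Out` it is usually negligible next to the parallel pass.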

Edit: Changed to a better answer.

Max Langhof