1

I've got a bunch of threads working on getting a result and storing it in a shared vector.

Right now I am doing this in the following way, but am wondering if there is a faster way of doing it (speed is the only priority here so please focus on that), e.g. to avoid false sharing and the like.

Globally, there exists a vector

std::vector<double> allResults;

and each thread stores its results in this vector by writing

std::for_each(allResults.begin() + i, allResults.begin() + i + n, 
    [](double& res) 
    { 
    res = .... //Calculate result and store it.
    });

Here, for each thread, i, n are indeces that correspond to a range of results that the current thread is supposed to calculate: allResults[i], ...., allResults[i + n - 1]. These ranges are unique for every thread. So each thread is working on its own range of results, without overlaps.

balki
  • 26,394
  • 30
  • 105
  • 151
DejayeBo
  • 11
  • 1
  • Are the threads joined before `allResults` is evaluated? – Scheff's Cat Oct 22 '18 at 14:04
  • Yes, the main thread waits for all threads to complete first. – DejayeBo Oct 22 '18 at 14:06
  • 3
    To avoid false sharing you need `sizeof(double) * n` to be evenly divisible by the cache line size. That can be tricky to do in portable code though. – NathanOliver Oct 22 '18 at 14:10
  • 2
    I recently did something similar in [SO: Eigen library with C++11 multithreading](https://stackoverflow.com/a/52724681/7478597) (My answer didn't rely on Eigen.) and [SO: Multi-threading benchmarking issues](https://stackoverflow.com/a/52835213/7478597). When I wrote the first, I was in doubt for quite the same reasons and googled a bit. So, the first answer contains some notes and links about this. – Scheff's Cat Oct 22 '18 at 14:11
  • 1
    Why not let each thread work with individual vectors and then merge them at the end? I would start with that implementation and improve only after profiling. – balki Oct 22 '18 at 14:16
  • You need to `reserve()` the space first and make sure to never resize the `vector` while working on it's elements in different threads. Also make sure that writes from different threads are separated by >64 bytes or else performance will suffer from false sharing. – rustyx Oct 22 '18 at 14:16
  • 1
    In addition to @NathanOliver: Concerning the issue of false cache sharing, I recently got a link from a colaborator (which I've not yet fully read and understood): [Eliminate False Sharing](http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206). – Scheff's Cat Oct 22 '18 at 14:18
  • Try this: Split the vector by the amount of threads. Each split is a separate vector. Have the threads work only on their vector. When all done, merge the vectors in to 1. – Ivan Rubinson Oct 22 '18 at 15:12
  • The advice of splitting the vectors and having each thread work with its own vector definitely seems to work, I got the code running faster... but the trouble is I actually don't know how much work a vector will do beforehand. So now I have the choice between allocating a very large vector for each thread (I know the MAX amount of work so I could create vectors of that size) or giving it a smaller size and then resizing as needed. Probably going to have to run some tests to see what works best. – DejayeBo Oct 22 '18 at 18:44

0 Answers0