
I need to fill a huge std::vector<unsigned int> (7734500 elements) with random values, and I am trying to do it in parallel with multiple threads to make it faster. Here is the code I have so far:

std::random_device rd; // seed generator

std::mt19937_64 generator{rd()}; // generator initialized with seed from rd

static const unsigned int NUM_THREADS = 4;


std::uniform_int_distribution<> initialize(unsigned long long int modulus)
{
    std::uniform_int_distribution<> unifDist{0, (int)(modulus-1)};
    return unifDist;
}


void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}


std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v;
    v.resize(rows*columns);

    // number of entries each thread will take care of
    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v), i*positionsEachThread,
            (i+1)*positionsEachThread, std::ref(dist));
        // threads[i].join();
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v),
        (NUM_THREADS-1)*positionsEachThread, rows*columns, std::ref(dist));
    // threads[NUM_THREADS - 1].join();

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    return v;
}

At the moment, it takes approximately 0.3 seconds: do you think there is a way to make it more efficient?


Edit: Giving each thread its own generator

I have modified the routine as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    std::mt19937_64 generator{rd()};
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}

and the running time dropped by one half. So I am still sharing the std::random_device but each thread has its own std::mt19937_64.


Edit: Giving each thread its own vector and then concatenating

I changed the code as follows:

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int length,
    std::uniform_int_distribution<>& dist)
{
    vector.reserve(length);
    std::mt19937_64 generator{rd()};
    for(unsigned int i = 0 ; i < length ; ++i)
    {
        vector.push_back(dist(generator));
    }
}

and

std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v[NUM_THREADS];

    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v[i]), positionsEachThread, std::ref(dist));
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v[NUM_THREADS-1]),
        rows*columns - (NUM_THREADS-1)*positionsEachThread, std::ref(dist));

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    std::vector<unsigned int> finalVector;
    finalVector.reserve(rows*columns);

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        finalVector.insert(finalVector.end(), v[i].begin(), v[i].end());
    }

    return finalVector;
}

The execution time is slightly worse than before, when I was using just one vector shared between all the threads. Am I missing something or can it just happen?


Edit: using a different PRNG + benchmarks

Using a different PRNG (as suggested in some comments/answers) helps a lot: I tried xorshift128+, and here is the implementation I am using:

class xorShift128PlusGenerator
{
public:
    xorShift128PlusGenerator()
    {
        state[0] = rd();
        state[1] = rd();
    };


    std::uint64_t next()
    {
        std::uint64_t x = state[0];
        std::uint64_t const y = state[1];
        state[0] = y;
        x ^= x << 23; // a
        state[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
        return state[1] + y;
    }


private:
    std::random_device rd; // seed generator
    std::uint64_t state[2]; // xorshift128+ needs 64-bit state words

};

Then the routine is as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end)
{
    xorShift128PlusGenerator prng;
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = prng.next();
    }
}

Since I am now home and I am using a different (and more powerful) machine, I redid the tests to compare the results. Here is what I obtain:

  • Mersenne Twister with one generator per thread: 0.075 seconds
  • xorshift128+ shared between all threads: 0.023 seconds
  • xorshift128+ with one generator per thread: 0.023 seconds

Note: the execution time varies at each repetition. These are just typical values.

So there seems to be no difference if the xorshift generator is shared or not, but with all these improvements the execution time has dropped significantly.

minomic
  • Why do you `join` the thread as soon as you create it? That's essentially the same as doing it sequentially. – TartanLlama Feb 22 '16 at 10:31
  • @TartanLlama You are right! I changed the code, but the execution time is always the same (if not slightly worse). This makes me think that I am not getting anything out of using multiple threads – minomic Feb 22 '16 at 10:35
  • Are your threads running simultaneously (i.e. separate cores)? – Mohamad Elghawi Feb 22 '16 at 10:37
  • Also, you might be better off with a RNG per thread rather than sharing one. – TartanLlama Feb 22 '16 at 10:38
  • @MohamadElghawi Yes, I am using an i5 with four cores, so I guess each thread is run separately – minomic Feb 22 '16 at 10:38
  • @Niall Uhm, this is an interesting idea: I will try! – minomic Feb 22 '16 at 10:38
  • @TartanLlama Another interesting idea: will try this as well – minomic Feb 22 '16 at 10:39
  • You have a race on your `generator` because you are accessing it unsynchronized from several threads. Make it `thread_local` perhaps? – Galik Feb 22 '16 at 10:39
  • PRNGs have a state and therefore are not generally thread-safe. – Daniel Langr Feb 22 '16 at 10:39
  • @TartanLlama and Galik Indeed, using a generator for each thread made the running time drop to approximately 0.15 seconds. Now I want to try to give each thread its own vector – minomic Feb 22 '16 at 10:45
  • Also consider using some faster generator, see the answer for details. Drawback is that they are not in the C++ standard library. – Daniel Langr Feb 22 '16 at 10:47
  • If your processor supports hyperthreading you could try running on 8 cores rather than 4. – TartanLlama Feb 22 '16 at 10:49
  • @DanielLangr Okay, thanks: I will take a look at different generators as well – minomic Feb 22 '16 at 10:49
  • @TartanLlama I tried but the running time is slightly worse: maybe this processor is a bit too old to support hyperthreading... – minomic Feb 22 '16 at 10:50
  • False Sharing. https://en.wikipedia.org/wiki/False_sharing – Robinson Feb 22 '16 at 10:51
  • @DanielLangr Are you sure? Actually I am treating the last thread in a different way exactly for that reason... But maybe I am wrong – minomic Feb 22 '16 at 11:01
  • Sorry, overlooked your last thread treatment, you are right of course :) – Daniel Langr Feb 22 '16 at 11:05
  • instead of `NUM_THREADS = 4;` try `NUM_THREADS = std::thread::hardware_concurrency();`, that is, so you don't guess the number of cores (some cores support hyper-threading) – David Haim Feb 22 '16 at 11:31
  • @DavidHaim I checked and it returns 0. Maybe because I am compiling with `std=c++0x` instead of `std=c++11`? Anyway I am forced to do so because otherwise I get `unrecognized command line option ‘-std=c++11’` – minomic Feb 22 '16 at 12:04
  • On the current edit. What that probably means is that the time to collate is longer than the time lost to cache contention. The timing is sensitive to the number of elements being used, the processor etc. You should look to make reasonable measurements over the target processor environments and pick an implementation that offers the desired performance. There is no single answer to "what is the most efficient"; the results vary with the processor, element count, initialisations etc. – Niall Feb 22 '16 at 14:17
  • @Niall I see, I was just curious to know whether the code could be written in a better (i.e. more efficient) way. Then, if this is not the case, it seems the implementation with just one shared vector is the fastest in my situation – minomic Feb 22 '16 at 14:21
  • @minomic. I've made a note in my answer. I think you've done the appropriate thing: run through the process of evolving solutions that identify bottlenecks and then implement appropriately to minimise their effect - there will always be limits, and given the size of data you are dealing with, you are possibly hitting those limits. The only other thing to do is to run a performance analyser on the code and identify the "slow" parts of it. Depending on the compiler, you can also use [profile guided optimisations (linked to MS)](https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx). – Niall Feb 22 '16 at 14:30
  • Have you tried a different type of PRNG (e.g., Xorshift+)? I would like to see the impact on the running time. – Daniel Langr Feb 22 '16 at 14:42
  • @DanielLangr You will not believe it but I was trying it right now. For the moment I have tried the xorShift (found [here](http://stackoverflow.com/a/1640399/2651079)) and the time is around 0.1 seconds, much better than 0.15 seconds with the Mersenne twister – minomic Feb 22 '16 at 14:54
  • I believe it; that's why I recommended a different PRNG in my answer :) – Daniel Langr Feb 22 '16 at 14:59
  • @DanielLangr I meant "you will not believe that I was trying right now, without seeing your suggestion about the xorShift" but I admit it was not clear... :-) – minomic Feb 22 '16 at 15:01
  • :) I see. Remember again that you should use a distinct generator for each thread even for Xorshift+, which is not what your code indicates (single `static` state), otherwise you may have a data race. (This data race might not matter in this special random case, especially when run on x86_64, since `mov` instructions are atomic there. But...). And beware of storing states in an array, such as `uint64_t states[NUM_THREADS][2];`, which would likely yield a lot of cache contention (depending on the size of a cache line); a padded-state sketch is shown after these comments. – Daniel Langr Feb 22 '16 at 15:56
  • @minomic Could you please remeasure with one Xorshift+ generator per thread (I suggest declaring them as thread-local variables)? With your static shared generator there must be terribly strong cache contention. I wonder how much time it will save (recall that for the MT generators this approach halved the running time). – Daniel Langr Feb 22 '16 at 16:28
  • @DanielLangr You're right: I had forgotten about that! Thanks. I will measure and come back – minomic Feb 22 '16 at 16:28
  • @DanielLangr I have edited the OP with more details and benchmarks with a different machine – minomic Feb 22 '16 at 18:30
  • Seemingly, the compiler optimized away the memory operations and kept the PRNG state in registers in your case, which would explain the identical running times. Thanks for the benchmark. – Daniel Langr Feb 23 '16 at 10:02
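
A minimal illustration of the padded-state idea raised in the comment above; the 64-byte cache-line size, the struct name, and the thread count are assumptions, not code from the question:

#include <cstddef>
#include <cstdint>

// If per-thread xorshift states are kept in one array, padding each entry to
// a full cache line keeps adjacent threads' states off the same line.
constexpr std::size_t kCacheLine = 64; // assumed cache-line size

struct alignas(kCacheLine) PaddedXorshiftState
{
    std::uint64_t s[2]; // xorshift128+ state
};

PaddedXorshiftState states[4]; // one padded entry per thread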

3 Answers


The generator std::mt19937_64 generator{rd()}; is shared between the threads. It has internal state that must be updated on every call, hence there is contention; more importantly, there is a data race. You should look to give each thread its own generator - you just need to make sure the generators produce separate sequences.

You possibly have a cache contention issue around std::vector<unsigned int> v: it is declared outside the threads and then written to on every iteration of the for loop in each thread. Let each thread have its own vector to fill and, once all the threads are done, collate their results into v; returning each partial result via a std::future is possibly the quickest way to do that (see the sketch below). The exact size of the contention depends on the cache line size and on the size of the vector being used (and how it is segmented).
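
A minimal sketch of that idea with std::async/std::future; the fillChunk helper, its distribution range, and the chunking scheme are assumptions for illustration, not code from the question:

#include <future>
#include <random>
#include <vector>

// Hypothetical helper: each task creates its own generator and returns its
// own vector, so nothing is shared while the random numbers are produced.
static std::vector<unsigned int> fillChunk(unsigned int count)
{
    std::random_device rd;
    std::mt19937_64 gen{rd()};
    std::uniform_int_distribution<unsigned int> dist{0, 100}; // example range
    std::vector<unsigned int> part;
    part.reserve(count);
    for (unsigned int i = 0; i < count; ++i)
        part.push_back(dist(gen));
    return part;
}

std::vector<unsigned int> uniformRandomVectorAsync(unsigned int total,
                                                   unsigned int numTasks)
{
    std::vector<std::future<std::vector<unsigned int>>> futures;
    futures.reserve(numTasks);

    const unsigned int chunk = total / numTasks;
    for (unsigned int i = 0; i < numTasks; ++i)
    {
        // The last task picks up the remainder.
        const unsigned int count = (i == numTasks - 1) ? total - i * chunk : chunk;
        futures.push_back(std::async(std::launch::async, fillChunk, count));
    }

    // Collate the per-task results once all tasks are done.
    std::vector<unsigned int> result;
    result.reserve(total);
    for (auto& f : futures)
    {
        std::vector<unsigned int> part = f.get();
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}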

In this case you fill a large number of elements (7734500) with a comparatively small number of threads (4), so that ratio would possibly keep the contention low.

W.r.t. the number of threads you could be using, you should consider binding NUM_THREADS to the hardware concurrency available on the target, i.e. std::thread::hardware_concurrency().
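
A minimal sketch of that; hardware_concurrency() may return 0, so the fallback value of 4 here is an assumption:

#include <thread>

// Query the target's concurrency at runtime; fall back to a fixed value
// (4 is an arbitrary choice for this sketch) if the implementation reports 0.
const unsigned int hc = std::thread::hardware_concurrency();
const unsigned int NUM_THREADS = (hc != 0) ? hc : 4;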

When dealing with this large a number of elements, you could also look to avoid unnecessary initialisations and moving of the results (albeit, given the int type, the moves are less noticeable here). The container itself is also something to be aware of: vector requires contiguous memory, so any additional elements added during a collation phase could result in memory allocation and copying.

The speed of the random number generator may also have an impact, other implementations and/or algorithms may impact the final execution times significantly enough to be considered.

As always with all performance based questions - the final solution requires measurement. Implement the possible solutions, measure over the target processors and environments and adapt until a suitable performance is found.

Niall
  • I don't think there is much cache contention: each thread touches only its contiguous part of `v` from some starting position. The contention occurs only at the boundaries and at very different moments. The same problem will occur while collating thread-local vectors, and many more memory operations would be executed with that solution. – Daniel Langr Feb 22 '16 at 10:52
  • A write would invalidate the cache line, hence a lot of this depends on the size of the cache line etc. Giving each thread its own vector to write to (and then collating at the end) avoids the contention, however much contention that may be. – Niall Feb 22 '16 at 10:57
  • Yes, that's my point: when each thread accesses only its contiguous part of `v`, the only shared cache lines are those at both ends of this part. Where else is any contention? – Daniel Langr Feb 22 '16 at 11:03
  • 1
    As Daniel Langr writes, in this example. cache contention happens only for the few elements at the ends of a part. This could be optimized by aligning parts to cache lines, but it is unlikely to have much of an effect. Also, using multiple vectors does *not* guarantee that the arrays are on separate cache lines, though large arrays are typically placed on page ranges of their own by the allocator. – Arne Vogel Feb 22 '16 at 11:56
  • And the very same contention will occur when the thread-local vectors are merged; there is absolutely no benefit to using local vectors IMO. Maybe the avoidance of `resize()` as proposed by David Haim, but I would prefer a solution based on a custom allocator. – Daniel Langr Feb 22 '16 at 12:29
  • As noted in the answer, the magnitude of the contention on the vector `v` is dependent on the size and the number of writers. Merging the vectors does not suffer the same contention, there is 1 reader from the local resultant vectors (possibly 4 given the `NUM_THREADS` in the OP), and 1 writer to the final result. Again, the wins/gains would be dependent on the columns of elements vs the threads in use. *As always with all performance based questions* - the final solution requires measurements. Implement the possible solutions, measure and adapt until a suitable performance is found. – Niall Feb 22 '16 at 12:42
  • Using a global vector runs `for ... v[i] = dist()` by each thread for distinct `i`. Merging local vectors into global ones _effectively_ runs `for ... v[i] = local_v[j]` by each thread. **What is the difference as for cache contention?** Or you propose sequential merging? I really doubt that sequential merging will be faster than the parallel one just because of few elements possibly sharing a cache line or two. Most modern architectures will not fully utilize memory bandwidth with a single thread only. – Daniel Langr Feb 22 '16 at 12:55
  • Letting each thread merge its intermediate results into a global result would be similar to simply writing to that global result to begin with. The merging would happen once all the threads are complete. – Niall Feb 22 '16 at 13:01
  • @Niall, the formal term is "data race". A race condition is a much more general thing, which is not always undefined behaviour, and not even always a bug. A data race is always UB. – Jonathan Wakely Feb 22 '16 at 14:25

The Mersenne Twister generator (std::mt19937_64) is not too fast. You might consider other generators such as Xorshift+. See, e.g., this question: What is performance-wise the best way to generate random bools? (the discussion there goes beyond just bools).

And you should get rid of the data race in your code. Use one generator per thread.
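
A minimal sketch of one way to do that, using a thread_local engine seeded from std::random_device; the function name and seeding policy are illustrative assumptions, not prescribed here:

#include <random>

// Each calling thread gets its own engine, constructed on first use in that
// thread. Seeding from std::random_device makes the sequences independent
// with high probability; a std::seed_seq per thread is an alternative.
unsigned int nextRandom(std::uniform_int_distribution<unsigned int>& dist)
{
    thread_local std::mt19937_64 engine{std::random_device{}()};
    return dist(engine);
}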

Daniel Langr
std::vector<unsigned int> v;
v.resize(rows*columns);

Unfortunately, std::vector::resize value-initializes primitives as well, so your program first writes zeros over the vector's memory and then overwrites those zeros with the random numbers.

Try std::vector::reserve + std::vector::push_back.
That means the threads can no longer share the vector without a lock, but you can give each one its own vector, use reserve + push_back, and then combine all the results into one bigger vector.

If that is not enough, and I hate to say it, use std::unique_ptr with malloc (with a custom deleter). Yes, this is C, yes, this is nasty, yes, we have new[], but malloc won't zero-initialize the memory (unlike std::vector::resize), so you can hand a segment of the memory to each thread and let it generate the random numbers in place. You also save combining the vectors into one final vector.
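
A rough sketch of that malloc-based approach; the fillSegment helper, the distribution range, and the chunking are assumptions for illustration, and malloc error handling is omitted:

#include <cstdlib>
#include <memory>
#include <random>
#include <thread>
#include <vector>

// Raw buffer obtained with malloc: no zero-initialization, freed automatically.
using Buffer = std::unique_ptr<unsigned int[], decltype(&std::free)>;

// Illustrative fill routine: each thread owns one contiguous segment.
static void fillSegment(unsigned int* begin, unsigned int* end)
{
    std::mt19937_64 gen{std::random_device{}()};
    std::uniform_int_distribution<unsigned int> dist{0, 100}; // example range
    for (unsigned int* p = begin; p != end; ++p)
        *p = dist(gen);
}

Buffer uniformRandomBuffer(std::size_t count, unsigned int numThreads)
{
    Buffer buf{static_cast<unsigned int*>(std::malloc(count * sizeof(unsigned int))),
               &std::free};

    std::vector<std::thread> threads;
    const std::size_t chunk = count / numThreads;
    for (unsigned int i = 0; i < numThreads; ++i)
    {
        unsigned int* begin = buf.get() + i * chunk;
        unsigned int* end   = (i == numThreads - 1) ? buf.get() + count
                                                    : begin + chunk;
        threads.emplace_back(fillSegment, begin, end);
    }
    for (auto& t : threads)
        t.join();

    return buf; // no final merge step needed
}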

David Haim
  • I also encountered this problem; with a vector of several billion elements, `resize()` took more than **20 seconds** due to value-initialization. In C++11 (after the N3346 post-standard correction), one can avoid that by using a custom allocator with an empty `construct()` member function (a sketch of such an allocator follows these comments). – Daniel Langr Feb 22 '16 at 12:24
  • @DanielLangr, or have a `vector<X>` where `X` is a simple wrapper around `unsigned int` which doesn't initialize the member in its default ctor. – Jonathan Wakely Feb 22 '16 at 14:24
  • Sure, I have used this empty-constructor solution in practice for elements being of class/struct types. It works as well. – Daniel Langr Feb 22 '16 at 14:49
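
For reference, a minimal sketch of the allocator idea mentioned in the comments above; the class name default_init_allocator and its exact shape are assumptions, and C++11 (post-N3346) semantics are assumed:

#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator whose construct() default-initializes instead of value-initializing,
// so vector::resize() does not zero trivially-constructible elements.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A
{
    using a_t = std::allocator_traits<A>;
public:
    using A::A;

    template <typename U>
    struct rebind
    {
        using other =
            default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    template <typename U>
    void construct(U* ptr)
        noexcept(std::is_nothrow_default_constructible<U>::value)
    {
        ::new (static_cast<void*>(ptr)) U; // default-init: no zeroing
    }

    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args)
    {
        a_t::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};

// Usage: resize() now leaves the unsigned ints uninitialized.
// std::vector<unsigned int, default_init_allocator<unsigned int>> v;
// v.resize(rows * columns);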