
I am trying to write a multi-threaded program to produce a vector of N*NumPerThread uniform random integers, where N is the return value of std::thread::hardware_concurrency() and NumPerThread is the number of random numbers I want each thread to generate.

I created a multi-threaded version:

#include <iostream>
#include <thread>
#include <vector>
#include <random>
#include <chrono>
#include <mutex>

using Clock = std::chrono::high_resolution_clock;

namespace Vars
{
    const unsigned int N = std::thread::hardware_concurrency(); //number of threads on device
    const unsigned int NumPerThread = 5e5; //number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread*N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    int sz = 0;
}

using namespace Vars;

void AddN(int start)
{
    static std::mutex mtx;
    std::lock_guard<std::mutex> lock(mtx);
    for (unsigned int i=start; i<start+NumPerThread; i++)
    {
        RandNums[i] = dis(gen);
        ++sz;
    }
}

int main()
{
    auto start_time = Clock::now();
    std::vector<std::thread> threads;
    threads.reserve(N);
    
    for (unsigned int i=0; i<N; i++)
    {
        threads.emplace_back(std::move(std::thread(AddN, i*NumPerThread)));
    }

    for (auto &i: threads)
    {
        i.join();
    }
        
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
    << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << sz << '\n';
}

and a single-threaded version:

#include <iostream>
#include <thread>
#include <vector>
#include <random>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

namespace Vars
{
    const unsigned int N = std::thread::hardware_concurrency(); //number of threads on device
    const unsigned int NumPerThread = 5e5; //number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread*N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    int sz = 0;
}

using namespace Vars;

void AddN()
{
    for (unsigned int i=0; i<NumPerThread*N; i++)
    {
        RandNums[i] = dis(gen);
        ++sz;
    }
}

int main()
{
    auto start_time = Clock::now();

    AddN();
    
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
    << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << sz << '\n';
}

The execution times are more or less the same, so I assume there is a problem with the multi-threaded version?

P.S. I looked at all of the other similar questions here, but I don't see how they directly apply to this task...

  • When each of your threads starts, the first thing it does is try to lock the mutex. Once it succeeds, it generates all the numbers, and only unlocks the mutex when it's done. This basically means that there is no real parallelization, the threads run in sequence, only one working at the same time. Hence it's no better than single threaded version, probably worse due to the overhead you added. – Dan Mašek Dec 03 '22 at 15:21
  • You don't need the mutex in the first example, because there is no data race. The mutex prevents running other threads while a single thread is running. The variable sz can be removed; it serves no purpose. – 273K Dec 03 '22 at 15:22
  • Try to allocate space for all numbers and then each thread can fill one partition of it (might be possible without locks) – KIIV Dec 03 '22 at 15:24
  • Remove the mutex. Also having multiple threads execute the initialization of "static std::mutex mtx" is potentially unsafe, depending on C++ version and compiler. – Sven Nilsson Dec 03 '22 at 15:24
  • @273K - The mutex would be required to protect `sz` and I think the `dis` and `gen` (I could be wrong about parallelism and the random number libraries). – Stephen Newell Dec 03 '22 at 15:24
  • @StephenNewell I would expect the same, although probably best to cross-check some good documentation. Related: https://stackoverflow.com/questions/40655814/is-mersenne-twister-thread-safe-for-cpp – Dan Mašek Dec 03 '22 at 15:26
  • Regarding "dis" and "gen", you may need one "gen" per thread. – Sven Nilsson Dec 03 '22 at 15:27
  • This doesn't address the question, but `namespace Vars { ... }` followed immediately by `using namespace Vars;` is pointless. Just skip the namespace and define those variables in global scope. – Pete Becker Dec 03 '22 at 16:44

3 Answers

4

Threading is not a magical salve you can rub onto any code to make it go faster. Like any tool, you have to use it correctly.

In particular, if you want performance out of threading, among the most important questions you need to ask is what data needs to be shared across threads. Your algorithm decided that the data which needs to be shared is the entire std::vector<int> result object. And since different threads cannot manipulate the object at the same time, each thread has to wait its turn to do the manipulation.

Your code is the equivalent of expecting 10 chefs to cook 10 meals in the same time as 1 chef, but you only provide them a single stove.

Threading works out best when nobody has to wait on anybody else to get any work done. Arrange your algorithms accordingly. For example, each thread could build its own array and return it, with the receiving code concatenating all of the arrays together.
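
For instance, here is a rough, untested sketch of that idea using std::async to hand each thread's vector back through a future (a std::promise or std::packaged_task driven by a raw std::thread would work just as well):

#include <cstddef>
#include <future>
#include <random>
#include <thread>
#include <vector>

// Each task builds and returns its own vector, so no shared state needs locking.
std::vector<int> MakeChunk(std::size_t count)
{
    std::mt19937 gen{std::random_device{}()};     // per-task engine
    std::uniform_int_distribution<> dis(1, 1000);
    std::vector<int> chunk(count);
    for (auto &v : chunk)
        v = dis(gen);
    return chunk;
}

int main()
{
    const unsigned int N = std::thread::hardware_concurrency();
    const std::size_t NumPerThread = 500'000;

    std::vector<std::future<std::vector<int>>> futures;
    for (unsigned int i = 0; i < N; ++i)
        futures.push_back(std::async(std::launch::async, MakeChunk, NumPerThread));

    std::vector<int> RandNums;
    RandNums.reserve(static_cast<std::size_t>(N) * NumPerThread);
    for (auto &f : futures)                        // concatenate the per-thread results
    {
        auto chunk = f.get();
        RandNums.insert(RandNums.end(), chunk.begin(), chunk.end());
    }
}

Because every task owns its chunk until the final concatenation, no mutex is needed; the only serial work is the insert at the end.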

Nicol Bolas
  • It's a nice idea, but I don't get how to implement that; how would you get the return value from a function executed by `std::thread`? – Edward Finkelstein Dec 03 '22 at 17:35
  • @EdwardFinkelstein: With a [`promise`](https://en.cppreference.com/w/cpp/thread/promise). Or its slightly easier-to-use cousin, the [`packaged_task`](https://en.cppreference.com/w/cpp/thread/packaged_task) (a minimal sketch follows these comments). – Nicol Bolas Dec 03 '22 at 17:44
  • Hi, so I posted an answer where I implemented what you said, is it ok now? – Edward Finkelstein Dec 03 '22 at 22:37
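
A minimal, untested sketch of the packaged_task route mentioned above, with a hypothetical makeChunk lambda standing in for the real per-thread work:

#include <cstddef>
#include <future>
#include <random>
#include <thread>
#include <vector>

int main()
{
    // Hypothetical worker: builds and returns one chunk of random numbers.
    auto makeChunk = [](std::size_t count) {
        std::mt19937 gen{std::random_device{}()};
        std::uniform_int_distribution<> dis(1, 1000);
        std::vector<int> chunk(count);
        for (auto &v : chunk)
            v = dis(gen);
        return chunk;
    };

    std::packaged_task<std::vector<int>(std::size_t)> task(makeChunk);
    std::future<std::vector<int>> result = task.get_future();

    std::thread worker(std::move(task), std::size_t{1000}); // the task must be moved into the thread
    std::vector<int> chunk = result.get();                  // blocks until the worker has produced the vector
    worker.join();
    // chunk now holds 1000 random integers generated on the worker thread
}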
0

You can do this without any mutex.

  • Create your vector
  • Use a mutex (though technically this probably isn't necessary) just to create an iterator pointing at v.begin() + itsThreadIndex*NumPerThread;
  • then each thread can freely increment that iterator and write to a part of the vector not touched by other threads.

Be sure each thread has its own copy of

   std::random_device rd;
   std::mt19937 gen(rd());
   std::uniform_int_distribution<> dis(1, 1000);

That should run much faster.

UNTESTED code - but this should make my above suggestion clearer:

#include <iostream>
#include <thread>
#include <vector>
#include <random>
#include <mutex>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

namespace SharedVars
{
    const unsigned int N = std::thread::hardware_concurrency(); //number of threads on device
    const unsigned int NumPerThread = 5e5; //number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread*N);
    std::mutex mtx;
}

void PerThread_AddN(int threadNumber)
{
    using namespace SharedVars;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);

    std::vector<int>::iterator from;
    std::vector<int>::iterator to;
    {
        std::lock_guard<std::mutex> lock(mtx);  // hold the lock only while accessing shared vector, not while accessing its contents
        from = RandNums.begin () + threadNumber*NumPerThread;
        to = from + NumPerThread;
    }
    for (auto i = from; i < to; ++i)
    {
        *i = dis(gen);
    }
}

int main()
{
    using namespace SharedVars;

    auto start_time = Clock::now();
    std::vector<std::thread> threads;
    threads.reserve(N);
    
    for (unsigned int i=0; i<N; i++)
    {
        threads.emplace_back(std::move(std::thread(PerThread_AddN, i)));
    }
    for (auto &i: threads)
    {
        i.join();
    }
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
    << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << sz << '\n';
}
lewis
  • A good idea to pre-generate `rd()` values and make sure there are no duplicates. A single duplicate will result in the same values and really mess up the randomness. – doug Dec 03 '22 at 15:54
  • I don't understand what you mean... I tried `for (auto i = RandNums.begin()+start; i < RandNums.begin()+start+NumPerThread; i++){*i = dis(gen); ++sz; }` but it made it worse? – Edward Finkelstein Dec 03 '22 at 17:30
  • I edited my answer to include more details. I've not tested the code. But my more detailed answer shows how to avoid holding the lock while the N threads are running. – lewis Dec 03 '22 at 21:04
  • I don't know how, but this is like 50 times faster... But I'm only using 8 threads! – Edward Finkelstein Dec 03 '22 at 22:45
  • I haven't tried compiling/running the code I sent, nor what you wrote. So I'm not sure. But the MAIN reasons I EXPECTED a speedup like that are that I completely avoid locks during the core of the algorithm. The way I wrote it - you do a few locks to 'setup the problem space'. And then run without locks and just doing pointer arithmetic/minimal work to record the results. – lewis Dec 06 '22 at 02:28
0

Nicol Bolas was right on the money. I reimplemented it using std::packaged_task, and it's around 4-5 times faster now.

#include <iostream>
#include <vector>
#include <random>
#include <future>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

const unsigned int N = std::thread::hardware_concurrency(); //number of threads on device
const unsigned int NumPerThread = 5e5; //number of random numbers to generate per thread

std::vector<int> x(NumPerThread);

std::vector<int> createVec()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    
    for (unsigned int i = 0; i < NumPerThread; i++)
    {
        x[i] = dis(gen);
    }
    return x;
}

int main()
{
    auto start_time = Clock::now();

    std::vector<int> RandNums;
    RandNums.reserve(N*NumPerThread);

    std::vector<std::future<std::vector<int>>> results;
    results.reserve(N);
    
    std::vector<int> crap;
    crap.reserve(NumPerThread);

    for (unsigned int i=0; i<N; i++)
    {
        std::packaged_task<std::vector<int>()> temp(createVec);
        results.push_back(temp.get_future()); // reserve() leaves the vector empty, so indexing results[i] would be out of bounds
        temp();                               // note: this invokes the task on the current thread, not on a separate one
        crap = std::move(results.back().get());
        RandNums.insert(RandNums.begin()+(0*NumPerThread),crap.begin(),crap.end());
    }
    std::cout << RandNums.size() << '\n';
    auto end_time = Clock::now();
    std::cout << "Time difference = "
    << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
}

But is there a way to make this one better? lewis's version is way faster than this, so there must be something else missing...

  • There are MANY reasons this code is much slower than what I wrote. One is that this creates many temporary vector objects, and copies them around all over the place. In STL, copying vectors is QUITE SLOW (I wrote a competing library - Stroika - where copying the equivalent of vectors is much faster - using copy-on-write - https://github.com/SophistSolutions/Stroika/blob/v2.1-Release/Library/Sources/Stroika/Foundation/Containers/Sequence.h). – lewis Dec 06 '22 at 02:32
  • Also - futures use a lot of locks. For more complex problems, this can be a better strategy than the one I presented. But one key is the bit about copying the vectors around, and merging the work. I laid out all the memory, and had the 'futures/tasks' just scribble directly where the answer should go. – lewis Dec 06 '22 at 02:35
  • I am trying to use std::move a lot but it doesn't help. I then made the vector x global and that helped a bit, but yours is still ~6 times faster... – Edward Finkelstein Dec 06 '22 at 13:15
  • std::move doesn't work the way you think it does. But that's not the only reason it's not helping. Think big picture and forget the code for a moment. I allocated ONE BIG ARRAY and handed iterators marking the bounds of that array to the threads. They then could operate without any allocations or interference with one another. You - even ignoring the locking - are creating N vectors (so hitting the memory allocator N times) - and filling those in - and then COPYING all that data back into a single array (and deleting the N temporaries). That is more work. – lewis Dec 06 '22 at 15:08
  • If you like the speed of my approach, but somehow find it inelegant, you can use C++20 ranges to make it a bit more elegant - and a bit closer to your future-based design. Maybe even use ranges with futures (won't be AS fast as mine due to locking, but maybe a lot closer); a rough sketch of that combination follows below. – lewis Dec 06 '22 at 15:10
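
A rough, untested sketch of that combination: keep the single preallocated buffer from lewis's answer, but drive the work with std::async futures so each task writes only into its own slice (the C++20 ranges polish is omitted here):

#include <cstddef>
#include <future>
#include <random>
#include <thread>
#include <vector>

int main()
{
    const unsigned int N = std::thread::hardware_concurrency();
    const std::size_t NumPerThread = 500'000;
    std::vector<int> RandNums(static_cast<std::size_t>(N) * NumPerThread); // one big buffer, allocated once

    // Each task fills its own disjoint slice; no locks and no merge copies are needed.
    auto fillSlice = [&](std::size_t threadIndex) {
        std::mt19937 gen{std::random_device{}()};            // per-task engine
        std::uniform_int_distribution<> dis(1, 1000);
        auto first = RandNums.begin() + threadIndex * NumPerThread;
        for (auto it = first; it != first + NumPerThread; ++it)
            *it = dis(gen);
    };

    std::vector<std::future<void>> tasks;
    for (unsigned int i = 0; i < N; ++i)
        tasks.push_back(std::async(std::launch::async, fillSlice, std::size_t{i}));
    for (auto &t : tasks)
        t.get();                                              // wait for every slice to be written
}

Each task owns a disjoint region of RandNums, so no mutex is needed and nothing is copied or merged at the end.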