
I'm trying to optimize my code using multithreading, and not only is the program not twice as fast as it should be on this dual-core computer, it is MUCH SLOWER. I just want to know whether I'm doing something wrong, or whether it's normal that multithreading doesn't help in this case. I made this recreation of how I used multithreading, and on my computer the parallel version takes 4 times as long as the serial version:

#include <iostream>
#include <random>
#include <thread>
#include <chrono>
#include <numeric> // needed for std::accumulate

using namespace std;

default_random_engine ran;
inline bool get(){
    return ran() % 3;
}

void normal_serie(unsigned repetitions, unsigned &result){
    for (unsigned i = 0; i < repetitions; ++i)
        result += get();
}

unsigned parallel_series(unsigned repetitions){
    const unsigned hardware_threads = std::thread::hardware_concurrency();
    cout << "Threads in this computer: " << hardware_threads << endl;
    const unsigned threads_number = (hardware_threads != 0) ? hardware_threads : 2;
    const unsigned its_per_thread = repetitions / threads_number;
    unsigned *results = new unsigned[threads_number]();
    std::thread *threads = new std::thread[threads_number - 1];

    for (unsigned i = 0; i < threads_number - 1; ++i)
        threads[i] = std::thread(normal_serie, its_per_thread, std::ref(results[i]));
    normal_serie(its_per_thread, results[threads_number - 1]);
    for (unsigned i = 0; i < threads_number - 1; ++i)
        threads[i].join();

    auto result = std::accumulate(results, results + threads_number, 0);
    delete[] results;
    delete[] threads;
    return result;
}

int main()
{
    constexpr unsigned repetitions = 100000000;
    auto to = std::chrono::high_resolution_clock::now();
    cout << parallel_series(repetitions) << endl;
    auto tf = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
    cout << "Parallel duration: " << duration << "ms" << endl;

    to = std::chrono::high_resolution_clock::now();
    unsigned r = 0;
    normal_serie(repetitions, r);
    cout << r << endl;
    tf = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
    cout << "Normal duration: " << duration << "ms" << endl;
    return 0;
}

Things that I already know, but didn't include to keep this code shorter:

I should set a max_iterations_per_thread, because you don't want to do only 10 iterations per thread; but in this case we are doing 100 million iterations, so that is not going to happen. The number of iterations must be divisible by the number of threads, otherwise the code will not do the full amount of work.

This is the output that i get in my computer:

Threads in this computer: 2
66665160
Parallel duration: 4545ms
66664432
Normal duration: 1019ms

(Partially solved by making these changes:)

inline bool get(default_random_engine &ran){
    return ran() % 3;
}

void normal_serie(unsigned repetitions, unsigned &result){
    default_random_engine eng;
    unsigned saver_result = 0;
    for (unsigned i = 0; i < repetitions; ++i)
        saver_result += get(eng);
    result += saver_result;
}
  • Your `parallel_series` function does a *lot* more than just doing the calculations. Memory allocations, and worse *output*, take a lot of time. – Some programmer dude Apr 03 '20 at 06:48
  • @Someprogrammerdude He will only *create* one thread. But he already had one thread. One plus one is two. – David Schwartz Apr 03 '20 at 06:49
  • Offtopic but why `T* name = new T[size]` and then delete it instead of `std::vector`? – fas Apr 03 '20 at 06:59
  • Consider [5 Big Fat Reasons Why Mutexes Suck Big Time](https://accu.org/var/uploads/journals/Overload149.pdf) (2nd Article). Always worth keeping in mind. – David C. Rankin Apr 03 '20 at 07:03
  • Are you using optimised code? – Alan Birtles Apr 03 '20 at 07:05
  • @DavidC.Rankin A significant part of that article is based on a misunderstanding of what mutexes do. Mutexes deschedule contending threads so that threads that do not contend can run in parallel without contention. The whole point is to allow one thread context switch to switch to a thread that can very cheaply acquire uncontended mutexes because the threads that would slow it down aren't running. – David Schwartz Apr 03 '20 at 07:09
  • Please don't edit answers into your question – Alan Birtles Apr 03 '20 at 07:13
  • I've just run this on an 8 core machine, built by VS2017, release, default optimizations, 64 bit. It took over 2 times longer on 8 threads. My best bet - I tend to agree with @DavidSchwartz, the random generator is either accessing its state with locks, or, it may happen to be stateless, but intentionally does something that takes a set amount of time, on purpose, such as collecting a set amount of hardware noise from some source. I know nothing about its implementation, but I'd bet on it being the bottleneck one way or another. – Boris Lipschitz Apr 03 '20 at 07:13
  • @AlanBirtles all optimisations enabled, in GCC MinGW – Alexander Porovorovich Apr 03 '20 at 07:14
  • @BorisLipschitz did you test the new version? Now the `get` function takes a `default_random_engine&` and works better; I just want to know if there's any way to do even better. (Thanks for using your 8-core PC for this) – Alexander Porovorovich Apr 03 '20 at 07:16
  • Sorry for my misuse of the platform; it wasn't edited, but it is now – Alexander Porovorovich Apr 03 '20 at 07:19
  • @AlexanderPorovorovich, yes, I did, on my own, before noticing it mentioned, but I did pretty much the same thing: gave each thread its own default_random_engine. The 8 threads became about 3 times faster in this case, but still nowhere near 8 times faster. – Boris Lipschitz Apr 03 '20 at 07:21
  • And BTW, just to make sure the threading code at least /feels/ ok, I've run the threads without the random, with a short sleep inside (dropping the number of iterations to something less insane, obviously), and it works fine with sleeps (about 8 times faster in parallel). One way or another, it's fighting over something related to the random engine, in addition to an operation itself being relatively short, likely comparable to the CPU load of the context switching itself. – Boris Lipschitz Apr 03 '20 at 07:24
  • Could you give me the code? – Alexander Porovorovich Apr 03 '20 at 07:27
  • Please stop editing the answers into your question, if you want to summarise everyone's answers you can create your own answer but editing the answers into the question just makes the answers make no sense – Alan Birtles Apr 03 '20 at 07:30
  • With sleeps? `#include <windows.h>` to have the sleep, `return 0` instead of the random, `Sleep(10)` before that return, drop total iterations to 1000, that's all. – Boris Lipschitz Apr 03 '20 at 07:31
  • @AlanBirtles I'm not doing it, what are you talking about? – Alexander Porovorovich Apr 03 '20 at 07:32
  • @BorisLipschitz Okay, should help – Alexander Porovorovich Apr 03 '20 at 07:33
  • You're changing your code and results to include the solutions suggested in the answers – Alan Birtles Apr 03 '20 at 07:33
  • @DavidSchwartz thank you for the clarification. I didn't so much take that the workings or function of mutexes were the main problem in the article, but more that over use resulted in inherently unproveable code due to the varying or indeterminate control paths they produce leading to code that could not be fully validated. I guess I should have qualified the intent. – David C. Rankin Apr 03 '20 at 07:33
  • Well, interesting question, but i think we already have the general consensus here. Random generator fighting over it's static resources, locking the threads down. – Boris Lipschitz Apr 03 '20 at 07:34
  • @AlanBirtles well, I've changed it to be like this, sorry, I'm new on the platform – Alexander Porovorovich Apr 03 '20 at 07:37

2 Answers


All your threads are tripping over each other fighting for access to `ran`, which can only perform one operation at a time because it only has one state, and each operation advances its state. There is no point in running operations in parallel if the vast majority of each operation involves a choke point that cannot support any concurrency.

David Schwartz
  • That's right. I remember that there is a reserved word (that I cannot find on Google) like `static` but that creates one for every thread; what is it, if it exists? – Alexander Porovorovich Apr 03 '20 at 06:54
  • @AlexanderPorovorovich Just have each thread create its own local one in `normal_serie`. (And the moral is that if you care about performance, you *must* measure and analyze.) – David Schwartz Apr 03 '20 at 06:56
  • Hello, you are right, I just did it and I get better results! (code updated above) – Alexander Porovorovich Apr 03 '20 at 07:03
  • I don't see any evidence of random engines being thread safe https://stackoverflow.com/questions/8813592/c11-thread-safety-of-random-number-generators, calling them from multiple threads is undefined behaviour – Alan Birtles Apr 03 '20 at 07:12
  • I guess they are, because the code never explodes nor gives a wrong answer; BUT of course it's obvious that I have to change that anyway – Alexander Porovorovich Apr 03 '20 at 07:22

All elements of `results` are likely to share a cache line, which means there is lots of inter-core communication going on.

Try modifying `normal_serie` to accumulate into a local variable and only write it to `results` at the end.

Botje