
I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.

What I have attempted to do is:

  • do a task X times in one thread - check the time taken.
  • do a task X / Y times in Y threads - check the time taken.

So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:

  • 2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
  • 3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds

and so on until I reach some threading limit (number of cores or whatever).

So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16GB), doing nothing else at the time.

  • 1 thread took: ~6.5s
  • 2 threads took: ~6.7s
  • 3 threads took: ~13.3s
  • 8 threads took: ~16.2s

So clearly something is not right here!

I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr

Finally here is the code for reference:

#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

#define randf() (((double) rand()) / ((double) RAND_MAX))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();
    // do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }
    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;
    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to Godbolt server load (?)
    // 1 Threads takes: ~8s
    // 2 Threads takes: ~8s
    // 3 Threads takes: ~8s
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back([=](){ thread_func(total_work / num_threads, thread_id); });
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}

Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.

I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:

  • CPU bound (needs lots of CPU resources): I have 8 cores
  • Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
  • I/O bound (Network and/or hard drive resources): No network traffic involved
  • There is no sleeping/mutexing going on here (like there is in my real project)

Questions are:

  • Why might this be happening?
  • What am I doing wrong?
  • How can I improve this?
code_fodder
  • `rand` is not thread safe. – 1201ProgramAlarm Dec 10 '20 at 21:21
  • If I had to guess, your `rand` might be implemented with some form of exclusive lock. – François Andrieux Dec 10 '20 at 21:25
  • I suspect that your `rand` is locking a global lock, then generating the number, then unlocking. This would mean that increasing threads does not make `rand` any faster. – Mooing Duck Dec 10 '20 at 21:25
  • Ergo, make your own pseudo random generator and create an instance per thread – JHBonarius Dec 10 '20 at 21:27
  • And thread creation is part of the timing process. Timing a bunch of cout statements as well. – sweenish Dec 10 '20 at 21:29
  • @1201ProgramAlarm oh... bugger - yes, took that out and it works as expected... I didn't even suspect little old `rand()` - thanks! – code_fodder Dec 10 '20 at 21:29
  • Reminds me of an old question I had, the more threads the slower it gets. [Reading multiple files slower than reading sequentially](https://stackoverflow.com/questions/42620323/why-is-reading-multiple-files-at-the-same-time-slower-than-reading-sequentially) though it has nothing to do with your case. – Tony Tannous Dec 10 '20 at 21:29
  • @FrançoisAndrieux also this might be the case sir - thanks – code_fodder Dec 10 '20 at 21:30
  • @1201ProgramAlarm also please add as an answer or such and I can mark this question as solved :) – code_fodder Dec 10 '20 at 21:31
  • @code_fodder Consider creating a unique generator for each thread. See https://en.cppreference.com/w/cpp/numeric/random for modern random number generation features. These should be used over `rand()`. – François Andrieux Dec 10 '20 at 21:32
  • @TonyTannous unrelated I guess - but sort of similar in that it appears to have a bottleneck (that I did not expect or realise) in rand() - I'll have a read of your old question though anyway as I swot up on this :) – code_fodder Dec 10 '20 at 21:32
  • @code_fodder Note that your code has no visible behavior except for timing. The actual work you do in the loop has no observable impact. So smarter compilers might be able to optimize your benchmark in ways you didn't expect, up to and including eliminating the simulated work loop. – François Andrieux Dec 10 '20 at 21:33
  • @JHBonarius actually I don't really need the rand part - I just added it as some extra "work" to do in my thread - thanks :) – code_fodder Dec 10 '20 at 21:33
  • @MooingDuck yes you are also right thanks! – code_fodder Dec 10 '20 at 21:36
  • @FrançoisAndrieux I noticed that on my VM in release! - so I figured that my "work" was "nonsense" as you suggest - thanks - I am bad at making up code examples! – code_fodder Dec 10 '20 at 21:38
  • @code_fodder Don't worry, writing benchmarks is especially difficult. This is why it is more useful to profile real code instead of minimized examples. – François Andrieux Dec 10 '20 at 21:40
  • @FrançoisAndrieux thanks for the advice! - I guess I just needed to see it working - I think bench mark might be the next step : ) – code_fodder Dec 10 '20 at 21:43
  • "`(void) calc;`" Use `volatile auto result = calc;`. Volatile is very often useful and usually disregarded. You can be censored on this Website just for using it to solve a problem! – curiousguy Dec 10 '20 at 22:58

2 Answers


The `rand` function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe with a lock or mutex, so when multiple threads try to generate a random number at the same time, they take turns. Since your loop is mostly just the call to `rand`, performance suffers with multiple threads.

You can use the facilities of the `<random>` header and have each thread use its own engine to generate the random numbers.

1201ProgramAlarm
  • So this solved my question here - I am thinking I may have something similar in my real project - i.e. I use lots of maths functions, I guess I will need to check each one for thread-safety mutex locking :o (or maybe just profile it) – code_fodder Dec 10 '20 at 21:35
  • @code_fodder It would be shocking for math functions to have such a lock, they should be stateless. `rand()` is stateful (the result changes with every call) so it can develop a race condition. – François Andrieux Dec 10 '20 at 21:37

Never mind that rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.

What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.

No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.

It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with its own local cache, were doing.

Solomon Slow