So I had this question: simple-division-of-labour-over-threads-is-not-reducing-the-time-taken. I thought I had it sorted, but going back to re-visit this work, I am not getting crazy slow down like I was before (due to mutex within rand()
) but nor am I getting any improvement in total time taken.
In the code I am splitting up a task of x iteration of work over y threads.
So if I want to do 100'000'000 calculations, in one thread that might take ~350ms, then my hope its that in 2 threads (doing 50'000'000 calcs each) that would take ~175ms and in the three threads ~115ms and so on...
I know using threads won't perfectly split the work due to thread overheads and such. But i want hoping for some performance gain at least.
My slightly updated code is here:
Reults
1thread:
starting thread: 1 workload: 100000000
thread: 1 finished after: 303ms val: 7.02066
==========================
thread overall_total_time time: 304ms
3 threads
starting thread: 1 workload: 33333333
starting thread: 3 workload: 33333333
starting thread: 2 workload: 33333333
thread: 3 finished after: 363ms val: 6.61467
thread: 1 finished after: 368ms val: 6.61467
thread: 2 finished after: 365ms val: 6.61467
==========================
thread overall_total_time time: 368ms
You can see the 3 threads actually takes slightly longer then 1 thread, but each thread is only doing 1/3 of the work iterations. I see similar lack of performance gain on my PC at home which has 8 CPU cores.
Its not like threading overhead should take more then a few milliseconds (IMO) so I can't see what is going on here. I don't believe there is any resource sharing conflicts because this code is quite simple and uses no external outputs/inputs (other then RAM).
Code For Reference
in godbolt: https://godbolt.org/z/bGWdxE
In main() you can tweak the number of threads and amount of work (loop iterations).
#include <iostream>
#include <vector>
#include <thread>
#include <math.h>
void thread_func(uint32_t interations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << interations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of interations
double val{0};
for (auto i = 1u; i <= interations; i++)
{
val += i / (2.2 * i) / (1.23 * i); // some work
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << " val: " << val << std::endl;
}
int main()
{
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Update
I think I have narrowed down my issue: On my 64Bit VM I see:
- Compiling for 32-bit no optimisation: more threads = runs slower!
- Compiling for 32-bit with optimisation: more threads = runs a bit faster
- Compiling for 64-bit no optimisation: more threads = runs faster (as expected)
- Compiling for 64-bit with optimisation: more threads = same as with out opt, except everything takes less time in general.
So my issue might just be from running 32-bit code on a 64-bit VM. But I don't really understand why adding threads does not work very well if my executable is 32-bit running on a 64-bit architecture...