
So I had this question: simple-division-of-labour-over-threads-is-not-reducing-the-time-taken. I thought I had it sorted, but coming back to revisit this work I am no longer getting the crazy slowdown I saw before (due to the mutex within rand()), but nor am I getting any improvement in total time taken.

In the code I am splitting up a task of x iterations of work over y threads.

So if I want to do 100'000'000 calculations and one thread takes ~350ms, then my hope is that 2 threads (doing 50'000'000 calcs each) would take ~175ms, three threads ~115ms, and so on...

I know using threads won't perfectly split the work due to thread overheads and such, but I was hoping for some performance gain at least.

My slightly updated code is here:

Results

1 thread:

starting thread: 1 workload: 100000000
thread: 1 finished after: 303ms val: 7.02066
==========================
thread overall_total_time time: 304ms

3 threads:

starting thread: 1 workload: 33333333
starting thread: 3 workload: 33333333
starting thread: 2 workload: 33333333
thread: 3 finished after: 363ms val: 6.61467
thread: 1 finished after: 368ms val: 6.61467
thread: 2 finished after: 365ms val: 6.61467
==========================
thread overall_total_time time: 368ms

You can see the 3 threads actually take slightly longer than 1 thread, even though each thread is only doing 1/3 of the work iterations. I see a similar lack of performance gain on my PC at home, which has 8 CPU cores.

It's not like threading overhead should take more than a few milliseconds (IMO), so I can't see what is going on here. I don't believe there are any resource-sharing conflicts because this code is quite simple and uses no external inputs/outputs (other than RAM).

Code For Reference

in godbolt: https://godbolt.org/z/bGWdxE

In main() you can tweak the number of threads and amount of work (loop iterations).

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>  // needed for the timing below
#include <cstdint> // uint32_t

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();
    // do some work for the required number of iterations
    double val{0};
    for (auto i = 1u; i <= iterations; i++)
    {
        val += i / (2.2 * i) / (1.23 * i); // some work
    }
    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;
    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << " val: " << val << std::endl;
}

int main()
{
    uint32_t num_threads = 3; // Max 3 in godbolt  
    uint32_t total_work = 100'000'000;

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back([=]{ thread_func(total_work / num_threads, thread_id); });
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}
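
As a side note, total_work / num_threads truncates: with 3 threads each gets 33'333'333 iterations, so only 99'999'999 of the 100'000'000 are actually performed. A minimal sketch of one way to spread the remainder over the first few threads (drop-in replacement for the loop in main(); the variable names are mine, not from the original):

// Give the first (total_work % num_threads) threads one extra iteration
// so the full workload is covered despite integer truncation.
uint32_t base = total_work / num_threads;
uint32_t remainder = total_work % num_threads;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
    uint32_t workload = base + (thread_id <= remainder ? 1u : 0u);
    task_list.emplace_back([=]{ thread_func(workload, thread_id); });
}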

Update

I think I have narrowed down my issue: on my 64-bit VM I see:

  • Compiling for 32-bit no optimisation: more threads = runs slower!
  • Compiling for 32-bit with optimisation: more threads = runs a bit faster
  • Compiling for 64-bit no optimisation: more threads = runs faster (as expected)
  • Compiling for 64-bit with optimisation: more threads = same scaling as without optimisation, except everything takes less time in general.

So my issue might just be from running 32-bit code on a 64-bit VM. But I don't really understand why adding threads helps so little when my executable is 32-bit running on a 64-bit architecture...
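
For reference, assuming a GCC toolchain on a Linux host (my assumption, the question does not name the compiler), the four configurations above would correspond roughly to:

g++ -m32 -O0 -pthread main.cpp -o test_32        # 32-bit, no optimisation
g++ -m32 -O2 -pthread main.cpp -o test_32_opt    # 32-bit, optimised
g++ -m64 -O0 -pthread main.cpp -o test_64        # 64-bit, no optimisation
g++ -m64 -O2 -pthread main.cpp -o test_64_opt    # 64-bit, optimised

(-m32 additionally requires the 32-bit multilib support packages on a 64-bit Linux host.)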

code_fodder
  • thread creation/joining is expensive, and testing once isn't a reliable measurement anyway; you have to run this code at least a thousand times and calculate the averages (see the sketch after these comments). – Raildex Jan 06 '21 at 08:01
  • Cannot reproduce. I've got 2 cores and I see ~2x speed up between putting 1 and 2 in your code. I wonder if your system somehow defaults to binding the entire process to one core, so that even if it makes multiple threads all of them compete for time and thus slow each other down. I would at least not be surprised if Godbolt does that. – HTNW Jan 06 '21 at 08:22
  • You need to enable compiler optimisations, performance testing on un-optimised code is fairly meaningless – Alan Birtles Jan 06 '21 at 08:24
  • @AlanBirtles Not in this case... either the work is split and the time goes down or its not. Enabling optimizations will do nothing except scale all the times by some factor, (and on my system they actually appear to do nothing, period). – HTNW Jan 06 '21 at 08:26
  • @HTNW not necessarily, there could be stuff in `std::thread` that is hideously slow without optimisations – Alan Birtles Jan 06 '21 at 08:28
  • @AlanBirtles Whatever that code is doing, the loop in our code is surely (and, for my system, *definitely*) more hideous. – HTNW Jan 06 '21 at 08:35
  • @HTNW I have written this code in a different way, where I have pre-existing threads and I signal the threads to start with a cond-var, and I get similar results. I get that you might need to run this many hundred/thousand times, but what I am trying to simulate is a singular slow task that I want to speed up with threads - so a one-shot speed-up is what I am trying to achieve and expect. But it's interesting that you are seeing a speedup! - I am testing on my VM at home, so I might re-test on a non-VM. But running `top` I see CPU usage at 300% (or more for more threads) – code_fodder Jan 06 '21 at 09:24
  • @HTNW also what system / setup have you got? - is it windows / linux? - wondering if the OS has some implication – code_fodder Jan 06 '21 at 09:26
  • It works well on my linux VM. Nothing displayed on Windows (cygwin). – Damien Jan 06 '21 at 14:22
  • I added an update - I was compiling for 32-bit on my 64-bit VM. Once I started compiling for 64-bit everything worked as expected. I get why 32-bit would run slower, but I am not really sure why threading does not really help - `top` shows me that if I use 6 threads all 6 cores are maxing out at 100%, but the total time taken is the same as if I run the test with 1 thread and 1 core running at max. I guess there is some horrible 32-bit <--> 64-bit translation going on... but I don't understand what is happening exactly – code_fodder Jan 07 '21 at 08:12
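
Following Raildex's suggestion above, here is a minimal self-contained sketch of repeating the measurement and averaging. The workload is a simplified single-threaded stand-in (i / (2.2 * i) / (1.23 * i) reduces to 1 / (2.706 * i)); substitute the real threaded run when benchmarking for real:

#include <chrono>
#include <iostream>

// Toy stand-in for one timed run of the workload.
static long long run_once_ms()
{
    auto start = std::chrono::high_resolution_clock::now();
    volatile double val = 0; // volatile: stop the optimiser deleting the loop
    for (auto i = 1u; i <= 10'000'000u; i++)
        val = val + 1.0 / (2.706 * i);
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

int main()
{
    // Average many runs to smooth out scheduler and VM noise.
    constexpr int runs = 100;
    long long total_ms = 0;
    for (int i = 0; i < runs; i++)
        total_ms += run_once_ms();
    std::cout << "average per run: " << (total_ms / runs) << "ms\n";
}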

1 Answer


There are many possible reasons that could explain the observed results, so I do not think that anyone could give you a definitive answer. Also, the majority of reasons have to do with peculiarities of the hardware architecture, so different answers might be right or wrong on different machines.

As already mentioned in the comments, it could very well be that there is something wrong with thread allocation, so you are not really enjoying any benefit from using multiple threads. Godbolt.org is a cloud service, so it is most probably very heavily virtualized, meaning that your threads are competing against who knows how many hundreds of other threads, so I would assign a zero amount of trust to any results from running on godbolt.
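
As a quick sanity check before trusting any scaling numbers, you can at least ask the standard library how many hardware threads the environment claims to offer. This is only a hint; it does not prove the threads actually get separate cores:

#include <iostream>
#include <thread>

int main()
{
    // Returns 0 if the value cannot be determined; on a heavily
    // virtualized host it may be much lower than you expect.
    std::cout << "hardware threads: "
              << std::thread::hardware_concurrency() << '\n';
}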

A possible reason for the bad performance of unoptimized 32-bit code on a 64-bit VM could be that the unoptimized 32-bit code is not making efficient use of registers, so it becomes memory-bound. The code looks like it would all nicely fit within the CPU cache, but even the cache is considerably slower than direct register access, and the difference is more pronounced in a multi-threaded scenario where multiple threads are competing for access to the cache.
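
Roughly speaking, without optimisation the loop counter and accumulator live on the stack rather than in registers, so every iteration loads and stores memory. An illustrative approximation of that effect in source form (not actual compiler output, and iterations refers to the parameter in the question's thread_func):

// Illustrative approximation of -O0 codegen: volatile forces every use
// of i and val to go through memory instead of staying in registers.
volatile double val = 0;
volatile uint32_t i;
for (i = 1u; i <= iterations; i++)
{
    val = val + i / (2.2 * i) / (1.23 * i);
}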

A possible reason for the still not stellar performance of optimized 32-bit code on a 64-bit VM could be that the CPU is optimized for 64-bit use, so instructions are not efficiently pipelined when running in 32-bit mode, or that the arithmetic unit of the CPU is not being used efficiently. It could be that these divisions in your code make all threads contend for the divider circuitry, of which the CPU may have only one, or of which the CPU may have only one when running in 32-bit mode. That would mean that most threads do nothing but wait for the divider to become available.
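
One cheap way to probe this hypothesis, not a definitive diagnosis: swap the loop body in thread_func for one that uses only multiplications and re-run the scaling test. The numeric result will differ, but if the multi-threaded speed-up suddenly appears, contention for the FP divider becomes a plausible culprit:

// Hypothetical experiment: same loop shape as the original work loop,
// but multiplications only, so the FP divider is never touched.
double val{0};
for (auto i = 1u; i <= iterations; i++)
{
    val += i * 2.2 * 1.23; // no division issued
}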

Note that these situations where a thread is being slowed down due to contention for CPU circuitry are very different from situations where a thread is being slowed down due to waiting for some device to respond. When a device is busy, the thread that waits for it is placed by the scheduler in I/O wait mode, so it is not consuming any CPU. When you have contention for CPU circuitry, the stalling happens inside the CPU; there is no thread context switch, so the thread slows down while it appears as if it is running full speed (consuming a full CPU core.) This may be an explanation for your CPU utilization observations.

Mike Nakis
  • Thanks Mike - actually after a bit more testing, I noticed I don't always get great results on x64 debug builds compared to x64 release builds (in that the % gain is not always great for debug) - but your explanation does appear to match the sort of things I am seeing. Definitely happy to disregard the godbolt output :) – code_fodder Jan 07 '21 at 12:08