
I need to solve a large problem on large graph instances, and to do so I divide the input space between threads so that each thread runs the same function independently on its own set of inputs. When I time the program to understand the scalability of my software, I notice that once I go past 4 threads, increasing the number of threads increases the running time. I wrote a really small example to see why this happens; here it is:

#include <algorithm>
#include <random>
#include <thread>
#include <iostream>
#include <chrono>
#include <vector> // needed for std::vector


template<typename T>
inline double getMs(T start, T end) {
    // Despite the name, this returns elapsed time in seconds
    // (whole milliseconds divided by 1000).
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count()) /
        1000;
}

int main() {

    
    std::random_device rd;
    std::mt19937 g(rd());


    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";


    for (size_t np = 2; np < 17; np++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);

        int number_stops = 50; // memory 39420
        int number_transfers = 1; // memory
        int number_structures = 1; // memory
        int number_iterations = 1000000; // time

        auto dimension = number_stops * (number_transfers + 1) * number_structures;

        auto paraTask = [&]() {
            for (int b = 0; b < number_iterations; b++) {
                //std::srand(unsigned(std::time(nullptr)));
            std::vector<int> v(dimension, 1586); // allocate and fill the vector each iteration
            //std::generate(v.begin(), v.end(), std::rand);
            v.clear(); // size becomes 0; the memory itself is freed by v's destructor
            }
        };

    for (size_t i = 0; i < np; i++) {
        threads[i] = std::thread(paraTask);
    }
        // Join the threads
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::high_resolution_clock::now());

        printf("parallel completed: %.3f sec.\n",
            elapsed);
    }

    return 0;
}

A brief description: in order to emulate the actual software I'm working on, I use the following variables:

int number_stops = 50; // memory 39420
int number_transfers = 1; // memory
int number_structures = 1; // memory
int number_iterations = 1000000; // time

Without going into much detail, the first three simulate the memory consumption (how many vector entries I fill in each call; with the setting above each vector has 50 × (1 + 1) × 1 = 100 entries), while the fourth simulates the number of iterations. The point is to see what causes the time to increase as threads are added: the memory consumption per thread, the computational time in each thread, or both.

Here are the results with the settings above:

16 concurrent threads are supported.
2 threads: parallel completed: 0.995 sec.
3 threads: parallel completed: 1.017 sec.
4 threads: parallel completed: 1.028 sec.
5 threads: parallel completed: 1.081 sec.
6 threads: parallel completed: 1.131 sec.
7 threads: parallel completed: 1.122 sec.
8 threads: parallel completed: 1.216 sec.
9 threads: parallel completed: 1.445 sec.
10 threads: parallel completed: 1.603 sec.
11 threads: parallel completed: 1.596 sec.
12 threads: parallel completed: 1.626 sec.
13 threads: parallel completed: 1.634 sec.
14 threads: parallel completed: 1.611 sec.
15 threads: parallel completed: 1.648 sec.
16 threads: parallel completed: 1.688 sec.

So, as you can see, the time increases. Why is that? I also tried the other way around (fewer iterations but more memory):

int number_stops = 50; // memory 39420
int number_transfers = 100; // memory
int number_structures = 100; // memory
int number_iterations = 50; // time 

and the same thing happens; the time increases:

16 concurrent threads are supported.
2 threads: parallel completed: 0.275 sec.
3 threads: parallel completed: 0.267 sec.
4 threads: parallel completed: 0.278 sec.
5 threads: parallel completed: 0.282 sec.
6 threads: parallel completed: 0.303 sec.
7 threads: parallel completed: 0.314 sec.
8 threads: parallel completed: 0.345 sec.
9 threads: parallel completed: 0.370 sec.
10 threads: parallel completed: 0.368 sec.
11 threads: parallel completed: 0.395 sec.
12 threads: parallel completed: 0.407 sec.
13 threads: parallel completed: 0.431 sec.
14 threads: parallel completed: 0.444 sec.
15 threads: parallel completed: 0.448 sec.
16 threads: parallel completed: 0.455 sec.

To give more context, here are the specifications of my computer:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS - Windows 11
  • Compiler - MS Visual Studio 2022

Furthermore, here is a hardware report from CPU-Z:

[CPU-Z hardware report]

My CPU has 8 physical cores, and 16 logical ones.

  • `std::srand` should be used once per program, read [this](https://stackoverflow.com/a/6161352/11424134) – Raffallo Apr 12 '23 at 14:00
  • The first problem I see is that you use `std::rand` from multiple threads. This PRNG is not thread-safe. You have data races, therefore undefined behavior. And the measurements are likely affected by a lot of cache contention. – Daniel Langr Apr 12 '23 at 14:00
  • Ok, I did that only to fill the vector somehow and emulate the real computation I need. Now I just fill it normally with `std::vector v(dimension, 1586)`, but nothing changes in the computation time – Claudio Tomasi Apr 12 '23 at 14:03
  • Yes, but I wanted to see if the problem was a longer runtime or bigger memory consumption. In that version, why does the time increase? – Claudio Tomasi Apr 12 '23 at 14:06
  • If you're spending the majority of the time allocating memory, be aware that the operating system is probably synchronizing that and bottlenecking all your threads. Try using a compute-only test, with buffers pre-allocated, and see if that changes anything (a sketch of this follows the comments below). – paddy Apr 12 '23 at 14:09
  • Without seeing the generated assembly, we cannot tell whether the loops are or aren't optimized out. If they are not, then what you observe is likely overhead caused by synchronization of allocations by the heap. Note that the heap operates on a shared memory space, and therefore it needs to synchronize allocations. This overhead generally grows with the number of threads. To prove that, you can try one of the "scalable memory allocators" such as jemalloc, tcmalloc, or tbbmalloc (or others). Their use on Linux is simple, and I suppose some of them will work on Windows as well. – Daniel Langr Apr 12 '23 at 14:15
  • Thread synchronization has a (non-zero) cost. – Jesper Juhl Apr 12 '23 at 14:27
  • You only have 8 cores, so after 8 threads the elapsed time will need to increase. Modern CPUs also constantly change the CPU clock to stay within the thermal and power limits of the CPU. A single core alone can run at a higher clock rate than all CPU cores running at once. But mostly you see this because, as has been said, your test is spending most of its time allocating memory and this results in contention for access to shared resources. – TrentP Apr 12 '23 at 14:28
  • Ok, but as you can see, the problem starts from 3 threads. Is it really only a matter of allocating memory? If I have 8 cores, shouldn't each thread manage its own allocation? – Claudio Tomasi Apr 12 '23 at 14:31
  • After seeing the comments under my answer, I'm sure the question does not have enough information to help. The most important part of the code has been omitted. – Marek R Apr 12 '23 at 14:53
  • There are parallel allocators that allow each thread to do that, within limits, Daniel's comment mentioned some. On Linux, the default is/was a ptmalloc-based allocator and allows for parallel allocations. I don't know about Windows. But it's not so easy as to allow perfect scaling. Imagine one thread allocates lots of memory and many others allocate little, how does one divide the heap without synchronization between threads? What if one thread allocates and a different thread frees the allocation? – TrentP Apr 12 '23 at 14:53
  • fwiw, [`std::high_resolution_clock`](https://en.cppreference.com/w/cpp/chrono/high_resolution_clock) "The high_resolution_clock is not implemented consistently across different standard library implementations, and its use should be avoided. [...] use `steady_clock` for duration measurements [...]" – 463035818_is_not_an_ai Apr 12 '23 at 14:55
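
Following paddy's suggestion above, a compute-only variant of `paraTask` with the buffer pre-allocated might look like this (a sketch: the lambda name and the `sink` trick are illustrative, while `dimension` and `number_iterations` are the variables from the question's code; drop it in place of `paraTask` inside `main`):

auto paraTaskPrealloc = [&]() {
    std::vector<int> v(dimension);           // one allocation per thread, up front
    long long sink = 0;
    for (int b = 0; b < number_iterations; b++) {
        std::fill(v.begin(), v.end(), 1586); // pure memory writes, no heap traffic
        sink += v[b % dimension];            // consume a value so the work stays observable
    }
    volatile long long keep = sink;          // keep the compiler from deleting the loop
    (void)keep;
};

If the slowdown largely disappears with this version, allocator synchronization, rather than the computation itself, is the likely culprit.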

2 Answers


Using threads is not a golden hammer for performance. If you do not design your program well, adding threads will lead to pessimization.

Your threads do not do much: they just allocate and deallocate memory. The heap is shared state, which is synchronized. This means that besides the thread-spawning overhead, you pay a synchronization penalty.
So you have multiple overheads coming from the threads, and the only part that is parallel and not obstructed by the other threads (the initialization to a default value) is too fast to show any gain from multiple threads. It is too fast compared to those overheads.
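
To make that concrete, here is the question's inner loop annotated with where the shared heap gets involved (a sketch; the comments describe typical allocator behavior, not one specific implementation):

for (int b = 0; b < number_iterations; b++) {
    std::vector<int> v(dimension, 1586); // operator new: the shared heap must synchronize
                                         // this allocation across all running threads
    v.clear();                           // size becomes 0, capacity is kept; nothing freed here
}                                        // v's destructor: operator delete, synchronized again

Every iteration of every thread goes through that synchronized allocate/deallocate pair, so adding threads mostly adds contention rather than useful parallel work.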

Marek R
  • Is there not a way to allocate independently based on the thread that does the allocation? Outside of that thread the vector `v` does not exist, so how can we do better with that? – Claudio Tomasi Apr 12 '23 at 14:33
  • The code you are trying to optimize is too fast. It looks like you stripped out the interesting part of the code, the part that actually computes something. – Marek R Apr 12 '23 at 14:37
  • Yes, it is just an example to understand a small part of what happens. In the real software, each thread calls a function which allocates memory to save the arrival time at each node in a graph, and it needs to keep that value in memory to see if it can be optimized. – Claudio Tomasi Apr 12 '23 at 14:39
  • So your question lacks the information needed to explain your problem. – Marek R Apr 12 '23 at 14:46
  • Here I just want to understand why the posted code behaves as it does. How can I allocate memory in multithreading so that each thread does it independently and without synchronization? Is it really necessary to share the heap between all the cores? Is there a way to mark variables as not synchronized (since, as we all know, these variables are not shared)? – Claudio Tomasi Apr 12 '23 at 14:47
  • Note that there are more ways multithreaded code can end up slower than single-threaded code: synchronization issues (races/deadlocks), false sharing (threads sharing cache lines), and so on. This depends heavily on how you design your data structures and how you iterate over them. I've seen a paper where adding a small temporary variable changed performance drastically, even though it was not shared between threads. – Marek R Apr 12 '23 at 14:52
  • @ClaudioTomasi you cannot. What you can do is allocate memory upfront and then let each thread work on a slice of it – 463035818_is_not_an_ai Apr 12 '23 at 14:52
  • Ok, thanks! Now let us consider that I need an independent structure for each thread. Would creating them a priori and then running the threads on them improve performance? – Claudio Tomasi Apr 12 '23 at 14:57
  • In your example, allocate one big vector of size `np * 1586`, then have each thread use a different section of it, from [i * 1586, (i + 1) * 1586). You will need to pass `i` to the thread's function as an argument. – TrentP Apr 12 '23 at 15:00
  • @ClaudioTomasi depends on what the threads do. In your example they do nothing. You should read about Amdahl's law to understand why your speedup is so bad. Then read about Gustafson's law to see how you can get a better speedup (by increasing the parallel workload) – 463035818_is_not_an_ai Apr 12 '23 at 15:01
  • FYI, after your comments I'm planning to delete my answer. I really recommend you edit your question and provide a complete, representative example which I could run on my machine. – Marek R Apr 12 '23 at 15:12
  • @TrentP I would suggest allocating independent vectors (each with 1586 elements) for threads. With a single vector, you can suffer from _false sharing_. With independent vectors, this is less likely. – Daniel Langr Apr 13 '23 at 09:46
  • @DanielLangr, would vectors allocated sequentially from the heap be located in different address ranges? I think with most allocators one would be likely to get a nearly contiguous allocation for all the vectors, with only an 8-16 byte control record for the allocator between them. So it wouldn't really make much difference for false sharing. The best thing would be to do one large allocation, but pad each 1586-element section to a multiple of `std::hardware_destructive_interference_size`, so that threads won't share the same cache line (see the sketch after these comments). – TrentP Apr 13 '23 at 14:45
  • @TrentP Yes, that would be the best solution. – Daniel Langr Apr 14 '23 at 06:28
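
Putting TrentP's and Daniel Langr's points together, a single padded allocation could look like this (a sketch, assuming a C++17 compiler that provides `std::hardware_destructive_interference_size`; the slice size 1586 and the thread count are taken from the discussion above and are illustrative):

#include <new>      // std::hardware_destructive_interference_size (C++17)
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t slice = 1586;  // elements each thread actually uses
    constexpr std::size_t line  = std::hardware_destructive_interference_size / sizeof(int);
    // Round the per-thread stride up to a whole number of cache lines so
    // no two threads ever write to the same cache line (no false sharing).
    constexpr std::size_t stride = ((slice + line - 1) / line) * line;

    const std::size_t np = 8;
    std::vector<int> big(np * stride);   // one big allocation for all threads

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < np; i++) {
        threads.emplace_back([&big, i, stride] {
            int* my = big.data() + i * stride; // this thread's private, padded section
            for (std::size_t j = 0; j < slice; j++) my[j] = 1586;
        });
    }
    for (auto& t : threads) t.join();
    return 0;
}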

The logical CPUs (16) are, as you said, only logical. The physical CPU count is the number of actual calculation engines, branch processors, etc. A larger number of logical CPUs just enables the multicore CPU to use those "sub-engines" more efficiently.

So the first thing to note is that the time already increases from 1 to 8 threads: all threads are doing the same work, and so are (most of the time) using the same 8 physical cores' execution units, with hyper-threading just minimizing wait times a bit.

Next step: your threads' work is mainly memory-related, and the I/O channel to memory is a really limited resource that is NOT available 8 times over.

To get the result you wish to see, create a working loop that, e.g., uses only integer calculation and VERY little memory (say, by always summing the SAME integers), and throws away all results, to further avoid memory-channel contention. THEN your threads should run nearly fully in parallel, delivering nearly stable runtimes, apart from the (minimal) time the CPU needs for context switches when odds are bad and some other process needs the CPU too. Your program is not alone; the operating system should also get its turn :-) and the more cores you occupy, the higher the probability of sharing computational resources with the system.
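
A minimal sketch of such a compute-only worker (the constants are illustrative; use it as the thread function in the question's code):

auto computeOnly = []() {
    long long sum = 0;
    for (int b = 0; b < 100000000; b++)
        sum += (b & 7) + 1;        // the same tiny integer work every iteration, no heap
    volatile long long sink = sum; // throw the result away without letting the
    (void)sink;                    // optimizer delete the loop
};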

Synopsis