
I'm trying to understand why increasing the number of threads (beyond a certain number) increases the execution time instead of decreasing it.

A summary of what the code does: the main function creates a large vector based on a dimension, fills it with indices (0..dimension-1), and then shuffles it. Then, in order to divide and conquer, I partition this vector, giving a slice to each thread. I prepare a vector of solution vectors, one entry per thread. Each thread calls a function on each element of its slice, passing a reference to its own solution vector. This function just changes the value of the solution at the given input index and then increments all the elements of the solution (I put this in to increase the computation time a bit). Here is the code:

#include <algorithm>
#include <random>
#include <thread>
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cstdio>
#include <ctime>

using namespace std;
template<typename T>
inline double getMs(T start, T end) {
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count()) /
        1000;
}

void fun(const std::vector<int>& slice, int dimension, 
    vector<int>& internal_sol) {
    int size = dimension;

    for (auto const& s : slice) {
        internal_sol[s] = 45;
        internal_sol[int(s/2)] = 45;
        if (s != 0) {
            internal_sol[s - 1] = 45;
        }
        if (s != internal_sol.size() - 1) {
            internal_sol[s + 1] = 45;
        }
        for (auto& i_s : internal_sol)
            i_s += 1;
    }

}


int main() {

    
    std::random_device rd;
    std::mt19937 g(rd());

    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";

    std::srand(unsigned(std::time(nullptr)));

    int number_stops = 50000; 
    int number_transfers = 3; 
    int number_structures = 2; 

    auto dimension = number_stops * (number_transfers + 1) * number_structures;
    std::vector<int> v(dimension);
    std::iota(std::begin(v), std::end(v), 0);
    std::shuffle(std::begin(v), std::end(v), g);

    printf("stops:\t%d\ntransfers:\t%d\nstructures:\t%d\n\n", number_stops, number_transfers, number_structures);

    for (size_t np = 2; np < 17; np++) {

        size_t sz = dimension;
        size_t part = sz / np;
        auto start = std::chrono::steady_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);


        auto start_ext_sol = std::chrono::steady_clock::now();
        vector<vector<int>> external_sol; //As TAUs from velociraptor
        for (size_t i = 0; i < np; i++) {
            external_sol.emplace_back(dimension, -1);
        }
        double elapsed_ext_sol = getMs(start_ext_sol, std::chrono::steady_clock::now());


        auto paraTask = [&](size_t start, size_t end, vector<int>& ext_sol) {
            for (size_t l = start; l < end; ++l)
                fun({v[l]}, dimension, ext_sol);
            for (size_t i = 0; i < ext_sol.size(); i++)
                ext_sol[i] = -1;
        };

        for (size_t i = 0; i < np; i++) {
            size_t start = i * part;
            size_t length = (i + 1 == np) ? sz - i * part : part;
            threads[i] =
                std::thread(paraTask, start, start + length, std::ref(external_sol[i]));
        }
        
        // Join the threads
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::steady_clock::now());

        printf("parallel completed: %.3f sec. preparing all vectors ext_sols: %.3f sec\n",
            elapsed, elapsed_ext_sol);
    }

    return 0;
}

The result obtained is:

16 concurrent threads are supported.
stops:  50000
transfers:      3
structures:     2

2 threads: parallel completed: 6.776 sec. preparing all vectors ext_sols: 0.000 sec
3 threads: parallel completed: 5.711 sec. preparing all vectors ext_sols: 0.004 sec
4 threads: parallel completed: 5.285 sec. preparing all vectors ext_sols: 0.003 sec
5 threads: parallel completed: 4.892 sec. preparing all vectors ext_sols: 0.001 sec
6 threads: parallel completed: 4.614 sec. preparing all vectors ext_sols: 0.008 sec
7 threads: parallel completed: 4.726 sec. preparing all vectors ext_sols: 0.001 sec
8 threads: parallel completed: 4.281 sec. preparing all vectors ext_sols: 0.004 sec
9 threads: parallel completed: 4.815 sec. preparing all vectors ext_sols: 0.007 sec
10 threads: parallel completed: 4.791 sec. preparing all vectors ext_sols: 0.008 sec
11 threads: parallel completed: 5.594 sec. preparing all vectors ext_sols: 0.002 sec
12 threads: parallel completed: 5.411 sec. preparing all vectors ext_sols: 0.012 sec
13 threads: parallel completed: 5.213 sec. preparing all vectors ext_sols: 0.010 sec
14 threads: parallel completed: 6.159 sec. preparing all vectors ext_sols: 0.005 sec
15 threads: parallel completed: 8.353 sec. preparing all vectors ext_sols: 0.006 sec
16 threads: parallel completed: 6.769 sec. preparing all vectors ext_sols: 0.012 sec

As you can see, the time increases. I don't see why this happens. The dimension here is 400,000 integers for each thread, therefore around 1.6 MB (counting 4 bytes for each integer). Thus, we don't go over the size of the L1 cache, in my understanding. What is happening?

Here are some details about my machine:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS - Windows 11
  • Compiler - MSVC (Visual Studio 2022)

And here is the memory hardware report from CPU-Z:

[CPU-Z memory report screenshot]

PS: I also tried removing the loop

for (auto& i_s : internal_sol)
    i_s += 1;

from the function, but the same thing happens (of course with lower timings).

  • Refer to https://stackoverflow.com/questions/76007997/parallel-for-vs-foreach-vs-for-performance#comment134064103_76007997 – Graffito Apr 14 '23 at 09:35
  • That CPU has a 3.60 GHz base frequency for multi-core loads, a 5.0 GHz boost for single-core loads, and some intermediate value for few-core loads. Maybe this could be a factor. You can watch the CPU frequency during the load and/or disable turbo boost to see if it is contributing. – teapot418 Apr 14 '23 at 09:38
  • Your L1 cache is 80K per core. – Mike Vine Apr 14 '23 at 09:38
  • ok yes, I completely got confused. I have an L1 of 80K, an L2 of 512K, and an L3 of 16MB, but shared between threads. Therefore, is it correct to say that I probably have cache misses (with more threads, there is too much stuff for the L3) and also synchronization problems, since the L3 level is shared? – Claudio Tomasi Apr 14 '23 at 09:44
  • I again suggest reading about Amdahl's law and Gustafson's law. The trend you see is consistent with Amdahl's law: adding more threads, you first see some speedup, but once the parallel part of the algorithm is already so small, the overhead of adding more threads dominates. And Gustafson's law tells you that you get better utilization by increasing the workload. – 463035818_is_not_an_ai Apr 14 '23 at 09:46
  • Only with a perfectly parallel algorithm, and only if creating threads had zero overhead, could you make 100% use of all the available hardware. – 463035818_is_not_an_ai Apr 14 '23 at 09:48
  • In this case, where is the thread overhead? – Claudio Tomasi Apr 14 '23 at 09:50
  • `threads[i] = std::thread(...)`: creating a thread is not free. `thread.join()`: joining a thread is not free. The operating system and the language do a great deal to hide some complexity from you. The overhead is small, but it increases the more threads you spawn. – 463035818_is_not_an_ai Apr 14 '23 at 09:54
  • https://en.wikipedia.org/wiki/Amdahl%27s_law – 463035818_is_not_an_ai Apr 14 '23 at 09:57
  • Analysing what is happening in this code turned out to be far from trivial. Wild guesses did not work. The rule of thumb is to **use low-level profilers** to check hypotheses. – Jérôme Richard Apr 14 '23 at 23:57

2 Answers


Your observations are consistent with Amdahl's law plus the overhead of using multiple threads.

Very roughly speaking, and not going into details: you always have some part of the computation that can benefit from adding more threads (call it P), some part that cannot benefit from adding more threads (call it S), and some overhead that comes per thread (T).

Using more threads t decreases the parallel contribution P/t but not S, and increases the total overhead T*t:

 total time = P/t + S + T*t

Even if the overhead is small, at some point it will dominate, because adding more threads cannot make P/t smaller than 0. In the limit of lots of threads you end up with an increasing time:

 total time (for huge t) = 0 + S + T*t

This is just a very basic analysis of the different components that contribute to your runtime. But it is already sufficient to see that, even for an algorithm that parallelizes well (P > S) and has a small overhead (small T), there is always a point where the overhead of adding more threads will dominate.
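
For illustration, here is a minimal sketch that just evaluates this model for different thread counts. The values of P, S and T below are arbitrary assumptions chosen for the example, not measurements of the code in the question:

#include <cstdio>

int main() {
    // Illustrative model parameters (assumed values, not measured ones):
    const double P = 8.0;   // seconds of perfectly parallelizable work
    const double S = 1.0;   // seconds of inherently serial work
    const double T = 0.05;  // overhead per additional thread, in seconds

    for (int t = 1; t <= 16; ++t) {
        double total = P / t + S + T * t; // the simple model from above
        std::printf("t = %2d -> total time = %.3f s\n", t, total);
    }
    return 0;
}

With these particular numbers the minimum is around t = sqrt(P/T) ≈ 13; a larger per-thread overhead T moves the minimum towards fewer threads.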

Amdahl did not consider the overhead, hence his total time (actually he considers the speedup, which is total time (t=1) / total time) saturates at some point. However, real threads always come with a small overhead. For further investigation, to see where the overhead actually comes from, I would try to measure the time for differently sized data. Then you might see effects from the caches.

For completeness I should also mention Gustafson, who observed that the parallel part P often increases with the problem size, while the non-parallel part S often does not. For example, reading some configuration from a file needs to be done once, whether you run 100 or 10000 iterations of some algorithm. With these assumptions, P = p(N) and S = constant, one can conclude that increasing the workload N improves the utilization of the hardware. You might see this effect in your experiment: with a smaller vector, the time already increases for a smaller number of threads, while with a bigger vector the time only increases for a larger number of threads.

463035818_is_not_an_ai
  • I have to disagree. I profiled the program on my machine and the time to create the threads is not an issue! It makes sense, since the program lasts for several seconds while creating threads takes no more than a few milliseconds. The problem comes from cache accesses (misses & thrashing). – Jérôme Richard Apr 14 '23 at 23:49
  • @JérômeRichard I am merely saying that there is a component of the runtime that increases with the number of threads – 463035818_is_not_an_ai Apr 15 '23 at 06:51
  • @JérômeRichard well, ok, the first sentence can be misunderstood. I meant all the overhead that comes from creating more threads – 463035818_is_not_an_ai Apr 15 '23 at 10:04

Here is a more practical and more precise/specific explanation of why the timings increase with more threads:

  • cache misses are the main scalability bottleneck, especially due to the shared L3 cache and the DRAM (also shared between cores);
  • hyper-threading does not help much, so using more than 8 threads is useless;
  • creating threads is definitely not an issue (negligible time).

Your i7-11700KF processor has 8 cores and 2 hardware threads per core. Thus, up to 8 threads can each run on their own core without much scalability issue coming from the hardware itself. Using more than 8 threads causes 2 threads to run on the same core. This is why the timings decrease up to 8 threads and then start to increase again with >= 9 threads: one core has two threads to run while the others likely have only one. Note that threads can move from one core to another (unless you bind them manually), hence the unstable timings beyond 8 threads.
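
If you want to cap the number of worker threads at the number of physical cores, here is a minimal sketch. It assumes 2 logical threads per physical core (true for the i7-11700KF, but not for every CPU); standard C++ only exposes the logical count via std::thread::hardware_concurrency(), which may also return 0 when the value is unknown:

#include <algorithm>
#include <cstdio>
#include <thread>

// Rough guess of the number of physical cores, assuming 2 logical
// threads per core (an assumption; a library like hwloc gives the real layout).
unsigned int physical_core_guess() {
    unsigned int logical = std::thread::hardware_concurrency();
    if (logical == 0) return 1;        // the value is not known on this platform
    return std::max(1u, logical / 2);  // e.g. 16 logical -> 8 physical here
}

int main() {
    std::printf("using at most %u worker threads\n", physical_core_guess());
    return 0;
}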

Hyper-threading (the Intel technology enabling cores to execute more than 1 thread per core) is mainly meant to increase processor efficiency by running 2 threads in parallel on the same core. But there is a catch: the hardware threads mostly share the core's hardware resources. For example, the L1 and L2 caches are shared, so having more threads on the same core also means less cache space per thread. This is also the case for the instruction decoder, which needs to decode the instructions of the two threads instead of just one (twice as many instructions). This is why hyper-threading does not systematically make code run faster. In fact, many codes are (a bit) slower because of the cache overheads (e.g. more cache misses due to cache thrashing). Codes running faster with hyper-threading are typically the ones that are latency-bound, especially the ones bound by the memory hierarchy (e.g. a naive matrix transposition). That being said, this is not always true. For example, the line-fill buffers of the processor, responsible for registering the memory accesses sent to the L2 due to L1 misses, are shared between the 2 hardware threads, so having two threads does not help if 1 thread can already saturate them. Such a thing happens in your case. Also note that codes running a long chain of dependent high-latency instructions can also benefit a bit from hyper-threading (e.g. iterative numerical recipes using divisions or FMAs), but this is not your case. In the end, this is why hyper-threading does not help here.

Then the question is why the algorithm does not scale even on 8 cores. @463035818_is_not_a_number mostly explained why there is a theoretical limit to the scalability: a significant fraction of instructions being executed sequentially would be the bottleneck. However, the code looks pretty embarrassingly parallel overall, so this does not really explain it. The time to create and join threads (as mentioned in the comments) is not an issue here, because the code runs for several seconds while the time to create threads is no more than a few milliseconds on most PCs (including mine).

The thing is, your algorithm is bound by the memory hierarchy, and more specifically the scalability of the code is bound by the L3 cache, which is shared by all cores. Indeed, there are np vectors called external_sol[i] and each one takes dimension*sizeof(int), which is 50000*(3+1)*2*4/1024² = 1.52 MiB in memory. Thus, with 8 threads, 12.16 MiB is used. Each vector fits in neither the L1 nor the L2 cache of your processor (respectively 80 KiB and 512 KiB per core). The vectors are accessed in each thread using a random integer between 0 and the size of the vector. This means fetched items cannot be kept for long in the L1/L2 cache, because previously fetched values are quickly evicted to make room for new ones, and because of the shuffle a value is essentially never read twice in a row. The same cache line can be read twice by the same thread, but a given cache line is likely evicted long before it is read again. This causes threads to load a cache line from the L3 cache for each value. The L3 cache is designed to scale, but this code puts serious pressure on the L3 (throughput). There is another big issue: the L3 slices are 2 MiB per core and the L3 cache is not perfect, so there are some cache misses causing the DRAM to be used. This is also why your program starts to be significantly slower with more than 8 threads: the amount of space required for all the vectors is clearly too big for the L3, and most of the reads/stores need to go directly to DRAM, with a much bigger latency. For example, with 11 threads you need 11*1.52 = 16.72 MiB, which is more than your L3, which should be 16 MiB wide. With 16 threads, it is even 24.32 MiB. Last but not least, there is another, more subtle effect: while using more threads means fewer memory accesses per thread, the accesses are more spread out in memory, so the probability of reusing cache lines drops as the number of threads grows.
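
Here is a small sketch of that back-of-the-envelope computation; the 16 MiB L3 size is taken from the question and is an assumption about the exact SKU:

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t dimension = 50000 * (3 + 1) * 2;   // 400,000 ints per solution vector
    const double per_thread_MiB =
        dimension * sizeof(int) / (1024.0 * 1024.0);      // ~1.5 MiB per vector
    const double l3_MiB = 16.0;                           // assumed L3 size of the i7-11700KF

    for (int np = 2; np <= 16; ++np) {
        double total_MiB = np * per_thread_MiB;
        std::printf("%2d threads -> %5.2f MiB of solution vectors (%s the L3)\n",
                    np, total_MiB, total_MiB > l3_MiB ? "exceeds" : "fits in");
    }
    return 0;
}

This is only a rough bound: the input vector v, the stacks and everything else also compete for the L3, so in practice the slowdown can start a bit earlier.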

I analysed the execution of this code on my i5-9600KF processor with a low-level profiler (VTune) and I can confirm that a significant part of the accesses are performed in DRAM with 5 threads, while this is clearly not the case with 1 thread. VTune also reports that a significant amount of time is spent on cache misses (mainly in the L2 on my processor, since it is half the size of yours).

Note that the frequency of the cores decreases as more cores become active, so that the processor stays power efficient (as indicated by @teapot418). This is why your processor (like many recent processors) cannot execute an algorithm 8 times faster on 8 cores (even assuming the target algorithm is efficient sequentially).

Jérôme Richard
  • Perfect! That was a thorough answer, thank you so much. Just one more thing (but it maybe requires an entire further question): is it possible to select the max number of threads dynamically based on the hardware? For example, I get the cache size, the physical core count, the size of what I want to create, and thus select an optimal number of threads, all inside C++? – Claudio Tomasi Apr 17 '23 at 09:11
  • When referring to my answer you focus a little too much on "overhead from creating threads". This was poor wording from my side and has been fixed by now. In my answer I am not referring to the overhead of only creating a thread, but generally to the overhead that scales linearly with the number of threads. Yes, my answer is very handwavy. You might want to update your answer, because "I was not convinced by the accepted answer ..." and there is no need to refer to mine, this answer can stand on its own. – 463035818_is_not_an_ai Apr 17 '23 at 09:49
  • @ClaudioTomasi You can get the number of hardware threads in C++ with `std::thread::hardware_concurrency()`. However, it is generally not a good idea for well-optimized code to use hyper-threading. Low-level APIs such as `hwloc` let you retrieve interesting information such as the hardware core layout, the caches, NUMA node distances, etc. The thing is, it is very hard to come up with a general formula to find the best number of threads, even for this use case. A simplistic formula might be found for mainstream processors of a given vendor (e.g. only Intel mainstream processors, or only AMD). – Jérôme Richard Apr 26 '23 at 22:12
  • @ClaudioTomasi In practice, using 1 thread per core here is not so bad, so you can just do that until you find a machine on which you need to run your application and where this is clearly sub-optimal. This has the benefit of being simple. – Jérôme Richard Apr 26 '23 at 22:15
  • @463035818_is_not_a_number Fair point. I removed the first sentence. – Jérôme Richard Apr 26 '23 at 22:15