Determining Optimum Thread Count

Question

So, as part of a school assignment, we are being asked to determine what our optimum thread count is for our personal computers by constructing a toy program.

To start, we are to create a task that takes between 20 and 30 seconds to run. I chose to do a coin toss simulation, where the total number of heads and tails are accumulated and then displayed. On my machine, 300,000,000 tosses on one thread ended up at 25 seconds. After that, I went to 2 threads, then 4, then 8, 16, 32, and, just for fun, 100. Here are the results:

* Thread Tosses per thread time(seconds) * ------------------------------------------ * 1 300,000,000 25 * 2 150,000,000 13 * 4 75,000,000 13 * 8 37,500,000 13 * 16 18,750,000 14 * 32 9,375,000 14 * 100 3,000,000 14

And here is the code I'm using:

void toss()
{
    int heads = 0, tails = 0;
    default_random_engine gen;
    uniform_int_distribution<int> dist(0,1);
    int max =3000000;                          //tosses per thread
    for(int x = 0; x < max; ++x){(dist(gen))?++heads:++tails;}
    cout<<heads<<" "<<tails<<endl;
}

int main()
{
    vector<thread>thr;
    time_t st, fin;
    st = time(0);

    for(int i = 0;i < 100;++i){thr.push_back(thread(toss));} //thread count
    for(auto& thread: thr){thread.join();}

    fin = time(0);
    cout<<fin-st<<" seconds\n";
    return 0;
}

Now for the main question:

Past a certain point, I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.

Is there something fundamentally wrong with my code that would yield these sorts of results, or is this behavior considered normal? I'm very new to multi-threading, so I have a feeling it's the former....

Thanks!

EDIT: I am running this on a macbook with a 2.16 GHz Core 2 Duo (T7400) processor

"*I would've expected there to be a considerable decline in computing speed as more threads were added,*" Why? Are you assuming the implementation would do so many context switches that they take up a significant fraction of the execution time? That only happens if the implementation has no choice (say because the threads keeps blocking) but here it has a choice, it can let the threads run. If you have a decent OS, the people who designed its scheduler are world-class experts, they don't do obviously bad stuff when there are obvious better choices. — David Schwartz, Nov 22 '16 at 03:47

kmdreko · Accepted Answer · 2016-11-22T04:01:45.593

Your results seem very normal to me. While thread creation has a cost, its not that much (especially compared to the per-second granularity of your tests). An extra 100 thread creations, destructions, and possible context-switches isn't going to change your timing by more than a few milliseconds I bet.

Running on my Intel i7-4790 @ 3.60 GHz I get these numbers:

threads - seconds
-----------------
1       -  6.021
2       -  3.205
4       -  1.825
8       -  1.062
16      -  1.128
32      -  1.138
100     -  1.213
1000    -  2.312
10000   - 23.319

It takes many, many more threads to get to the point at which the extra threads make a noticeable difference. Only when I get to 1,000 threads do I see that the thread-management has made a significant difference and at 10,000 it dwarfs the loop (the loop is only doing 30,000 tosses at that point).

As towards your assignment, it should be fairly straightforward to see that the optimal number of threads for your system should be the same as the available threads that can be executed at once. There's not any processing power left to execute another thread until one is either done or yielded, which doesn't help you finish faster. And, any less threads and you aren't using all available resources. My CPU has 8 threads and the chart reflects that.

dtech · Answer 2 · 2018-03-08T13:27:17.680

Edit 2 - To further elaborate on the "lack of performance penalty" part due to popular demand:

...I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.

I made this giant chart in order to better illustrate the scaling.

To explain the results:

The blue bar illustrates the total time to do all the tosses. Although that time decreases all the way up to 256 threads, the gains from doubling the thread count gets smaller and smaller. The CPU I ran this test had 4 physical and 8 logical cores. Scaling is pretty good all the way to 4 cores and decent to 8 cores, and then it plummets. Pipeline saturation allows to get minor gains all the way to 256 but it is simply not worth it.

The red bar illustrates the time per toss. It is nearly identical for 1 and 2 threads, as the CPU pipeline hasn't reached full saturation yet. It gets a minor hit at 4 threads, it still runs fine but now the pipeline is saturated, and at 8 threads it really shows that logical threads are not the same thing as physical, that gets progressively worse pushing above 8 threads.

The green bar illustrates the overhead, or how much lower actual performance is relative to the expected double boost from doubling the threads. Pushing above the available logical cores causes the overhead to skyrocket. Note that this is mostly thread synchronization, the actual thread scheduling overhead is probably constant after a given point, there is a minimal window of activity time a thread must receive, which explains why the thread switching doesn't come to the point of overwhelming the work throughput. In fact there is no severe performance drop all the way to 4k threads, which is expected as modern systems have to be able and often run over thousand threads in parallel. And again, most of that drop is due to thread synchronization, not thread switching.

The black outline bar illustrates the time difference relative to the lowest time. At 8 threads we only lose ~14% of absolute performance from not having the pipeline oversaturated, which is a good thing because it is most cases not really worth stressing the entire system over so little. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can pull off. Which gives a figure of how good logical cores are compared to physical cores, 100% extra logical cores give a 50% boost in performance, in this use case a logical thread is ~50% as good as a physical thread, which also correlates to the ~47% boost we see going from 4 to 8. Note that this is a very simply workload though, in more demanding cases it is close to 20-25% for this particular CPU, and in some edge cases there is actually a performance hit.

Edit 1 - I foolishly forgot to isolate the computational workload from the thread synchronization workload.

Running the test with little to no work reveals that for high thread counts the thread management part takes the bulk of the time. Thus the thread switching penalty is indeed very small and possibly after a certain point a constant.

And it would make a lot of sense if you put yourself in the shoes of a thread scheduler maker. The scheduler can easily be protected from being choked by an unreasonably high switching to working ratio, so there is likely a minimal window of time the scheduler will give to a thread before switching to another, while the rest are put on hold. This ensures that the the switching to working ratio will never exceed the limits of what is reasonable. It would be much better to stall other threads than go crazy with thread switching, as the CPU will mostly be switching and doing very little actual work.

The optimal thread count is the available amount of logical CPU cores. This achieves optimal pipeline saturation.

If you use more you will suffer performance degradation due to the cost of thread context switching. The more threads, the more penalty.

If you use less, you will not be utilizing the full hardware potential.

There is also the problem of workload graining, which is very important when you utilize synchronization such as a mutex. If your concurrency is too finely grained you can experience performance drops even when going from 1 to 2 threads on a 8 thread machine. You'd want to reduce synchronization as much as possible, doing as much work as possible in between synchronizations, otherwise you can experience huge performance drops.

Note the difference between physical and logical CPU core. Processors with hyper-threading can have more than one logical core per physical core. "Secondary" logical cores do not have the same computational power as the "primary" as they are merely used to utilize vacancies in the processor pipeline usage.

For example, if you have a 4 core 8 thread CPU, in the case of a perfectly scaling workload you will see 4 times increase of performance going from 1 to 4 threads, but a lot less going from 4 to 8 threads, as evident from vu1p3n0x's answer.

You can look here for ways to determine the number of available CPU cores.

Determining Optimum Thread Count

2 Answers2