DWORD WINAPI MyThreadFunction(LPVOID lpParam) {
    // volatile keeps the compiler from optimising the busy-work loop away
    volatile auto x = 1;
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}

This function is run in MAX_THREADS threads.
I have run the tests on an Intel Core 2 Duo under Windows 7, built with MS Visual Studio 2012 and measured with the Concurrency Visualizer, with MAX_THREADS=4 and MAX_THREADS=50.
test1 (4 threads) completed in 7.1 seconds, but test2 (50 threads) completed in 5.8 seconds, and test1 also showed more context switches than test2.
I have run the same tests on an Intel Core i5 under Mac OS 10.7.5 and got the same results.
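
For reference, the launch and timing code was not posted, so here is a hypothetical sketch of a Windows harness of the kind implied (CreateThread, WaitForMultipleObjects and GetTickCount timing are assumptions, not the OP's actual setup):

// Hypothetical reconstruction of the launch/timing harness (not shown in the
// question): create MAX_THREADS threads, wait for all of them, time the run.
#include <windows.h>
#include <cstdio>

#define MAX_THREADS 4   // set to 50 for the second test

DWORD WINAPI MyThreadFunction(LPVOID lpParam) {   // same worker as above
    volatile auto x = 1;
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}

int main() {
    HANDLE threads[MAX_THREADS];
    DWORD start = GetTickCount();

    for (int i = 0; i < MAX_THREADS; ++i)
        threads[i] = CreateThread(NULL, 0, MyThreadFunction, NULL, 0, NULL);

    // MAX_THREADS <= 64 here, so one WaitForMultipleObjects call suffices.
    WaitForMultipleObjects(MAX_THREADS, threads, TRUE, INFINITE);
    printf("Elapsed: %lu ms\n", GetTickCount() - start);

    for (int i = 0; i < MAX_THREADS; ++i)
        CloseHandle(threads[i]);
    return 0;
}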

dizel3d
  • Have you tried anything in between 4 and 50 threads? And what compiler options did you use? – Mysticial Apr 28 '13 at 22:16
  • How many actual cores does your processor have? (It's been a while since I looked at different models of Intel processors - I've been an AMD man for about 20 years and haven't had my own computer with an Intel processor in that time.) How do you measure the time? – Mats Petersson Apr 28 '13 at 22:22
  • Can you describe how you set up the test and at what stage you started timing? I expect you created all the threads suspended, then started your timer and immediately resumed all threads. You should then stop the timer after all threads have signalled that they are finished. It would be worth running the program for much longer than a few seconds to get a true measure, and/or repeating the experiment multiple times. – paddy Apr 28 '13 at 22:22
  • I just created a Console Application project with the default options. But on Mac OS I use GCC with -O3. – dizel3d Apr 28 '13 at 22:24
  • @paddy I start the timer when the program starts and stop it exactly before it finishes. But in MSVC I use the Concurrency Visualizer instead of a timer. I have tried many times. – dizel3d Apr 28 '13 at 22:30
  • Maybe somebody wants to try? :) – dizel3d Apr 28 '13 at 22:32
  • If the main thread exits before the worker threads are done, it will kill the worker threads before they complete. – brian beuning Apr 28 '13 at 23:59
  • I've confirmed the behaviour on my machine, using a more statistical approach. I've provided a graph and a probable explanation as an answer. – paddy Apr 29 '13 at 02:48
  • I stumbled upon this question and it intrigued me. Is it possible that with 50 threads your code has more chances to get into L2 cache on all cores and stay there? This would explain the result. Wouldn't it? – Victor Havin Jun 09 '18 at 01:28

5 Answers


I decided to benchmark this myself on my 4-core machine. I directly compared 4 threads with 50 threads by interleaving 100 tests of each. I used my own numbers so that I had a reasonable execution time for each task.
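
A rough sketch of that interleaving approach follows (this is not paddy's actual harness, which isn't shown; std::thread, the run_trial helper and the iteration count are illustrative assumptions):

// Interleave the two configurations so slow background activity affects
// both of them roughly equally across the 100 samples.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Run the whole job split across `nthreads` threads; return wall-clock seconds.
static double run_trial(int nthreads) {
    auto work = [nthreads] {
        volatile long long x = 1;
        for (long long i = 0; i < 800000000LL / nthreads; ++i)   // count is illustrative
            x += i / 3;
    };
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int i = 0; i < nthreads; ++i) pool.emplace_back(work);
    for (auto &t : pool) t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    for (int sample = 0; sample < 100; ++sample) {   // 100 interleaved samples of each
        double t4  = run_trial(4);
        double t50 = run_trial(50);
        std::printf("4 threads: %.3f s   50 threads: %.3f s\n", t4, t50);
    }
}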

The result was as you described. The 50-thread version is marginally faster. Here is a box plot of my results:

Parallel task comparison graph

Why? I think this comes down to the thread scheduling. The task is not complete until all threads have done their work, and each thread must do a quarter of the job. Because your process is being shared with other processes on the system, if any single thread is switched out to another process, this will delay the entire task. While we are waiting for the last thread to finish, all the other cores are idle. Note how the time distribution of the 4-thread test is much wider than that of the 50-thread test, as we might expect.

When you use 50 threads, each thread has less to do. Because of this, any delays in a single thread will have a less significant effect on the total time. When the scheduler is busy rationing cores out to lots of short threads, a delay on one core can be compensated by giving these threads time on another core. The total effect of latency on one core is not as much of a show-stopper.

So it would seem that in this case the extra context-switching is not the biggest factor. While the gain is small, it appears to be beneficial to swamp the thread scheduler a little bit, given that the processing is much more significant than the context-switching. As with everything, you must find the correct balance for your application.


[edit] Out of curiosity I ran a test overnight while my computer wasn't doing much else. This time I used 200 samples per test. Again, tests were interleaved to reduce the impact of any localised background tasks.

The first plot of these results is for low thread-counts (up to 3 times the number of cores). You can see how some choices of thread count are quite poor... That is, anything that is not a multiple of the number of cores, and especially odd values.

Additional test plot - low thread count

The second plot is for higher thread-counts (from 3 times the number of cores up to 60).

Additional test plot - high thread count

Above, you can see a definite downward trend as the thread-count increases. You can also see the spread of results narrow as the thread-count increases.

In this test, it's interesting to note that the performance of the 4-thread and 50-thread tests was about the same, and the spread of results in the 4-thread test was not as wide as in my original test. Because the computer wasn't doing much else, it could dedicate time to the tests. It would be interesting to repeat the test while placing one core under 75% load.

And just to keep things in perspective, consider this:

Scaling threads


[Another edit] After posting my last lot of results, I noticed that the jumbled box plot showed a trend for those tests that were multiples of 4, but the data was a little hard to see.

I decided to do a test with only multiples of four, and thought I may as well find the point of diminishing returns at the same time. So I used thread counts that are powers of 2, up to 1024. I would have gone higher, but Windows bugged out at around 1400 threads.

The result is rather nice, I think. In case you wonder what the little circles are, those are the median values. I chose them instead of the red line that I used previously because they show the trend more clearly.

Trend for exponentiating the thread-count

It seems that in this particular case, the pay dirt lies somewhere between 50 and 150 threads. After that, the benefit quickly drops away, and we're entering the territory of excessive thread management and context-switching.

The results might vary significantly with a longer or shorter task. In this case, it was a task involving a lot of pointless arithmetic which took approximately 18 seconds to compute on a single core.

By tuning only the number of threads, I was able to shave an extra 1.5% to 2% off the median execution time of the 4-thread version.

paddy
  • Thanks @Mysticial. Actually, I'm disappointed I didn't choose better values for the higher thread-counts. If you look at the box plot for only multiples of 4, it makes a very nice asymptotic curve. But I didn't test all multiples of 4 up to 60. It would be interesting to extend the test up to crazy thread-counts like 10000 to see where the performance degrades. It might be a pain to measure that because I can only wait on up to 64 thread handles at once. I'd have to be sure that clustering several calls to `WaitForMultipleObjects` does not influence the result. – paddy Apr 29 '13 at 23:59
  • @paddy: You've restored my faith in Stack Overflow. – RichieHindle Apr 30 '13 at 05:22
  • Thank you so much! How did you make these graphs? Are they self-made? Looks like Matlab. – dizel3d Apr 30 '13 at 06:44
  • No worries. It was quite interesting to dig into it a bit. Plenty more topics around this one I reckon. The graphs were indeed done with Matlab using `boxplot` and pretty much the default settings. – paddy Apr 30 '13 at 12:44
  • At last, some sanity and actual numbers, instead of the usual textbook/FUD mantra 'only spawn as many threads as there are cores for CPU-intensive work, else you will lose performance to context-switching'. To be fair, though, the optimal number of threads is highly dependent upon the dataset being operated on - if the data occupies a complete L1 cache, or more, the stats will indeed skew heavily towards one thread per core. Nevertheless, +1 for actually trying stuff! – Martin James May 03 '13 at 00:10
  • I totally agree. These tests were simply for the purpose of validating and explaining the claim made by the OP. I made a passing comment about these results being highly specific to the actual task being performed. I think that, when performance is critical, it's important to actually test theories using representative datasets and statistical methods. – paddy May 03 '13 at 00:27

It all depends on what your threads are doing.

Your computer can only concurrently run as many threads as there are cores in the system. This includes virtual cores via features like Hyper-threading.

CPU-bound

If your threads are CPU-bound (meaning they spend the vast majority of their time doing calculations on data that is in memory), you will see little improvement by increasing the number of threads above the number of cores. You actually lose efficiency with more threads running, because of the added overhead of context-switching the threads on and off the CPU cores.

I/O-bound

Where (#threads > #cores) helps is when your threads are I/O-bound, meaning they spend most of their time waiting on I/O (hard disk, network, other hardware, etc.). In this case, a thread that is blocked waiting for I/O to complete will be pulled off the CPU, and a thread that is actually ready to do something will be put on instead.

The way to get highest efficiency is to always keep the CPU busy with a thread that's actually doing something. (Not waiting on something, and not context-switching to other threads.)
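
To make that distinction concrete, here is a contrived sketch (an editorial illustration, not part of the original answer) in which each "I/O operation" is simulated with a sleep. Because a blocked thread occupies no core, running far more threads than cores still shortens the wall-clock time here, unlike the CPU-bound loop in the question:

// Illustration only: a fixed number of simulated I/O operations (100 ms sleeps)
// shared out over different thread counts. Sleeping threads use no CPU, so
// more threads than cores still reduces the elapsed time.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static double run(int nthreads, int total_ops) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([=] {
            for (int i = 0; i < total_ops / nthreads; ++i)
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
        });
    }
    for (auto &th : pool) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("4 threads:  %.1f s\n", run(4, 200));    // ~5.0 s
    std::printf("50 threads: %.1f s\n", run(50, 200));   // ~0.4 s
}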

Jonathon Reinhart
  • Haven't you missed the point of the question? They expect the behaviour you describe, but their experiments result in the opposite behaviour. – paddy Apr 28 '13 at 22:11
  • Looking at their code, it looks like they should only be CPU-bound, and yet they see more context switches and slower times with fewer threads. So their results seem to go against your answer. – Xymostech Apr 28 '13 at 22:16
  • @Xymostech Sigh, you are correct. I should have read the question a little more closely. I guess I saw the title and just started answering. – Jonathon Reinhart Apr 28 '13 at 22:17
  • @JonathonReinhart Thanks anyway, this was a very enlightening answer for many other threading issues I've run into in the past. – Keshav Saharia Apr 28 '13 at 23:48

I took some code that I had "laying about" for some other purposes and re-used it - so please beware that it's not "pretty", nor is it supposed to be a good example of how you should do this.

Here's the code I came up with (this is on a Linux system, so I'm using pthreads, and I removed the "WINDOWS-isms"):

#include <iostream>
#include <pthread.h>
#include <cstring>
#include <cstdlib>   // for strtol

int MAX_THREADS = 4;

// Same busy-work as the original; volatile stops the compiler from
// optimising the loop away.
void * MyThreadFunction(void *) {
    volatile auto x = 1;
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}


using namespace std;

int main(int argc, char **argv)
{
    // Parse "-t <count>" to select the number of threads.
    for (int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], "-t") == 0 && argc > i+1)
        {
            i++;
            MAX_THREADS = strtol(argv[i], NULL, 0);
            if (MAX_THREADS == 0)
            {
                cerr << "Hmm, seems like end is not a number..." << endl;
                return 1;
            }
        }
    }
    cout << "Using " << MAX_THREADS << " threads" << endl;

    pthread_t *thread_id = new pthread_t[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i++)
    {
        int rc = pthread_create(&thread_id[i], NULL, MyThreadFunction, NULL);
        if (rc != 0)
        {
            cerr << "Huh? Pthread couldn't be created. rc=" << rc << endl;
        }
    }
    // Wait for every worker to finish.
    for (int i = 0; i < MAX_THREADS; i++)
    {
        pthread_join(thread_id[i], NULL);
    }
    delete [] thread_id;
}

Running this with a variety of number of threads:

[MatsP@linuxhost junk]$ g++ -Wall -O3 -o thread_speed thread_speed.cpp -std=c++0x -lpthread
[MatsP@linuxhost junk]$ time ./thread_speed -t 4
Using 4 threads

real    0m0.448s
user    0m1.673s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 50
Using 50 threads

real    0m0.438s
user    0m1.683s
sys 0m0.008s
[MatsP@linuxhost junk]$ time ./thread_speed -t 1
Using 1 threads

real    0m1.666s
user    0m1.658s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 2
Using 2 threads

real    0m0.847s
user    0m1.670s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 50
Using 50 threads

real    0m0.434s
user    0m1.670s
sys 0m0.005s

As you can see, the "user" time stays almost identical. I actually tried a lot of other values too, but the results are the same, so I won't bore y'all with a dozen more runs that show almost the same thing.

This is running on a quad core processor, so you can see that the "more than 4 threads" times show the same "real" time as with "4 threads".

I doubt very much there is anything different in how Windows deals with threads.

I also compiled the code with #define MAX_THREADS 50 and the same again with 4. It made no difference compared to the code posted - this was just to cover the alternative where the compiler optimizes based on a compile-time constant.

By the way, the fact that my code runs some three to ten times faster suggests that the originally posted code was built in debug mode?

Mats Petersson
  • Yes, the code on Windows runs in debug mode and on Mac OS in optimized mode. The quantitative results are different, but I just wanted to show the strange performance difference between 4 and 50 threads. – dizel3d Apr 29 '13 at 06:17

I did some tests a while ago on Windows (Vista 64 Ultimate), on a 4-core/8-thread i7. I used similar 'counting' code, submitted as tasks to a threadpool with varying numbers of threads, but always with the same total amount of work. The threads in the pool were given a low priority so that all the tasks got queued up before the threads, and the timing, started. Obviously, the box was otherwise idle (~1% CPU used up on services etc.).

All runs: 8 tests of 400 tasks, each task counting to 10000000.

Threads   Ticks per test (8 runs)                    Average
      8   2199 2184 2215 2153 2200 2215 2200 2230    2199 ms
     32   2137 2121 2153 2138 2137 2121 2153 2137    2137 ms
    128   2168 2106 2184 2106 2137 2122 2106 2137    2133 ms
    400   2137 2153 2059 2153 2168 2122 2168 2138    2137 ms

With tasks that take a long time, and with very little cache to swap out on a context-change, the number of threads used makes hardly any difference to the overall run time.

Martin James

The problem you encounter is tightly bound to the way you subdivide the workload of your process. In order to make efficient use of a multicore system on a multitasking OS, you must ensure that there is always work remaining for all the cores, for as long as possible during your process's lifetime.

Consider the situation where your 4-thread process executes on 4 cores, and, because of the system's load, one of the threads finishes 50% faster than the others: for the remaining run time, the CPU can only devote 3/4 of its processing power to your process, since only 3 threads remain. In the same CPU-load scenario, but with many more threads, the workload is split into many more subtasks, which can be distributed more finely across the cores, all other things being equal (*).

This example illustrates that the timing difference is not actually due to the number of threads, but rather to the way the work has been divided, which is much more resilient to uneven core availability in the latter case. The same program built with only 4 threads, but with the work abstracted into a series of small tasks that the threads pull as soon as they become available, would likely produce similar or even better results on average, even though there would be the overhead of managing the task queue.

The finer granularity of a process task set gives it better flexibility.


(*) On a heavily loaded system, the many-threads approach might not be as beneficial, since the core left unused by the 4-thread version would be allocated to other OS processes, which in turn lightens their load on the three other cores still being used by your process.
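
A minimal sketch of the "small tasks pulled by threads" idea described above (an editorial illustration, not didierc's code): four threads repeatedly claim small chunks of the iteration space from a shared atomic counter, so a delayed thread simply ends up claiming fewer chunks and no core sits idle while work remains.

#include <atomic>
#include <thread>
#include <vector>

int main() {
    const long long total = 800000000LL;   // whole iteration space, as in the question
    const long long chunk = 1000000LL;     // small unit of work
    std::atomic<long long> next(0);        // start of the next unclaimed chunk

    auto worker = [&] {
        volatile long long x = 1;
        for (;;) {
            long long begin = next.fetch_add(chunk);   // claim a chunk
            if (begin >= total) break;                 // nothing left to do
            long long end = (begin + chunk < total) ? begin + chunk : total;
            for (long long i = begin; i < end; ++i)
                x += i / 3;
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker);   // only 4 threads
    for (auto &t : pool) t.join();
}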

didierc