
I have an application that can start any number of threads. Each thread does the same task: iterating over a vector that contains 5000 messages and processing each one of them.
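
The structure is roughly like the following simplified sketch (Message and process() here are just placeholders for the real message type and the real per-message work, which I have omitted; the actual code may use pthreads rather than std::thread):

    // Simplified sketch of the setup (Message and process() are placeholders
    // for the real message type and the real per-message work).
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <thread>
    #include <vector>

    struct Message { std::string payload; };

    void process(const Message& /*m*/)
    {
        // application-specific work on one message
    }

    void worker(int id)
    {
        // each thread owns its own vector, so nothing is shared between threads
        std::vector<Message> messages(5000);
        auto start = std::chrono::steady_clock::now();
        for (const Message& m : messages)
            process(m);
        auto end = std::chrono::steady_clock::now();
        std::printf("thread %d: %.3f s\n", id,
                    std::chrono::duration<double>(end - start).count());
    }

    int main(int argc, char** argv)
    {
        int n = (argc > 1) ? std::atoi(argv[1]) : 1;   // number of threads
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i)
            threads.emplace_back(worker, i);
        for (std::thread& t : threads)
            t.join();
    }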

There is no resource competition between threads and no race condition at all. There are 4 CPU cores on the box where I run my application, and no other process is doing any CPU-consuming work at the time I run it.

However, the results I got are as follows.

If only one thread is running, it takes the thread 0.45 seconds to process those 5000 messages.

If 4 threads are running, it takes each thread around 0.55 seconds to process those messages, more than a 20% increase.

If there are more messages to process, say 150,000, then there is no difference in processing time between running 1 thread and 4 threads.

I don't understand what causes the time increase when 4 threads are running; there are 4 CPU cores, which is enough for 4 threads.

And why is there no increase when the processing time per thread is longer?

I ran my test on Linux 2.6.26; the scheduler has been improved since 2.6.18. I ran the same test on 2.6.18 too, and the result was worse with 4 threads running, which confirms that the scheduler was indeed improved.

宇 雷
  • Threads don't magically make your app faster. – Cody Gray - on strike Feb 10 '12 at 07:24
  • I understand that. But in my case, each thread should be able to run on a different core, and there is no race condition at all. All threads should run just as fast as when only one thread is running, shouldn't they? – 宇 雷 Feb 13 '12 at 05:48

1 Answer


Threads bring the most benefit in IO-bound processes. However, given the setup you describe, a 20% performance decrease is probably a sign that your threads are running into contention on some library/system calls.
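
One way to narrow it down is to isolate a suspect call in a tiny benchmark and compare its per-thread time when run alone versus in parallel. For example, to check whether the allocator is the contended resource (a minimal sketch; the loop count and allocation size are arbitrary):

    // Micro-benchmark sketch: time the same allocation loop once alone and
    // once with four threads running it concurrently.
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    void alloc_loop(const char* label)
    {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 1000000; ++i) {
            char* p = new char[100];
            p[0] = static_cast<char>(i);   // touch the memory
            delete[] p;
        }
        auto end = std::chrono::steady_clock::now();
        std::printf("%s: %.3f s\n", label,
                    std::chrono::duration<double>(end - start).count());
    }

    int main()
    {
        alloc_loop("1 thread");            // baseline: allocator uncontended

        std::vector<std::thread> threads;  // same work on four threads at once
        for (int i = 0; i < 4; ++i)
            threads.emplace_back(alloc_loop, "4 threads");
        for (std::thread& t : threads)
            t.join();
    }

If the per-thread time grows noticeably in the parallel case, the call you isolated is a likely culprit.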

mac
  • Is there a way to detect contention? Does the 'new' operation count as "contention on system calls"? – 宇 雷 Feb 13 '12 at 05:50
  • If you do lots of allocations, 'new' is definitely a multi-threading bottleneck. The best way to avoid it is to have a heap per thread. See http://stackoverflow.com/questions/470683/memory-allocation-deallocation-bottleneck – mac Feb 13 '12 at 08:48
  • I used valgrind (callgrind) to run my application; all I can see is that the timing increases for every call. – 宇 雷 Feb 13 '12 at 10:02
  • I used valgrind (callgrind) to run my application. Callgrind presents its results as percentages, so if one thread calls ten methods, the result would be that method1 consumes 10% of the total time, method2 consumes 20%, and so on. The results barely changed when I increased the thread count from 1 to 4, which means the time spent on every call increased by the same percentage. If there were contention, I should see certain methods consuming more time; if 'new' were the problem, I should see the 'new' operations consuming more time, but I didn't. – 宇 雷 Feb 13 '12 at 10:08
  • I also did a simple test: running "new char[100]" 1 million times in 1 thread and in 4 threads, respectively. The time is the same, which should mean that 'new' is not the problem. Maybe? I am not sure. Too many mysteries. – 宇 雷 Feb 13 '12 at 10:12
  • 'new' is one of the possibilities, particularly when you start to have fragmentation. Difficult to say more. You might experiment by gradually introducing more of the "valgrind top-10" into your simple test program to see which one introduces a performance hit. – mac Feb 13 '12 at 13:51
  • I don't think 'new' is the problem, because when each thread needs to run longer, the time remains the same as the thread count increases from 1 to 4. I did a little searching on Google; someone else had a similar question, and in his tests multiple processes were much faster than multiple threads when running totally independent algorithms in each thread/process. There was an article explaining this, but unfortunately I can't find it. – 宇 雷 Feb 15 '12 at 02:49
  • I was wrong: 'new' is the problem, yet it still remains a mystery to me, because if I add the following code to each thread and make it run first, then the processing time of each thread stays the same as I increase the thread count: for (size_t i = 0; i < 10000; ++i) { char * str = new char[100]; delete[] str; } – 宇 雷 Feb 15 '12 at 06:24
  • It needs to be "for (size_t i = 0; i < 100000; ++i)"; if the number is too small, like 10000, it doesn't work. More and more mysterious, huh. – 宇 雷 Feb 15 '12 at 06:26
  • If most of your memory allocations are objects of a fixed size, you should write your own allocator out of a heap per thread. – mac Feb 15 '12 at 09:31
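
As a rough illustration of the heap-per-thread idea in the last comment: if the allocations are all the same size, each thread can keep its own free list of blocks, so allocation and deallocation never touch a lock shared with other threads. This is only a minimal sketch (ThreadLocalPool is a hypothetical name; it assumes C++11 thread_local and that blocks are freed by the same thread that allocated them):

    // Rough sketch of a per-thread pool for fixed-size blocks.
    #include <cstddef>
    #include <new>

    template <std::size_t BlockSize>
    class ThreadLocalPool {
        struct Node { Node* next; };
        static_assert(BlockSize >= sizeof(Node*),
                      "block must be large enough to hold a free-list link");
        Node* free_list_ = nullptr;

    public:
        void* allocate()
        {
            if (free_list_) {                  // reuse a previously freed block
                Node* n = free_list_;
                free_list_ = n->next;
                return n;
            }
            return ::operator new(BlockSize);  // fall back to the global heap
        }

        void deallocate(void* p)
        {
            Node* n = static_cast<Node*>(p);   // push the block onto the free list
            n->next = free_list_;
            free_list_ = n;
        }
    };

    // One pool per thread: allocate/deallocate never contend with other
    // threads, unlike the global heap behind plain 'new'.
    thread_local ThreadLocalPool<100> message_pool;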