
Using OpenMP I divided a simple algorithm among different threads, but the execution time is increasing drastically. This may be due to all the threads running on the same CPU core. I am aware that if my CPU is dual-core or quad-core, then assigning more threads than the number of CPU cores will not help much, but even with two threads the execution time is increasing.

gsamaras
  • You can bind the threads, which tells OpenMP not to move the threads around, but OpenMP, at least last time I looked into it (before OpenMP 4.0), provides no methods to assign a particular thread to a particular core. If you bind the threads you're stuck with whatever topology was defined when you start your code. You have to use a non-OpenMP function to bind a thread to a particular core. – Z boson Jul 13 '16 at 06:28
  • With hyper-threading on Windows at least, the default topology puts the first two threads on the same core, which is obviously not ideal. I think the same thing happens with AMD modules on Linux and Windows (first two threads on the first module). AMD modules are really only a single floating-point core. If you just use the default number of threads with Intel & AMD it's usually not a problem, because all cores get filled up, but if you use fewer threads than the number of logical processors it might not scale like you expect. – Z boson Jul 13 '16 at 06:34
  • Why do you want to control the number of threads? Why not just use the default? – Z boson Jul 13 '16 at 06:35
  • related http://stackoverflow.com/questions/8325566/openmp-and-cpu-affinity – dvhh Jul 13 '16 at 06:51
  • Out of curiosity, how did you come to the conclusion that "the execution time is increasing"? How did you time it? – Gilles Jul 13 '16 at 09:17
  • Small correction, using the default number of threads on AMD actually may not be a good idea at least for floating point because for floating point operations with two threads for every core is actually less efficient. I'm looking forward to the Zen microarch and the end of the Bulldozer microarch (though XOP was a nice idea). – Z boson Jul 13 '16 at 12:51
  • 2
    The whole question is based on the completely unfounded assumption that your threads are running on the same CPU core. The first step can only be to show in detail how you came to that conclusion providing necessary information to reproduce or understand the issue. There is absolutely nothing in the question that can even serve as a starting point. All discussion here is completely hypothetical. – Zulan Jul 13 '16 at 15:51
  • @Zulan completely agreed. Hence my question about the timing process. I wouldn't be surprised if we are in a CPU time vs. elapsed time issue. – Gilles Jul 13 '16 at 16:13
  • I couldn't find the source, but I remember reading that on Amazon EC2 machines, Linux loses the information about physical cores and performs sub-optimal scheduling; e.g., if the machine is 2-core/4-thread and the job uses 4 busy threads, the 4 busy threads could all be assigned to the same physical core. – user3528438 Jul 13 '16 at 16:36

1 Answer


Yes, you can determine which CPU a thread runs on.

For example, you can use the Intel Thread Affinity Interface (Linux* and Windows*), as talonmies suggested, or, since OpenMP 4.0, the standard's own affinity controls (places). However, notice what the Intel article mentions:

thread affinity can have a dramatic effect on the application speed.
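
A sketch of how binding is typically requested with the vendor-neutral OpenMP 4.0 mechanism Hristo Iliev mentions in the comments, assuming a Linux shell and an OpenMP 4.0+ runtime (`./my_app` is a placeholder for your own binary):

```shell
# Pin two OpenMP threads to distinct physical cores (OpenMP 4.0 "places").
export OMP_NUM_THREADS=2
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # bind threads to adjacent places
# ./my_app                   # placeholder: run your own program here
echo "$OMP_NUM_THREADS $OMP_PLACES $OMP_PROC_BIND"
# prints: 2 cores close
```

With `OMP_PLACES=cores`, two threads on a hyper-threaded machine land on separate physical cores instead of sharing one, which addresses the default-topology problem described in the comments above.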


Two main reasons can explain the slower execution:

1) Oversubscription: if you have more threads than cores, the extra threads have to wait for a free core, so part of the work is effectively serialized.

With that said, assuming you have 2 cores, having 4 threads ready for execution means all of them compete for the 2 cores: 2 execute in parallel while the other 2 wait to be scheduled. (Strictly speaking this is resource contention, not a race condition. Note also that on hyper-threaded CPUs, running two threads per physical core is by design and can even help, as Z boson points out in the comments.)

2) The size of your problem is too small.

Running in parallel doesn't come for free. There is a fair amount of housekeeping involved in orchestrating the parallel execution: OpenMP has to create and synchronize its threads, and the OS has to schedule more threads/processes as well.

So, unless the size of your problem is big enough, you won't see any speedup (worse, you will see a slowdown, as in your case), since the overhead of orchestrating the parallel execution will dominate your application's actual work.

gsamaras
  • https://software.intel.com/en-us/node/522691 – talonmies Jul 13 '16 at 07:05
  • Yeah Intel has one, sure I forgot! Thanks @talonmies for the upvote and for the link. I will update my answer. – gsamaras Jul 13 '16 at 07:07
  • So does gcc. And I didn't upvote this – talonmies Jul 13 '16 at 07:07
  • @talonmies are you sure? I know about the Intel one, but not about gcc. OK, sorry, I just saw the notifications together. :) [edit]I checked your profile, you know what you are talking about.[/edit] – gsamaras Jul 13 '16 at 07:50
  • "With that said, assuming you have 2 cores, then having 4 threads ready for execution, will result in race conditions" I think this statement is misleading because most of the computers I use have hyper-threading, which does exactly this (usually 8 threads and 4 cores). Oversubscribing has benefits, and hardware oversubscribing gives a boost most of the time, because it's hard to squeeze the most out of a super-scalar core with one thread, and having another thread do something different can make better use of the instruction-level parallelism of superscalar cores. – Z boson Jul 13 '16 at 12:48
  • @Zboson thanks for your comment, I see your point...How do you suggest I should modify that statement to address the issue in a better way? I feel that you really disliked that, judging by the downvote. – gsamaras Jul 13 '16 at 22:33
  • I did not downvote you. I rarely downvote. My downvote rate is 1.7% (I have 41 downvotes in total out of 2313 votes). A question/answer has to be really bad for me to downvote. I can't remember the last time I downvoted. It's been months. – Z boson Jul 14 '16 at 07:41
  • 1
    I completely disagree with your answer. Thread binding is so important in most areas where OpenMP is used (HPC, engineering, etc.) that OpenMP 4.0 introduced the concept of _places_, which provides a vendor-neutral way of pinning specific threads to specific CPUs. Also, _"thread affinity can have a dramatic effect on the application speed"_ is usually understood as positive effect due to stupid OS schedulers and the cost of reloading the caches when threads get migrated. – Hristo Iliev Jul 14 '16 at 13:29
  • @HristoIliev I see, I did update the answer, is it better now? :) Z boson, OK, it's not the down vote I care about, but the improvement of the answer! :) – gsamaras Jul 14 '16 at 16:49
  • 2
    I've voted to close the question as missing vital information. I would rather wait for the OP to provide that information before even trying to comment on the issue as it could be something as embarrassing as using `clock()` to measure the wall time on Linux. – Hristo Iliev Jul 15 '16 at 06:53