
I am developing code for the scientific computing community, particularly for solving linear systems of equations (of the form Ax = b) iteratively.

I have used BLAS and LAPACK for the primitive matrix subroutines, but I now realize there is some scope for manual parallelization. I am working on a shared-memory system, which leaves me with two choices: OpenMP and Pthreads.

Assuming that development time isn't the greatest factor (and performance of the code is), which is the better, more future-proof, and perhaps more portable (to CUDA) way of parallelizing? Is the time spent using Pthreads worth the performance boost?

I believe that my application (which basically deals with starting many things off at once and then operating on the "best" value from all of them) will benefit from explicit thread control, but I'm afraid the coding will take too much time and in the end there will be no performance payoff.

I have already looked at a few similar questions here, but they all pertain to general applications.

This one concerns a generic multithreaded application on Linux.

This is a general question as well.

I am aware of SciComp.SE but felt it was more on topic here.

  • "basically deals with starting many things off at once and then operating upon the 'best' value from all of them" I believe that [CPlex](http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/) features an algorithm similar to yours. I don't know what they chose for the underlying parallelization tool, but maybe you could find out (it doesn't necessarily mean that their choice would be the best for you, but it's always good to know). – François Févotte Mar 24 '12 at 19:05
  • Boost threads gives a very nice interface to pthreads (or whatever) if you work with C++. Totally worth it IMO. But I opted for OpenMP myself eventually, due to ease of programming. Also consider Intel IPP/TBB. – Anycorn Mar 24 '12 at 19:35
  • If you're using BLAS or LAPACK why don't you just use Eigen instead? It has built in support for SIMD (SSE) and OpenMP. – Z boson Nov 07 '13 at 07:18

3 Answers


Your question reads as if you expect that the coding efficiency with OpenMP will be higher than with Pthreads, and the execution efficiency higher with Pthreads than with OpenMP. In general I think that you are right. However, a while back I decided that my time was more important than my computer's time and opted for OpenMP. It's not a decision I have had cause to regret, nor is it a decision I have any hard evidence to validate.

However, you are wrong to think that your choices are limited to OpenMP and Pthreads: MPI (I assume you've at least heard of it; post again if not) will also run on shared-memory machines. For some applications, MPI can be programmed to outperform OpenMP on shared-memory computers without much difficulty.

Three (plus or minus a few) years ago the essential parallelisation tools in the scientific developer's toolbox were OpenMP and MPI. Anyone using those tools was part of a large community of fellow users, larger (anecdotal evidence only) than the community of users of Pthreads. Today, with GPUs and other accelerators popping up all over the place, the situation is much more fragmented and it's difficult to pick the winners from among HMPP, ACC, Chapel, MPI-3, OpenMP 4, CUDA, OpenCL, etc. I still think that OpenMP+MPI is a useful combination, but I can't ignore the new kids on the block.

FWIW I work on the development of computational EM codes for geophysical applications so quite hard core 'scientific computing'.

High Performance Mark
  • Well, I have tried running ScaLAPACK instead of BLAS on shared memory, but the "Hello World" itself is so difficult that it is off-putting. If I am not mistaken, CUDA is based on a pthread-like "model"? I don't have much experience with CUDA, but the way the codes for cuBLAS seem to be written, it looks similar to pthreads. If I were sure that my application were to be ported to GPU soon, what would you recommend then? All other factors would be of lesser importance then. – Mar 24 '12 at 10:00
  • I don't have sufficient experience of GPU computing to offer good advice. – High Performance Mark Mar 24 '12 at 10:02
  • GPU computing != general parallel computing. Putting OpenMP/MPI/"OS threads" in the same boat as OpenCL/CUDA is just... weird. – rubenvb Mar 24 '12 at 10:09
  • I'm not sure anyone here is equating GPU computing with general parallel computing, which startles you, @rubenvb. I do see people using GPUs to tackle scientific/engineering number-crunching problems. SO is littered with questions on the topic. – High Performance Mark Mar 24 '12 at 10:12
  • @rubenvb, are you implying that GPU/CUDA is not used by the scientific community? I mean, sure, they might not be the people who use it the most, but when it comes to Tesla/Fermi, we probably order them more than anyone else. – Mar 24 '12 at 10:17
  • @Nuxonic I was only saying that CUDA/OpenCL are entirely different beasts than the classic CPU multithreading. There are greatly different constraints (on programming language and physical resources) than the "classic" multithreading approach. – rubenvb Mar 24 '12 at 10:33
  • +1 - I always enjoy High Performance Mark's posts. This one is no different. – duffymo Mar 24 '12 at 15:43
  • You can also consider Microsoft's Parallel Patterns Library: http://msdn.microsoft.com/en-us/library/dd492418.aspx – quant_dev Mar 24 '12 at 20:20

I realize that my answer is pretty long, so I'm putting the conclusion first for the impatient:

Short answer:

I would say OpenMP and pthreads are essentially the same, and you should pick whichever requires the least development time for you (probably OpenMP, if it fits your needs). But if you want to invest development time, maybe you should redesign your code so that it can adapt to other paradigms (for example, vectorization to take advantage of SSE/AVX, or GPUs).

Development:

If you develop linear solvers, I assume your code will be (very) long-lived (i.e. it will probably outlive the physical models that use it). In such conditions, and especially if you don't have a large development team, I think you should base your choice primarily on development time, maintainability and portability.

Also, you should not assume that the "best" choice today (whatever "best" might mean) will still be the "best" choice tomorrow. So even if you're faced with an OpenMP vs. pthreads problem now (and even now the spectrum is already larger than that, as said in @HighPerformanceMark's answer), you should expect to have more alternatives to choose from in the future.

If you have development time to spend now, I would thus say it would be better invested in abstracting all the computation-intensive kernels in your code in such a way that you can easily adapt them to different parallelization paradigms. In this respect, the most important (and difficult) thing to deal with is the data structure: benefiting from coalescing in GPGPU calculations requires putting your data in a different order than the traditional cache-optimizing way.

Which leads me to the conclusion: all thread-based solutions are essentially equivalent (both in terms of performance and code architecture), and you should pick whichever requires the least development time. But if you want to invest development time, maybe you should redesign your code so that it can be either parallelized or vectorized (and thus take advantage of SSE/AVX or GPUs). If you manage to do this, you'll be able to follow hardware/software evolution and maintain performance.

François Févotte
  • "..: all thread-based solutions are essentially equivalent (both in terms of performance and code architecture) and you should pick whichever solution requires the least development time.." If I assume that to be true, then isn't OpenMP the default winner, because writing code in OpenMP is much faster than in Pthreads? – Mar 24 '12 at 19:56
  • @Nunoxic Yes, but pThreads can do everything OpenMP can (even though it might be more difficult for you to develop the code), whereas on the contrary there are some things which OpenMP can't do (or is not designed to do easily) but pThreads can. (As a real-life example, look at [this question](http://stackoverflow.com/q/9685403/1225607), where multiple nested OpenMP constructs are necessary to setup a lone thread doing different operations than its neighbours, when such a thing would have caused no problem in a pThreads implementation) – François Févotte Mar 24 '12 at 20:08
  • Classic case of Simplicity and Flexibility. Darn. Thanks +1 ! –  Mar 24 '12 at 20:11

To add to the already excellent answers: OpenMP generally does a better job of parallelizing my code than I do when I write pthreads. Given that OpenMP is also easier, I always opt for it if those are my options. I suspect that if you are asking this question, you aren't a pthreads guru, so I'd also recommend using OpenMP over pthreads.

Levi Morrison