
I'm developing some code in C++ for my research in computational dynamics. My code solves sparse and dense linear systems, generates meshes, and performs similar operations, all in the most straightforward way. I needed to parallelize my code to reduce the computation time, and I used OpenMP for that purpose.

But after a closer look at commercially available codes, like ANSYS CFX, I noticed that the parallelization scheme used in that software is MPICH2, which is an implementation of MPI.

So there are a lot of parallelization tools/APIs:

  • OpenMP
  • MPI
  • Intel Threading Building Blocks
  • Pthreads
  • Microsoft PPL

I have used some of these tools and managed to get 100% CPU usage on my local machine with each of them.

I don't know what criteria I should pay attention to when choosing the proper parallelization tool. What kinds of applications require which tool? Are any of the above suitable for research purposes? Which of them is used most in commercial software?

Emre Turkoz
  • Sounds like an ideal case for CUDA. Take a look. Using the CPU for that will soon be considered harmful. –  Jun 05 '12 at 18:02
  • I agree. But I was thinking of CUDA as an accelerator. One should be able to use both CUDA and one of the parallelization schemes above in the post – Emre Turkoz Jun 05 '12 at 18:04
  • It is not an accelerator, it is just really fast, easy-to-use hardware designed for parallel calculations. You load your data there through PCIe, it crunches the numbers, then you load the results back. You don't need both CPU- and CUDA-based solutions. There is also OpenCL, which is less vendor-specific and is implemented for other ASICs (I think Altera offers FPGA-based OpenCL implementations, or was planning to do so) –  Jun 05 '12 at 18:07
  • Thanks. I also have some experience with CUDA. I implemented a small algorithm for blocked Cholesky factorization and experienced its speed. But still, I find CUDA hard to use. It's great that it's mentioned even here, which indicates that it's becoming more and more widespread – Emre Turkoz Jun 05 '12 at 18:11
  • In my opinion OpenCL will become the standard for highly parallelized and real-time-critical applications. The possibility to easily calculate on a GPU is awesome; GPUs perform so much faster than CPUs. One should mention here that it's also possible to run OpenCL code on the CPU without touching the code. – Sebastian Hoffmann Jun 05 '12 at 18:24
  • @VladLazarenko, CUDA _is_ an accelerator framework - you cannot target general-purpose code to it, and it requires a host CPU in order to operate, just like OpenCL (although the latter can also target the same host CPU). Once upon a time such modules were called _transputers_, but _accelerators_ sounds much more marketing-friendly. – Hristo Iliev Jun 05 '12 at 18:55
  • @Paranaix: GPUs don't perform faster than CPUs; they perform in parallel. CPUs are sequential (in most cases) no matter what you do, and multi-CPU overhead is bigger than the gain you get. So yes, OpenCL is the future, because it is down to the hardware :) –  Jun 05 '12 at 19:37
  • @VladLazarenko Besides the more highly optimized floating-point arithmetic, I talked about GPUs in general. I didn't want to get into that topic because it can get pretty complex. Unfortunately I don't have my OpenCL book at hand, whose first chapter is just about "GPU vs CPU" ;) – Sebastian Hoffmann Jun 05 '12 at 19:50

2 Answers


As with many questions of this type, there is no single definitive answer. You can't really say what's better, because the answer is always "it depends": on what you're doing, on how your code is written, on what your portability requirements are, and so on.

Following your list:

  • OpenMP: pretty standard, and I found it really easy to use. Even if the original code has not been written with parallelization in mind, this library makes a step-by-step approach very easy. I think it's a good entry point for parallel computing because it makes everything easy, but it's hard to debug, limited in performance, and it just makes code parallel (it lacks parallel algorithms, structures, and primitives, and you can't span the work across a network).
  • Message Passing Interface: from my point of view, a library based on this standard is best suited to spanning a large computation across a cluster. If you have a few computers and you want to compute in parallel, then this is a good choice: well known and stable. It's not (again, in my point of view) a solution for local parallelization. If you're looking for a well-known, widely used standard for grid computing, then MPI is for you.
  • Intel Threading Building Blocks: this is a C++ library that unifies the interface for multithreading across different environments (pthreads or the Windows threading model). Use a library like this when you need to be portable across compilers and environments. Moreover, using this library doesn't limit you, so it can be well integrated with something else (for example, MPI). You should take a look at the library to see if you like it; it's a very good choice, with a good design, well documented, and widely used.
  • Microsoft Parallel Patterns Library: this is a very big library. It's quite new, so I do not feel confident suggesting that someone use it without a good test; moreover, it's Microsoft-specific, so you're tied to its compiler. That said, from what I've seen it's a great library. It abstracts a lot of details, it's well designed, and it provides a very high-level view of the concept of a "parallel task" (a minimal sketch follows this list). Again, using this library doesn't stop you from using, for example, MPI for clusters (but the Concurrency Runtime has its own library for this).
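
To give a feel for that high-level view, here is a minimal PPL sketch (MSVC-only); the vector and the doubling are my own illustration, not something from the question:

    #include <ppl.h>
    #include <vector>

    int main() {
        std::vector<double> data(1000000, 1.0);
        // parallel_for splits the index range across worker threads
        // managed by the Concurrency Runtime scheduler.
        concurrency::parallel_for(std::size_t(0), data.size(),
            [&](std::size_t i) { data[i] *= 2.0; });
        return 0;
    }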

What to use? I do not have an answer; just try them and pick what you feel most comfortable with (take a look at Boost Threads too). Please note that you can mix them to some extent (for example OpenMP+MPI, MPI+TBB, or even MPI+PPL). My preference is for PPL, but if you're developing a real-world application you may need a long test to decide what's better. Actually, I like the Concurrency Runtime (the base of PPL) because it's "horizontal": it provides a basic framework (with structures and algorithms) for parallel computing, plus a lot of "vertical" packages (Agents, PPL, TPL).

That said, once you have made your computation parallel, you may need to improve the performance of some CPU-intensive routine. You may consider using the GPU for that task; I think it offers its best for short, massively parallel computations (of course, I prefer OpenCL over the proprietary CUDA, even if CUDA performance may be higher). You may even take a look at OpenHMPP if you're interested in this topic.

Adriano Repetti
  • I agree with you on PPL. It's also very easy to use. The only problem is that clustering is very expensive with Windows. But for local parallelization it looks really good. – Emre Turkoz Jun 05 '12 at 18:21
  • @EmreTurkoz you're right, doing clustering with Windows you may spend more money on software than on hardware. Anyway, this doesn't stop you from mixing libraries (if you need to), for example using MPI on cheap Linux-based satellite computers and PPL + MPI on the "main" computer. – Adriano Repetti Jun 05 '12 at 18:27
  • FYI, most of the PPL is available in TBB, which is not MS-specific. – Rick Jun 05 '12 at 19:32
  • @Adriano: Can you initiate a parallel routine with MPI using a Windows machine as the "master" and other Linux machines as "slaves"? – Emre Turkoz Jun 05 '12 at 19:46
  • @Rick of course, I do not mean that PPL offers structures/algorithms not available in other libraries! By "Microsoft specific" I mean that they are not portable across compilers, so you're tied to VC++ – Adriano Repetti Jun 05 '12 at 19:46
  • @EmreTurkoz yes, what's nice about MPI is that it's pretty standard and widely available, so you can "interop" and mix different architectures. – Adriano Repetti Jun 07 '12 at 09:24

Consider this an extended comment on (and an extension of) Adriano's answer.

OpenMP is really straightforward to master and use, and it has the nice feature that both serial and parallel executables can be produced from one and the same source code. It also allows you to take a gradual parallelisation path if you need to convert an existing serial code to a parallel one (see the sketch below). OpenMP has a set of drawbacks, though. First, it targets only shared-memory machines, which severely limits its scalability, though large x86 SMP machines are now available (e.g. we have QPI-coupled Xeon systems with 128 CPU cores sharing up to 2 TiB of RAM in our cluster installation, specifically targeted at large OpenMP jobs). Second, its programming model is too simple to allow the implementation of some advanced concepts. But I would say that this is a strength rather than a drawback of the model, since it keeps OpenMP concise.
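
A minimal sketch of what that gradual path looks like (the function and names are mine, purely illustrative): the same source builds serially or in parallel depending on a compiler switch.

    #include <vector>

    // Scale a vector in parallel; without -fopenmp (GCC) or /openmp (MSVC)
    // the pragma is silently ignored and the loop runs serially.
    void scale(std::vector<double>& v, double alpha) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            v[i] *= alpha;   // iterations are independent, memory is shared
    }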

MPI is the de facto standard message-passing API nowadays. It is widely supported and runs on a vast variety of parallel architectures. Its distributed-memory model imposes little to no restrictions on the underlying hardware (apart from a low-latency and high-bandwidth network interconnect), and this allows it to scale to hundreds of thousands of CPU cores. MPI programs are also quite portable at the source level, although the algorithms themselves might not possess portable scalability (e.g. one MPI program might run quite efficiently on a Blue Gene/P and horribly slowly on an InfiniBand cluster). MPI has one severe drawback: its SPMD (Single Program Multiple Data) model requires a lot of schizophrenic thinking on behalf of the programmer and is much harder to master than OpenMP. Porting serial algorithms to MPI is never as easy as it is with OpenMP, and sometimes a complete rewrite is necessary in order to achieve high parallel efficiency. It is also not possible to take the gradual parallelisation approach and to easily maintain a codebase that can produce both serial and parallel executables. MPI has an interesting feature: since it completely separates the different parts of the program that run on separate nodes and provides an abstract interface to the network, it allows for heterogeneous computing. Several MPI implementations (e.g. Open MPI) provide heterogeneous support, which allows one to mix not only nodes running different OSes but also CPUs with different "bitness" and endianness.
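
Here is a minimal SPMD sketch, assuming the usual mpicxx/mpiexec toolchain (the partial-sum workload is just a stand-in I made up):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // who am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many of us?

        // Every rank runs this same program; the work is split by rank.
        const long N = 1000000;
        long long local = 0;
        for (long i = rank; i < N; i += size)
            local += i;

        long long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("sum = %lld\n", total);
        MPI_Finalize();
        return 0;
    }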

Intel TBB is like OpenMP on steroids. It provides a much richer programming model based on kernels, which puts it closer to other parallel programming paradigms like CUDA or OpenCL. It draws heavily from the C++ STL algorithms in terms of applicability and extensibility. It is also supposed to be compiler-neutral and in principle should work with the Intel C++ Compiler, GNU g++, and MSVC. ITBB also uses the task-"stealing" ideology, which can potentially even out the computational imbalance that the previous paradigms are prone to if no precautions are taken.
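
A sketch of the TBB equivalent of the OpenMP loop above (again with illustrative names); the scheduler steals work between threads to balance the load:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    // The range is recursively split into chunks; idle worker threads
    // steal chunks from busy ones, which evens out load imbalance.
    void scale(std::vector<double>& v, double alpha) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, v.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    v[i] *= alpha;
            });
    }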

Pthreads is the portable threading interface of most modern Unix-likes (e.g. FreeBSD, Mac OS X, Linux, etc.). It is just a threading library and is geared towards the most general usage cases one can imagine. It provides little to no parallel constructs, and one has to program them explicitly on top of it; e.g. even a simple distribution of loop iterations à la OpenMP has to be hand-coded, as the sketch below shows. Pthreads is to Unix exactly what Win32 threads is to Windows.
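
To make that concrete, here is a sketch of what hand-coding that loop distribution looks like with raw Pthreads; all names are illustrative:

    #include <pthread.h>
    #include <vector>

    static std::vector<double> data(1000000, 1.0);

    struct Slice { long begin, end; };  // one slice of the iteration space

    static void* scale(void* arg) {
        const Slice* s = static_cast<Slice*>(arg);
        for (long i = s->begin; i < s->end; ++i)
            data[i] *= 2.0;
        return 0;
    }

    int main() {
        const int nthreads = 4;
        pthread_t threads[nthreads];
        Slice slices[nthreads];
        const long n = static_cast<long>(data.size());
        for (int t = 0; t < nthreads; ++t) {
            // Distribute the iterations by hand - OpenMP does this for you.
            slices[t].begin = t * n / nthreads;
            slices[t].end   = (t + 1) * n / nthreads;
            pthread_create(&threads[t], 0, scale, &slices[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(threads[t], 0);
        return 0;
    }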

(I would skip Microsoft PPL since I don't really know that library.)

Mixing those concepts is clearly the way of the future, as single nodes progressively get more and more cores. Multiple levels of parallelism are possible with most algorithms, and one can use MPI to perform the coarse-grained parallelism (running on multiple cluster nodes) while OpenMP or ITBB performs the fine-grained division of each node's computation. Shared-memory programming can usually utilise memory resources better, since data is shared between threads, and things like cache reuse can speed up the calculations considerably. MPI can also be used to program a multicore SMP or NUMA machine, but each MPI process is a separate OS process with its own virtual address space, which means that lots of (configuration) data might need to be replicated. The MPI people are working towards improvements to the standard to allow MPI processes to run as threads, and "MPI endpoints" might end up in the forthcoming version 3.0 of the MPI standard.
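
A hedged sketch of that hybrid layout, assuming an MPI library built with thread support: MPI splits the coarse work across nodes, OpenMP splits each node's share across its cores (the workload itself is a made-up placeholder).

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        // FUNNELED: threads exist, but only the main thread calls MPI.
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Fine grain: OpenMP threads share this rank's memory.
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < 1000000; ++i)
            local += 1.0;  // stand-in for the real per-node work

        // Coarse grain: combine the per-node results over the network.
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("total = %g\n", total);
        MPI_Finalize();
        return 0;
    }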

I would suggest picking the one that is closest to your programming background. If you are an avid C++ programmer and breathe abstractions, then pick Intel TBB (or Microsoft PPL if you don't mind being tied to VC++). OpenMP is really easy to master and provides good performance, but it is somewhat simplistic. It is still the only widely available and used mechanism for writing multithreaded code in Fortran. MPI has a steep learning curve, but it can always be bolted on later if your program outgrows what a single compute node can provide.

Hristo Iliev