
I have written a C++ finite volume solver to simulate 2D compressible flows on unstructured meshes, and parallelised the code using MPI (Open MPI 1.8.1). I partition the initial mesh into N parts (equal to the number of processors being used) using gmsh-Metis. In the solver, there is a function that calculates the numerical flux across each local face in the various partitions. This function takes the left/right values and reconstructed states (evaluated prior to the function call) as input and returns the corresponding flux. There is no inter-processor communication during this function call, since all the input data is available locally. I use MPI_Wtime to find the time taken by each such function call. With 6 processors (Intel Core i7-3770), I get the following results:

Processor 1: 1406599932 calls in 127.467 minutes

Processor 2: 1478383662 calls in 18.5758 minutes

Processor 3: 1422943146 calls in 65.3507 minutes

Processor 4: 1439105772 calls in 40.379 minutes

Processor 5: 1451746932 calls in 23.9294 minutes

Processor 6: 1467187206 calls in 32.5326 minutes

I am really surprised by these timings, especially those from processors 1 and 2. Processor 2 makes roughly 72 million more calls than processor 1 but takes about 1/7 of the time taken by processor 1. I reiterate that there is no inter-processor communication taking place in this function. Could the following cause this large a variation in time?

  1. Conditional (if) branches inside the function
  2. The magnitude of the input values. For instance, if a majority of the values for a processor are close to 0.

If not these, could there be any other reason behind this disparity?
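For reference, the per-call timing is accumulated essentially as in the simplified sketch below (the real flux routine and its arguments are of course more involved; the names and the placeholder flux formula here are purely illustrative):

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical stand-in for the local numerical flux routine: it takes
// left/right cell averages and reconstructed states and returns a flux.
static double face_flux(double uL, double uR, double uL_rec, double uR_rec)
{
    return 0.5 * (uL_rec + uR_rec) - 0.5 * (uR - uL);  // placeholder formula
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const long long n_calls = 1000000;   // stands in for the loop over local faces
    double sum = 0.0, flux_time = 0.0;

    for (long long f = 0; f < n_calls; ++f) {
        double t0 = MPI_Wtime();
        sum += face_flux(1.0, 0.9, 1.05, 0.95);   // purely local, no MPI inside
        flux_time += MPI_Wtime() - t0;            // accumulate per-call time
    }

    std::printf("rank %d: %lld calls in %g s (checksum %g)\n",
                rank, n_calls, flux_time, sum);
    MPI_Finalize();
    return 0;
}
```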

  • Are you sure you've got the model number correct? According to this http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz the i7-3770 is only a quad-core processor (although it is hyperthreaded to show 8 logical cores to the OS). – Simon Gibbons May 18 '15 at 14:34
  • You could use a [profiler](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux) to figure out where most of the time is being spent. – tcb May 18 '15 at 14:34
  • Yes Simon, you are right, there are only 4 physical cores with hyperthreading. However, I witnessed timing issues even while working with only 2 or 4 cores. As suggested by tcb, a profiler might give a clearer picture of what is happening. – rayd May 19 '15 at 05:10
  • Just curious: What do you mean by left/right values **and** reconstructed states? Aren't they the same? Next, I suppose that you are also passing the face normal vector to the function for flux calculation. So it is possible, depending on the partitioning done by Metis, that one of the processors is lucky enough to get nx = 0 or ny = 0 for most of the faces. However, that may not result in such a drastic change in timings. – Sourabh Bhat May 19 '15 at 09:31
  • Well, I am not using a MUSCL-type scheme where one would use the reconstructed states in the flux. The flux I am using requires both cell averages and reconstructed states, hence the distinction. And I agree, the fact that the normal components are predominantly zero for a processor should not lead to such a big difference in timings. – rayd May 19 '15 at 14:37
  • Perhaps you're getting lots of floating point values that are falling into the denormal range? Denormals can cause big slowdowns. E.g., see http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x – bg2b May 19 '15 at 18:09

1 Answer


In general, the phenomenon you're describing is known as load imbalance. There are many potential sources of this imbalance, depending on the problem, your numerical method, your parallelization scheme, your input data, your OS/runtime/application configuration, and even your hardware!

Hardware

Your stated processor supports Intel Turbo Boost, Intel RAPL (power limiting), and dynamic voltage and frequency scaling. Any of these can cause threads/processes running on some cores to finish faster than on others. For the sake of clarity while trying to resolve these issues, I recommend configuring your system to disable these features, as well as any idle/sleep-state power-saving options your BIOS may offer. Then re-run your benchmark.
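As a quick check before touching the BIOS, each rank can report the core it is currently running on and that core's frequency. The sketch below is Linux-specific (it assumes glibc's sched_getcpu() and the cpufreq sysfs interface) and is only meant to reveal whether some ranks are running on lower-clocked cores:

```cpp
#include <mpi.h>
#include <sched.h>     // sched_getcpu (glibc; g++ defines _GNU_SOURCE by default)
#include <cstdio>
#include <fstream>
#include <string>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int cpu = sched_getcpu();   // logical CPU this rank is executing on right now

    // Current frequency (kHz) as reported by the Linux cpufreq driver, if present.
    char path[128];
    std::snprintf(path, sizeof(path),
                  "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    std::ifstream freq_file(path);
    std::string khz = "unavailable";
    if (freq_file) freq_file >> khz;

    std::printf("rank %d on cpu %d, scaling_cur_freq = %s kHz\n",
                rank, cpu, khz.c_str());

    MPI_Finalize();
    return 0;
}
```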

Configuration

You want to match the degree and assignment of parallel execution with the actual resources available to run your program on your system. That means a few things:

  • Shut down heavily interfering processes running at the same time
  • Likely best choices for process counts are the number of hardware cores, the number of hardware threads, or one less than each of those to allow the OS to schedule other processes without interfering
  • Ensure scheduling affinity between each process and a distinct processor core/thread. This ensures both that the OS scheduler doesn't persistently try to move your processes around to no benefit, and that they don't end up contending for the same execution unit when they should have been running independently (see the note after this list).
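With Open MPI 1.8 you can request and verify such binding directly from the launcher, e.g. `mpirun -np 4 --map-by core --bind-to core --report-bindings ./solver` (the process count and executable name here are placeholders for your own); `--report-bindings` prints each rank's core mask at startup so you can confirm the pinning actually took effect.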

Application Specific Issues

  • Is your partitioning of the mesh relatively even across the processes? Do they have similar numbers of elements, and similar shares of boundary and interior? Boundary conditions often have very different computational cost from interior elements. You may need to tell your partitioner unequal weights for each element, so that it can try to produce partitions of approximately even weight rather than element count.
  • Are parts of your problem domain more expensive to compute than others? Your question mentions if statements inside the function in question. Evaluating the condition itself may cause variation through branch misprediction, if some processes get data that always goes one way while others get a mix. More severe would be that the two branches of that if represent substantially different amounts of work. If some processes take one branch much more than others, their workloads will differ accordingly.
  • Is this function using any sort of iterative/converging numerical method per mesh element, where different elements might require different numbers of iterations to reach a solution? Different processes may then have a different distribution of element computation costs (the sketch below shows one way to compare the per-rank totals).
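To pin down which of these is at play, it helps to reduce each rank's call count and accumulated flux time to one rank and compare the per-call cost. A minimal sketch, assuming you already track the count and time per rank with MPI_Wtime as your question describes (the variable names are placeholders):

```cpp
#include <mpi.h>
#include <cstdio>

// Call once after the flux loop; n_calls and flux_time are whatever your
// solver already accumulates per rank.
void report_flux_imbalance(long long n_calls, double flux_time)
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double per_call = flux_time / static_cast<double>(n_calls);
    double t_min, t_max, c_min, c_max;

    MPI_Reduce(&flux_time, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&flux_time, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&per_call,  &c_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&per_call,  &c_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        // If the per-call ratio is large, the work per face differs between
        // ranks (branches, iteration counts, denormals); if only the total
        // time differs, look at call counts and partition sizes instead.
        std::printf("total flux time: min %.2f s, max %.2f s (ratio %.2f)\n",
                    t_min, t_max, t_max / t_min);
        std::printf("per-call cost:   min %.3e s, max %.3e s (ratio %.2f)\n",
                    c_min, c_max, c_max / c_min);
    }
}
```

In the numbers you posted, the call counts across ranks are within about 5% of each other while the total times differ by roughly 7x, so the per-call cost itself must differ between ranks rather than the partition sizes.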
Phil Miller