1

I've been trying to exploit parallelization to run some simulations with the MEEP simulation software a bit faster. By default the software only uses one CPU core, and FDTD simulations are easily sped up by parallelization. In the end I found there was no difference between running on 1 or 4 cores; the simulation times were the same.

I then figured I would instead run individual simulations on each core to increase my total simulation throughput (for example running 4 different simulations at the same time).

What I found surprising is that whenever I start a new simulation, the already running simulations slow down, even though they run on separate cores. For example, if I run only 1 simulation on 1 core, each time step of the FDTD simulation takes around 0.01 seconds. If I start another process on another core, each simulation now spends 0.02 seconds per time step, and so on. In other words, even when I run different simulations that have nothing to do with each other on separate cores, they all slow down, giving me no net increase in speed.

I'm not necessarily looking for help to solve this problem as much as I'm looking for help understanding it, because it piqued my curiosity. Each instance of the simulation requires less than 1% of my total memory, so it's not a lack-of-memory issue. The only thing I can think of is the cores sharing the cache memory, or the memory bandwidth being saturated. Is there any way to check whether this is the case?
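One way I can think of to test the shared-cache / memory-bandwidth idea without involving MEEP at all is a small streaming benchmark: start one copy, then more copies pinned to other cores, and watch whether the per-copy time degrades the way the simulations do. A minimal sketch (plain NumPy; the array size and repeat count are arbitrary placeholders I picked):

```python
# A standalone memory-bandwidth probe (nothing MEEP-specific). It streams
# data through arrays far larger than any CPU cache, so its speed is
# limited by main-memory bandwidth rather than by compute.
import time
import numpy as np

N = 50_000_000                 # ~400 MB per float64 array
a = np.ones(N)
b = np.ones(N)
c = np.empty_like(a)

for _ in range(5):
    t0 = time.perf_counter()
    np.add(a, b, out=c)        # streaming reads of a and b, streaming write of c
    dt = time.perf_counter() - t0
    traffic_gb = 3 * N * 8 / 1e9
    print(f"{dt:.3f} s/pass  ~{traffic_gb / dt:.1f} GB/s")
```

If the per-pass time of each copy grows as more copies are started, the memory subsystem (shared L3 and memory controller) is the ceiling, not the individual cores.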

The simulations are fairly simple, and I've run programs which are much more memory-hungry than this one and had great speedup with parallelization.

Any tips to help me understand this phenomenon?

  • You should use an appropriate tool to read hardware performance counters that can indicate shared resources as the bottleneck. There are many [parallel](https://stackoverflow.com/questions/10607750/tools-to-measure-mpi-communication-costs/10608276) [tools](https://stackoverflow.com/questions/18191635/good-profiler-for-fortran-and-mpi/18205748). But if you just run the applications separately, you could use something simple like [perf](https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record). – Zulan Oct 13 '16 at 15:00
  • Your analysis is a good start, but there are other possible issues: reduced turbo frequency, hyperthreading / AMD bulldozer 'modules'. But without more specific system and application information it is impossible to tell. – Zulan Oct 13 '16 at 15:02
  • What's the data length? 100 MB? Try making all cores work on only a 1 MB part, then, when finished, continue on another part. This way, when one core accesses a cell, another core can reuse it (as a neighbour of one of its own cells) while it is still in cache. – huseyin tugrul buyukisik Oct 13 '16 at 17:52
  • You can make a quick check of what could be going on there using the `perf` Linux event profiler (a minimal sketch is shown just after these comments). Do you know if the simulation relies on file access? If several processes are competing for exclusive file access, it might also produce serialization at the inter-process level. – Jorge Bellon Oct 13 '16 at 20:13
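Following up on the `perf` suggestions in the comments, one quick check is to count last-level-cache events for a single run. A minimal sketch, assuming `perf` is installed and `./my_sim` is a placeholder for however one MEEP instance is normally launched (the generic event names below may not all be available on every CPU/kernel):

```python
# Run one simulation under `perf stat`, counting last-level-cache events.
# "./my_sim" is a placeholder for the actual simulation command; the event
# list uses perf's generic aliases and may need adjusting for your CPU.
import subprocess

events = "cache-references,cache-misses,LLC-loads,LLC-load-misses"
subprocess.run(["perf", "stat", "-e", events, "./my_sim"], check=True)
```

Doing this once with nothing else running and once while the other instances are active shows whether the miss rate (cache-misses per cache-reference) jumps in the second case, i.e. whether the processes are evicting each other's data from the shared L3 and competing for memory bandwidth.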

2 Answers

0

I think it would be better to look at bigger simulations, because the well-known issue with turbo-boost-like technologies (single-core performance changing with the number of active threads) cannot explain your result. It would only explain it if you had a single-core processor.

So I think this can be explained by the memory cache levels. Maybe try simulations much bigger than the L3 cache (> 8 MB for an i7).
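As a rough back-of-the-envelope check of when a run stops fitting in cache (the 6 components × 8 bytes per cell below is an assumption, not MEEP's exact storage layout):

```python
# Rough working-set estimate for a cubic 3D FDTD grid, assuming 6 field
# components (Ex, Ey, Ez, Hx, Hy, Hz) stored as 8-byte doubles per cell.
# MEEP's real layout (extra material/PML arrays, etc.) only makes it bigger.
def working_set_mb(n):
    return n**3 * 6 * 8 / 1e6

for n in (40, 55, 100):
    print(f"{n}^3 cells  ->  {working_set_mb(n):5.1f} MB")
# 40^3  ->   3.1 MB  (fits in an 8 MB L3)
# 55^3  ->   8.0 MB  (about the size of the L3)
# 100^3 ->  48.0 MB  (has to stream from RAM on every time step)
```

Once the fields no longer fit in the L3, every time step has to pull them from main memory, and several independent instances then compete for the same memory bandwidth.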

0

My test was on an Intel(R) Core(TM) i7-3517U CPU @ 1.90 GHz, dual core (4 threads). All simulations were run with a single MPI process (`-np 1`).

10 MB simulation:

  • Four simulations: 0.0255 s/step
  • Two simulations: 0.0145 s/step
  • One simulation: 0.0129 s/step

100 MB simulation:

  • Four simulations: 1.13 s/step
  • Two simulations: 0.61 s/step
  • One simulation: 0.53 s/step

A curious thing is that two simulations with 2 threads each run at almost the same speed as two simulations with 1 thread each.
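One way to read these numbers is as total throughput, simply dividing the number of running copies by the reported seconds per step (a small helper, using only the timings listed above):

```python
# Aggregate throughput (time steps per second summed over all running
# copies), computed from the per-copy timings reported above.
timings = {
    "10 MB":  {1: 0.0129, 2: 0.0145, 4: 0.0255},   # seconds per step per copy
    "100 MB": {1: 0.53,   2: 0.61,   4: 1.13},
}
for size, runs in timings.items():
    for copies, sec_per_step in runs.items():
        print(f"{size}: {copies} copies -> {copies / sec_per_step:6.1f} steps/s total")
```

In both cases the total rate roughly doubles going from one to two copies but barely moves from two to four, which is what you would expect on a machine with only two physical cores, where the extra hyperthreads share each core's execution resources, cache, and memory bandwidth.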