I wrote a C++ program without anticipating that I would later need to multithread it. I have now parallelized the 3 main for loops with OpenMP. Here are the performance comparisons (as measured with time from bash):

Single thread

real    5m50.008s
user    5m49.072s
sys     0m0.877s

Multi thread (24 threads)

real    1m22.572s
user    28m28.206s
sys     0m4.170s

Using 24 threads has reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster, but I did not really know what to expect.
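To make the comparison explicit, here is a minimal sketch that turns the timings above into a speedup, a parallel efficiency, and the growth in total CPU time. The constant and function names are purely illustrative; only the numbers come from the `time` output.

```cpp
// Timings copied from the `time` output above, converted to seconds.
constexpr double kSerialReal   = 5 * 60 + 50.008;   // real, single thread
constexpr double kParallelReal = 1 * 60 + 22.572;   // real, 24 threads
constexpr double kSerialUser   = 5 * 60 + 49.072;   // user, single thread
constexpr double kParallelUser = 28 * 60 + 28.206;  // user, 24 threads
constexpr int    kThreads      = 24;

// Speedup: how many times shorter the wall-clock time became.
constexpr double speedup(double serial_real, double parallel_real) {
    return serial_real / parallel_real;
}

// Parallel efficiency: speedup per thread (1.0 would be perfect scaling).
constexpr double efficiency(double achieved_speedup, int threads) {
    return achieved_speedup / threads;
}

// CPU-time inflation: total user time summed over all threads,
// relative to the user time of the single-threaded run.
constexpr double cpu_inflation(double serial_user, double parallel_user) {
    return parallel_user / serial_user;
}
```

With these inputs the speedup is about 4.24, the efficiency about 0.18, and the total user time is about 4.9 times that of the serial run, so the threads appear to spend a large share of their CPU time on something other than the original work.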

- Is there a rule of thumb for predicting how much faster a given code will run with n threads than with a single thread?

- Are there general tips for improving the performance of multithreaded code?

Remi.b

1 Answer

I'm sure you know the obvious points, such as the cost of barriers, but it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use; if I think of more, I'll add them:

  • Keep data in thread-private variables for as long as possible. Consider this even for reductions: accumulate privately and combine into shared results only a small number of times.

  • Prefer long parallel regions that contain several work-shared loops (#pragma omp parallel ... #pragma omp for) over parallelizing each loop separately (#pragma omp parallel for).

  • Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole nest using collapse, be aware that OpenMP linearizes it into a single fused index, and recovering the separate indices from it incurs overhead.

  • Use thread-private heaps. Avoid sharing pools and collections if possible, even when different threads would access different members of the collection independently.

  • Profile your code to see how much time is spent busy-waiting and where that occurs.

  • Learn the consequences of the different schedule strategies. Measure which one works better; don't assume.

  • If you use critical sections, name them. All unnamed critical sections are treated as one and the same, so they have to wait for each other.

  • If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.

  • Browse similar questions on Stack Overflow, e.g., the wonderful answers here.
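Several of the tips above (one long parallel region, thread-private accumulation, named critical sections) can be combined in a short sketch. The function `squared_stats` and the struct `Stats` are hypothetical names invented for illustration, not taken from the question's code. Compiled without OpenMP support, the pragmas are ignored and the function simply runs serially.

```cpp
#include <vector>

// Hypothetical result type: sum and maximum of the squared inputs.
struct Stats {
    double sum;
    double max;
};

// Squares every element of `in` into `out`, then gathers the sum and the
// maximum of the squares.
Stats squared_stats(const std::vector<double>& in, std::vector<double>& out) {
    const long n = static_cast<long>(in.size());
    out.resize(in.size());          // allocate before entering the region
    Stats result{0.0, 0.0};

    // One long parallel region holding two work-shared loops, instead of
    // two separate `parallel for` constructs.
    #pragma omp parallel
    {
        // Thread-private accumulators: nothing is shared in the hot loop.
        double local_sum = 0.0;
        double local_max = 0.0;

        #pragma omp for
        for (long i = 0; i < n; ++i) {
            out[i] = in[i] * in[i];
        }
        // The implicit barrier above guarantees `out` is fully written.

        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            local_sum += out[i];
            if (out[i] > local_max) local_max = out[i];
        }

        // Each thread publishes its partial result exactly once. The two
        // critical sections have different names, so a thread updating the
        // sum does not block a thread updating the maximum.
        #pragma omp critical(stats_sum)
        result.sum += local_sum;

        #pragma omp critical(stats_max)
        if (local_max > result.max) result.max = local_max;
    }
    return result;
}
```

For a case this simple, the `reduction` clause would express the same thing more compactly; the manual version is shown only to make the private-then-combine pattern and the named critical sections visible.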

The Vee