
To learn more about thread spawn overhead in OpenMP, I experimented with an empty nested loop, with the inner loop parallelized, and compared it to the same nested loop without any parallelization.

#include <ctime>
#include <iostream>

int main() {
    std::time_t start = std::time(nullptr);
    for (long long i = 0; i < (long long)1e8; i++)
#pragma omp parallel for                                    // parallelize the empty inner loop
        for (long long j = 0; j < (long long)1e2; j++) {}
    std::cout << std::time(nullptr) - start << std::endl;  // elapsed time in seconds
}

In this particular case it took 79 seconds to run through the loop, and 14 seconds to run through the same loop without the inner loop being parallelized (I just commented out the directive). The larger the inner parallel loop and the smaller the outer loop, the less time the code takes to execute. With the outer loop at 1e7 iterations and the inner one at 1e3, the parallelized and the sequential versions take comparable time. With the outer loop at 1e6 iterations and the inner one at 1e4, the parallel version runs in 4 seconds. For the sequential version, the execution time always stayed the same no matter what the ratio between the iteration counts was.
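
As a side note on the measurement itself (my own sketch, not part of the original experiment): std::time has one-second resolution, so for the faster configurations a finer timer such as omp_get_wtime() would give a more precise reading of the same loop nest, e.g. for the 1e6/1e4 case:

#include <iostream>
#include <omp.h>

int main() {
    double start = omp_get_wtime();                 // wall-clock seconds, sub-second resolution
    for (long long i = 0; i < (long long)1e6; i++)
#pragma omp parallel for
        for (long long j = 0; j < (long long)1e4; j++) {}
    std::cout << omp_get_wtime() - start << " s" << std::endl;
}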

I haven't managed to find anything describing thread creation overhead in detail. Still, this experiment shows that the overhead can be really significant.

What are the means to avoid or minimise such overhead, apart from moving the parallel section to the outer loop? Are there any workarounds provided by OpenMP, or maybe by other sources?
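
To illustrate, this is roughly what I mean by moving the parallel section to the outer loop (a sketch only; all the timings above were taken with the inner-loop version):

#include <ctime>
#include <iostream>

int main() {
    std::time_t start = std::time(nullptr);
#pragma omp parallel for                        // a single parallel region for the whole nest
    for (long long i = 0; i < (long long)1e8; i++)
        for (long long j = 0; j < (long long)1e2; j++) {}
    std::cout << std::time(nullptr) - start << std::endl;
}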

Kaiyakha
  • What optimization level did you use? Without the `parallel for`, any competent compiler would have removed that loop since it doesn't compute anything. Not sure whether the same holds with it, but it might very well. Considering that your sequential time is not zero, I'm guessing you used a low optimization level, which means that you may be measuring loop overhead rather than thread overhead. – Victor Eijkhout Apr 07 '22 at 20:00
  • @VictorEijkhout all optimizations are disabled to avoid the empty section being removed – Kaiyakha Apr 07 '22 at 20:02
  • @VictorEijkhout loop overhead does not seem to be the case, since for the sequential version the execution time stays the same no matter what the ratio between the number of cycles in the inner and outer loops is, which I mentioned as well. Or, in the case of extreme values, like swapping `1e1` and `1e9`, the time does not change that drastically – Kaiyakha Apr 07 '22 at 20:06
  • It does not make much sense to disable optimization in a benchmark measuring time. Moreover, the time taken by these loops should be extremely small on a modern processor. I expect something like <1 ns per iteration, so <100 ns for the j-based for. There is no way multithreading can speed up this code. Using multiple cores has a fundamental overhead that is independent of OpenMP: moving data between cores takes time, the OS scheduler has a high latency, and Amdahl's law applies here. Using multiple threads for parallel tasks below 1 µs is clearly a dead end. – Jérôme Richard Apr 07 '22 at 23:27
  • @JérômeRichard the goal was to measure the overhead, not to speed up the code – Kaiyakha Apr 08 '22 at 10:54
  • I understand, but even with that, the overhead is probably not realistic due to the possible contention created by the work sharing (atomic accesses are generally slower when many threads access them, especially when they update them; a saturation of the bus might also be possible). Besides this, I do not think it is reasonable for an application to do an `omp parallel` in such a case, though an `omp for` is possible (critical case). Moreover, disabling optimizations can also impact the overhead of OpenMP calls. – Jérôme Richard Apr 08 '22 at 11:15
  • Note that you can change the scheduling of the loop. A static for is very fast, but it forces all threads to perform part of the work (as long as there is enough work), so you pay the cost of thread scheduling, work sharing, etc. You can use a more dynamic scheduling policy to avoid that, but in that case it will probably have a bigger overhead (especially due to contention). You should also care about the values of `OMP_WAIT_POLICY`, `OMP_PROC_BIND` and `OMP_PLACES`. – Jérôme Richard Apr 08 '22 at 11:19 (see the sketch below for the `omp for` + scheduling variant)
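
A minimal sketch of the variant the last two comments describe, as I understand it (my own illustration, not code posted by the commenters): the parallel region is entered once, and only the work-sharing `omp for` with an explicit `schedule` clause is encountered in each outer iteration, so the per-iteration cost is the work distribution and the implicit barrier rather than a full parallel region:

#include <ctime>
#include <iostream>

int main() {
    std::time_t start = std::time(nullptr);
#pragma omp parallel                              // the thread team is created once
    for (long long i = 0; i < (long long)1e8; i++)
#pragma omp for schedule(static)                  // only work sharing + implicit barrier per outer iteration
        for (long long j = 0; j < (long long)1e2; j++) {}
    std::cout << std::time(nullptr) - start << std::endl;
}

The `OMP_WAIT_POLICY`, `OMP_PROC_BIND` and `OMP_PLACES` environment variables mentioned above are set outside the program (e.g. in the shell) and influence how idle threads wait and where they are pinned, which in turn affects this per-iteration overhead.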

0 Answers