C++ code compiled using Homebrew's Clang on macOS Apple Silicon runs significantly slower with OpenMP than without

Question

I am trying to compile OpenMP-enabled C++ code on macOS Monterey (12.5.1) on Apple Silicon (Apple M1 Max) to get the respective speed-ups.

The code is compute-heavy, well understood, and has used OpenMP for years (e.g. on x64 Ubuntu) without problems. The computations are more or less embarrassingly parallel, so the speed-ups from running them multi-threaded via OpenMP are significant.

When I compile the code on macOS with Apple Clang, everything works, except that the code of course runs single-threaded, since Apple Clang does not support OpenMP out of the box.

That's why I have now compiled everything with Homebrew's Clang to enable OpenMP. Unfortunately, the results are not what I expected. The code compiles and uses OpenMP: I had to fix a few minor issues with firstprivate clauses where Clang is stricter than GCC, and I can see the execution using multiple threads.
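A quick way to double-check that a given build really has OpenMP enabled is a small probe like the one below. It is only a sketch: `OpenMPStatus`, `probe_openmp` and `print_openmp_status` are hypothetical names, not part of the project. It reports the `_OPENMP` macro (0 if the compiler ignored the pragmas, as a build without OpenMP support would) and the team size observed inside a parallel region.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper type: describes how this translation unit was built.
struct OpenMPStatus {
    long spec_date;  // value of _OPENMP (yyyymm), or 0 when built without OpenMP
    int team_size;   // threads observed inside a parallel region (1 if serial)
};

OpenMPStatus probe_openmp() {
    OpenMPStatus s{0, 1};
#ifdef _OPENMP
    s.spec_date = _OPENMP;
    #pragma omp parallel
    {
        // Exactly one thread of the team records the team size.
        #pragma omp single
        s.team_size = omp_get_num_threads();
    }
#endif
    return s;
}

void print_openmp_status() {
    const OpenMPStatus s = probe_openmp();
    std::printf("_OPENMP=%ld, team size=%d\n", s.spec_date, s.team_size);
}
```

Calling `print_openmp_status()` once at start-up in each build shows whether the pragmas are honoured and how many threads the runtime hands out by default.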

However, the run with OpenMP is significantly slower than the one without. For instance, the computation takes ~15 seconds without OpenMP and ~90 seconds with it, whereas even with worst-case scaling I'd expect a speed-up of 2x-4x. The results of the computation are otherwise correct; only the speed is affected.

I have tried to compile the software with both llvm 14 (stable 14.0.6 (bottled)) and llvm 13 (stable 13.0.1 (bottled), tried due to a suspected regression) but without success so far.

The project is using CMake and the configuration looks as follows (this one is with LLVM 13, but I had tried with the default llvm beforehand):

export PATH="/opt/homebrew/opt/llvm@13/bin:$PATH"
...
cmake ../../ -DCMAKE_BUILD_TYPE=Release  \
  -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/llvm@13   \
  -DCMAKE_CXX_COMPILER=/opt/homebrew/opt/llvm@13/bin/clang++   \
  -DCMAKE_C_COMPILER=/opt/homebrew/opt/llvm@13/bin/clang   \
  -DLDFLAGS="-L/opt/homebrew/opt/llvm@13/lib"   \
  -DCPPFLAGS="-I/opt/homebrew/opt/llvm@13/include"

I would be glad for any advice on how to resolve this issue. Thanks in advance!

--

Update 1:

When OMP_DISPLAY_ENV is enabled, the following is shown:

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='1'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED: deprecated; max-active-levels-var=1
  [host] OMP_NUM_TEAMS='0'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='8176k'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_TEAMS_THREAD_LIMIT='0'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_TOOL_VERBOSE_INIT: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
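The same settings can also be read from inside the program, which helps confirm that the binary under test sees the environment that was intended. The sketch below uses standard OpenMP runtime calls (`omp_get_max_threads`, `omp_get_dynamic`, `omp_get_proc_bind`, `omp_get_max_active_levels`); the `RuntimeSettings` struct and the two functions are hypothetical names introduced for illustration.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical container for a few runtime settings.
// The defaults are placeholders describing a build without OpenMP.
struct RuntimeSettings {
    int max_threads = 1;        // upper bound on the next team size
    bool dynamic = false;       // dynamic adjustment of thread counts
    int proc_bind = 0;          // numeric omp_proc_bind_t value (0 means "false")
    int max_active_levels = 0;  // nesting depth that may run in parallel
};

RuntimeSettings query_runtime_settings() {
    RuntimeSettings s;
#ifdef _OPENMP
    s.max_threads = omp_get_max_threads();
    s.dynamic = omp_get_dynamic() != 0;
    s.proc_bind = static_cast<int>(omp_get_proc_bind());
    s.max_active_levels = omp_get_max_active_levels();
#endif
    return s;
}

void print_runtime_settings() {
    const RuntimeSettings s = query_runtime_settings();
    std::printf("max_threads=%d dynamic=%d proc_bind=%d max_active_levels=%d\n",
                s.max_threads, s.dynamic ? 1 : 0, s.proc_bind, s.max_active_levels);
}
```

If the printed values differ from what `OMP_DISPLAY_ENV` reports, the program is probably changing them at runtime or running in a different environment than expected.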

Once I set OMP_NUM_THREADS to a specific number, my test case yields the following runtimes (time via Boost's cpu_timer). One can clearly see that the code is fastest when running on a single thread (set via OMP_NUM_THREADS=1), even though it has been compiled with OpenMP support.

OMP_NUM_THREADS=1: 15.85s wall, 30.97s user + 0.56s system = 31.53s CPU (198.9%)
OMP_NUM_THREADS=2: 28.91s wall, 69.93s user + 0.71s system = 70.64s CPU (244.3%)
OMP_NUM_THREADS=4: 36.63s wall, 134.07s user + 1.47s system = 135.54s CPU (370.0%)
OMP_NUM_THREADS=8: 52.18s wall, 267.80s user + 12.35s system = 280.15s CPU (536.9%)
OMP_NUM_THREADS=10: 52.29s wall, 285.78s user + 10.57s system = 296.35s CPU (566.8%)
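In the listing above, total CPU time grows from 31.53 s with one thread to 280.15 s with eight, while wall time rises as well, so the extra threads appear to add work rather than share it. To separate the toolchain from the application, one option is to time a loop whose iterations are known to be independent and compare wall time with CPU time. The following is a minimal sketch under assumptions: `Timing` and `time_independent_work` are hypothetical names, the `sqrt` workload is a stand-in for the real computation, and `std::clock` is assumed to report CPU time summed over all threads, as it does on POSIX systems.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <ctime>

// Hypothetical result type for one timed run.
struct Timing {
    double wall_s;    // elapsed wall-clock seconds
    double cpu_s;     // process CPU seconds (all threads combined on POSIX)
    double checksum;  // loop result, returned so the work cannot be optimized away
};

Timing time_independent_work(long long n, int threads) {
    assert(n > 0 && threads > 0);
    (void)threads;  // unused when compiled without OpenMP

    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    // Every iteration is independent; the reduction gives each thread
    // a private partial sum, so no shared variable is written in the loop.
    double sum = 0.0;
    #pragma omp parallel for num_threads(threads) reduction(+ : sum) schedule(static)
    for (long long i = 0; i < n; ++i) {
        sum += std::sqrt(static_cast<double>(i));
    }

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_s = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_s = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    t.checksum = sum;
    return t;
}
```

If this loop gets faster with more threads for a large `n` while the real code gets slower, the OpenMP runtime itself is probably working, and the problem is more likely in how the application's parallel regions share data.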

Update 2:

System load from asitop (https://github.com/tlkh/asitop):

[Screenshots not preserved: asitop system load for OMP_NUM_THREADS=1 and for OMP_NUM_THREADS=8]

Did you try using only a few cores, e.g. setting `OMP_NUM_THREADS` to 2, and checking with `OMP_DISPLAY_ENV` that the configuration is applied properly? It may also be a good idea to check that the execution happens on the big cores and not on the little (i.e. energy-efficient) ones, which are much slower. Using `OMP_PROC_BIND` and `OMP_PLACES` may help with that ([this post](https://stackoverflow.com/questions/71340798/problem-of-sorting-openmp-threads-into-numa-nodes-by-experiment/71343253#71343253) may help too). Can you provide an example of the tested code? I wonder if the code is bound by synchronization. — Jérôme Richard, Sep 03 '22 at 11:11
(Also note that the code needs to be compiled for ARM and not x86. The latter might work with Rosetta but the overhead can be pretty big in some cases.) — Jérôme Richard, Sep 03 '22 at 11:13
Thanks for the pointers, I will try that. Yes, the code is compiled for ARM and also was not bound by synchronization (at least on Intel hardware). — Stefan, Sep 03 '22 at 13:35
I have updated the question with the respective output and runtimes. — Stefan, Sep 03 '22 at 13:49
Have you already tried compiling it with g++ (installed via `brew install gcc libomp`) instead of LLVM Clang? Note that the GNU compiler is installed as g++-12, since g++ is just an alias for Apple Clang on macOS. I don't own an M1 Mac, but I vaguely remember having a similar problem once with LLVM Clang and OpenMP. — joni, Sep 03 '22 at 14:44
Based on the provided results, it looks like there is a contention issue on a shared resource (e.g. a spin lock or a cache line). The use of all P-cores shows that the threads are active rather than passively waiting, and that they are apparently running on the right cores. This is not an active wait in the runtime, since `OMP_WAIT_POLICY` is `PASSIVE`, so the cores are busy running a user-defined part of the code, not the runtime's code (unless something went wrong during initialization). Can you interrupt the program and see what the threads are doing (e.g. with a debugger)? — Jérôme Richard, Sep 04 '22 at 00:16
@joni: Thanks for the hint. Unfortunately, compiling with GCC is a beast of its own. I tried, but at the moment I have issues all over the place: Homebrew's Boost requires recompilation with GCC, but that does not work, etc. — Stefan, Sep 04 '22 at 14:43
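One form of the contention mentioned in the comments is false sharing, where threads update distinct variables that sit in the same cache line. The sketch below only illustrates the pattern; whether it applies to the code in the question is unknown, since that code is not shown. `PackedCounter`, `PaddedCounter` and `increment_per_thread` are hypothetical names, and the 128-byte spacing is an assumption about the cache-line size on Apple Silicon that should be checked for the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Counters stored back to back: neighbouring threads write to the same cache line.
struct PackedCounter {
    volatile long value = 0;  // volatile keeps every increment as a real store
};

// Counters spaced 128 bytes apart (assumed cache-line size; adjust per platform),
// so each thread's counter lives in its own line.
struct PaddedCounter {
    volatile long value = 0;
    char padding[128 - sizeof(long)];
};

// Each thread increments only its own counter; returns the sum of all counters.
template <typename Counter>
long increment_per_thread(int threads, long iterations) {
    assert(threads > 0 && iterations > 0);
    std::vector<Counter> counters(static_cast<std::size_t>(threads));

    #pragma omp parallel num_threads(threads)
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;  // serial build: a single "thread"
#endif
        Counter& mine = counters[static_cast<std::size_t>(tid)];
        for (long i = 0; i < iterations; ++i) {
            mine.value = mine.value + 1;
        }
    }

    long total = 0;
    for (const Counter& c : counters) total += c.value;
    return total;
}
```

Timing `increment_per_thread<PackedCounter>` against `increment_per_thread<PaddedCounter>` with the same arguments gives a rough idea of how much such sharing costs on a given machine.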
