I am trying to compile OpenMP-enabled C++ code on macOS Monterey (12.5.1) on Apple Silicon (Apple M1 Max) to get the respective speed-ups.
The code is compute-heavy, well understood, and has used OpenMP for years (e.g. on x64 Ubuntu) without problems; the computations are more or less embarrassingly parallel, so the speed-ups from running them multi-threaded via OpenMP are significant.
When I compile the code on macOS with Apple Clang, everything works, except that the code of course runs single-threaded, since Apple Clang does not support OpenMP.
That's why I have now compiled everything with Homebrew's Clang to enable OpenMP. Unfortunately, the results are not what I expected. The code compiles and uses OpenMP: I had to fix a few minor issues with firstprivate clauses where Clang is stricter than GCC, and I can see the execution using multiple threads.
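A minimal sketch of the kind of check that confirms OpenMP is really active in a build (this is not my project code; the function names are hypothetical and only for illustration). It reports the `_OPENMP` macro and counts how many threads enter a parallel region:

```cpp
#include <cassert>
#include <cstdio>

// Returns the value of the _OPENMP macro, or 0 if this translation unit
// was compiled without OpenMP support (e.g. without -fopenmp).
long openmp_version() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Counts how many threads actually execute a parallel region.
// Without OpenMP the pragma is ignored and the result is 1.
int threads_in_parallel_region() {
    int count = 0;
    #pragma omp parallel reduction(+ : count)
    count += 1;
    assert(count >= 1);
    return count;
}

// Prints both values; hypothetical helper, to be called from main().
void report_openmp() {
    std::printf("_OPENMP = %ld, threads in parallel region = %d\n",
                openmp_version(), threads_in_parallel_region());
}
```

Calling `report_openmp()` from `main()` in a binary built with the same compiler and flags as the project shows whether the pragma is honoured and how many threads the runtime starts by default.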
However, the runtime with OpenMP is significantly slower than without OpenMP. For instance, the computation takes ~15 seconds without OpenMP and ~90 seconds with OpenMP, whereas even in the worst scaling case I'd expect a speed-up of 2x-4x. The results of the computation are otherwise correct; only the speed seems to be affected.
I have tried to compile the software with both llvm 14 (stable 14.0.6 (bottled)) and llvm 13 (stable 13.0.1 (bottled), tried due to a suspected regression) but without success so far.
The project uses CMake, and the configuration looks as follows (this one is with LLVM 13, but I had tried the default llvm beforehand):
export PATH="/opt/homebrew/opt/llvm@13/bin:$PATH"
...
cmake ../../ -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_PREFIX_PATH=/opt/homebrew/opt/llvm@13 \
-DCMAKE_CXX_COMPILER=/opt/homebrew/opt/llvm@13/bin/clang++ \
-DCMAKE_C_COMPILER=/opt/homebrew/opt/llvm@13/bin/clang \
-DLDFLAGS="-L/opt/homebrew/opt/llvm@13/lib" \
-DCPPFLAGS="-I/opt/homebrew/opt/llvm@13/include"
I would be glad for any advice on how to resolve this issue. Thanks in advance!
--
Update 1:
When OMP_DISPLAY_ENV is enabled, the following is shown:
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201611'
[host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
[host] OMP_ALLOCATOR='omp_default_mem_alloc'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DEFAULT_DEVICE='0'
[host] OMP_DISPLAY_AFFINITY='FALSE'
[host] OMP_DISPLAY_ENV='TRUE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='1'
[host] OMP_MAX_TASK_PRIORITY='0'
[host] OMP_NESTED: deprecated; max-active-levels-var=1
[host] OMP_NUM_TEAMS='0'
[host] OMP_NUM_THREADS: value is not defined
[host] OMP_PROC_BIND='false'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='8176k'
[host] OMP_TARGET_OFFLOAD=DEFAULT
[host] OMP_TEAMS_THREAD_LIMIT='0'
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_TOOL='enabled'
[host] OMP_TOOL_LIBRARIES: value is not defined
[host] OMP_TOOL_VERBOSE_INIT: value is not defined
[host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
Once I set OMP_NUM_THREADS to a specific number, my test case yields the following runtimes (measured with Boost's cpu_timer). The code is clearly fastest when running on a single thread (OMP_NUM_THREADS=1), even though it has been compiled with OpenMP support.
- OMP_NUM_THREADS=1:
15.85s wall, 30.97s user + 0.56s system = 31.53s CPU (198.9%)
- OMP_NUM_THREADS=2:
28.91s wall, 69.93s user + 0.71s system = 70.64s CPU (244.3%)
- OMP_NUM_THREADS=4:
36.63s wall, 134.07s user + 1.47s system = 135.54s CPU (370.0%)
- OMP_NUM_THREADS=8:
52.18s wall, 267.80s user + 12.35s system = 280.15s CPU (536.9%)
- OMP_NUM_THREADS=10:
52.29s wall, 285.78s user + 10.57s system = 296.35s CPU (566.8%)
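To make the trend explicit, the speed-up relative to the single-thread run (wall time at 1 thread divided by wall time at n threads) and the parallel efficiency (speed-up divided by n) can be computed from the figures above. A minimal sketch; the struct and function names are hypothetical and not part of my project:

```cpp
#include <cassert>
#include <cstdio>

// Speed-up of an n-thread run relative to the single-thread run.
double speedup(double wall_1, double wall_n) {
    assert(wall_n > 0.0);
    return wall_1 / wall_n;
}

// Parallel efficiency: speed-up divided by thread count (1.0 would be ideal).
double efficiency(double wall_1, double wall_n, int threads) {
    assert(threads > 0);
    return speedup(wall_1, wall_n) / threads;
}

struct Run {
    int threads;
    double wall_s;  // wall-clock seconds
    double cpu_s;   // user + system CPU seconds
};

// Measurements copied from the list above.
const Run kRuns[] = {
    {1, 15.85, 31.53},  {2, 28.91, 70.64},   {4, 36.63, 135.54},
    {8, 52.18, 280.15}, {10, 52.29, 296.35},
};

// Prints speed-up, efficiency and CPU-time growth for every run,
// each relative to the single-thread run in kRuns[0].
void print_scaling() {
    const Run& base = kRuns[0];
    for (const Run& r : kRuns) {
        std::printf("%2d threads: speed-up %.2fx, efficiency %.1f%%, CPU time %.1fx\n",
                    r.threads,
                    speedup(base.wall_s, r.wall_s),
                    100.0 * efficiency(base.wall_s, r.wall_s, r.threads),
                    r.cpu_s / base.cpu_s);
    }
}
```

Calling `print_scaling()` from `main()` prints one line per run. For these measurements every multi-threaded speed-up is below 1, i.e. a slowdown, while total CPU time grows roughly ninefold between 1 and 10 threads.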
Update 2:
System load from asitop (https://github.com/tlkh/asitop):