I'm currently profiling a heavily multi-threaded C++ application running under Visual Studio 2017. According to Task Manager, my program keeps all CPU cores and threads at 100% for the processing task being profiled. The profiler shows ~33% of the time in my code and ~60% of the time in KernelBase.dll. My code isn't making any kernel calls, so I'm wondering whether this 60% is being lost to context switching, or whether it is simply a side effect of using the profiler on an OpenMP (OMP) application. To test this I'm thinking of writing a very simple multi-threaded application to see if it gives similar results, but I wanted to know if there are any other factors I should be aware of.
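A minimal sketch of the kind of test application I have in mind, assuming the overhead comes from many short parallel regions. The function name `run_short_regions`, the loop body and the sizes are made up for illustration and are not taken from my real code:

```cpp
#include <chrono>
#include <vector>

// Hypothetical micro-benchmark: runs `regions` short parallel loops over `data`
// and returns the elapsed wall-clock time in seconds. Each loop is deliberately
// small so that thread start-up, waiting and barrier costs dominate the work.
double run_short_regions(std::vector<double>& data, int regions) {
    // Signed loop index: the OpenMP 2.0 support in MSVC requires it.
    const int n = static_cast<int>(data.size());
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < regions; ++r) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            data[i] = data[i] * 0.5 + 1.0;  // trivial per-element work
        }
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling this on a small vector (say a few thousand elements) with a large `regions` count and profiling the result should show whether the kernel time appears even when the loop body does almost nothing. Without the OpenMP compiler switch the pragma is ignored and the same code runs serially, which gives a baseline.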
Edit: Following on from the comment below, I've included all the kernel symbols and it appears that NtYieldExecution is the problem. There is a related question here but no responses. I'll do a bit more googling around this and see what shows up.
Edit2: As per Jérôme Richard's comments, I tried setting OMP_WAIT_POLICY to ACTIVE and PASSIVE. ACTIVE gave the results I had initially. PASSIVE removed all the NtYieldExecution calls, but total processor usage across all threads maxed out at ~35%; with ACTIVE it was 100%, with most of the time wasted yielding cycles.
A longer test using both alternatives took 60.3 seconds with PASSIVE and 71.8 seconds with ACTIVE. Basically, this seems to be telling me that the MS implementation of OMP is not efficient for the granularity of threading that I'm using. Time to look at some other options.
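One option for coarsening the granularity, sketched here under the assumption that the work consists of repeated passes over the same data (the function `sweep_with_one_team` and its workload are hypothetical): open a single parallel region and repeat only the work-sharing loop inside it, so the thread team is entered once instead of once per pass.

```cpp
#include <vector>

// Hypothetical workload: `passes` sweeps over `data`, each sweep a parallel loop.
// The parallel region is entered once; only the inner `omp for` is repeated.
void sweep_with_one_team(std::vector<double>& data, int passes) {
    const int n = static_cast<int>(data.size());
    #pragma omp parallel
    {
        // `p` is declared inside the region, so every thread has its own copy
        // and all threads reach the work-sharing loop the same number of times.
        for (int p = 0; p < passes; ++p) {
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                data[i] += 1.0;
            }
            // The implicit barrier at the end of `omp for` keeps passes ordered.
        }
    }
}
```

This only removes the repeated entry and exit of the parallel region. Threads still wait at the barrier after every pass, so I have not established whether it would reduce the yielding seen in the profiler.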
Edit3: After testing various scheduling combinations and chunk sizes, I'm getting slightly better results, with ~60% CPU usage, using OMP_WAIT_POLICY=PASSIVE and
#pragma omp for schedule(dynamic, 64)
This is after trying static, dynamic and guided schedules, with the chunk size left blank or set to 16, 64, 256 and 1024, for both OMP_WAIT_POLICY=PASSIVE and OMP_WAIT_POLICY=ACTIVE. I was hoping for bigger improvements from tuning the chunk size, but the actual variation from best to worst was only around 20%, with OMP_WAIT_POLICY=PASSIVE slightly faster while using less overall CPU (and also, at a guess, less power).
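For anyone repeating this kind of sweep, a sketch of how the schedule can be chosen without recompiling: `schedule(runtime)` makes the loop read its schedule and chunk size from the standard `OMP_SCHEDULE` environment variable (for example `OMP_SCHEDULE=dynamic,64`). The function `sum_squares_runtime_schedule` and its loop body are placeholders, not my actual workload:

```cpp
// Placeholder loop: sums i*i over [0, n). The schedule kind and chunk size are
// taken at run time from the OMP_SCHEDULE environment variable, e.g.
//   OMP_SCHEDULE=static   OMP_SCHEDULE=dynamic,64   OMP_SCHEDULE=guided,256
double sum_squares_runtime_schedule(int n) {
    double total = 0.0;
    #pragma omp parallel for schedule(runtime) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<double>(i) * i;
    }
    return total;
}
```

If `OMP_SCHEDULE` is not set, the schedule used is implementation-defined, so it is worth setting it explicitly for every timing run.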
Edit4: Recompiled using Visual Studio 2022, 17.5.318, which resulted in a disappointing 10% drop in performance. I tried this with and without /openmp:llvm, which didn't make an appreciable difference. See this question and this post for related info.