Q :
" ... how can I further improve the efficiency of this case? "
A :
follow the "economy-of-costs" : some costs are visible, some less so, yet all of them decide the resulting ( im )performance & ( in )efficiency.
How much did we already have to pay, before any PAR-section even started?
Let's start with the costs of the proposed addition of multiple streams of code-execution :

Using the number of ASM-instructions as a simplified measure of how much work has to be done ( where all of the CPU-clocks + RAM-allocation costs + RAM-I/O + O/S system-management time spent count ), we start to see the relative weight of all these ( unavoidable ) add-on costs, compared to the actual useful task ( i.e. how many ASM-instructions are finally spent on what we want to compute, contrasted with the amount of overhead costs already burnt, that were needed to make this happen ).
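These add-on costs need not stay abstract, they can be measured right on the target platform. A minimal sketch ( the workload and the timing harness are illustrative assumptions, not the code from the question ), comparing the cost of spawning one process against the cost of the "useful" work it is supposed to carry :

```python
# A minimal sketch ( the workload is an illustrative stand-in, not the code
# from the question ): compare the add-on cost of spawning one process
# against the cost of the "useful" work it is supposed to carry.
from multiprocessing import Process
import time

def useful_work():                                   # stand-in payload
    s = 0
    for i in range( 10**6 ):
        s += i * i
    return s

if __name__ == "__main__":
    t0 = time.perf_counter()
    useful_work()                                    # pure, in-process baseline
    t_work = time.perf_counter() - t0

    t0 = time.perf_counter()
    p = Process( target = useful_work )
    p.start()                                        # pay the instantiation costs
    p.join()                                         # plus the O/S-scheduling costs
    t_spawned = time.perf_counter() - t0

    overhead = t_spawned - t_work
    print( f"useful work alone     : {t_work    * 1e3:8.2f} ms" )
    print( f"spawn + work + join   : {t_spawned * 1e3:8.2f} ms" )
    print( f"add-on overhead burnt : {overhead  * 1e3:8.2f} ms"
           f" ( ~{100 * overhead / t_work:5.1f} % of the useful work )" )
```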
This fraction is cardinal when fighting for both performance & efficiency ( of resource-usage patterns ).
Cases, where the add-on overhead costs dominate the useful work, are a straight sin of performance anti-patterns.
Cases, where the add-on overhead costs amount to less than 0.01 % of the useful work, may still end up with unsatisfactorily low speed-ups ( see the simulator and the related details ).
Cases, where the scope of the useful work dwarfs all add-on overhead costs, still meet the Amdahl's Law ceiling - so self-explanatory that it is also called a "Law of diminishing returns" ( since adding more resources ceases to improve the performance, even if we add infinitely many CPU-cores and the like ).
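For orientation, a small sketch of such an economy-of-costs, assuming an overhead-strict re-formulation of Amdahl's Law ( an assumed model, the simulator's exact formula may differ ) :

```python
# An assumed overhead-strict re-formulation of Amdahl's Law ( a sketch,
# not necessarily the simulator's exact formula ):
#   p ... fraction of the work that can run in [PARALLEL]    ( 0.0 .. 1.0 )
#   N ... number of CPU-cores / spawned processes
#   o ... add-on cost per spawned process, as a fraction of the
#         [PARALLEL]-part's useful work                      ( 0.01 == 1 % )
def speedup( p, N, o ):
    return 1.0 / ( ( 1.0 - p )       # the pure-[SERIAL] part never shrinks
                 + p / N             # the best-case [PARALLEL] split
                 + N * o * p )       # the add-on costs, paid once per process

for N in ( 1, 2, 4, 8, 16, 32, 64, 128, 256 ):
    print( f"p = 0.95, o = 0.01, N = {N:3d} -> Speedup ~ {speedup( 0.95, N, 0.01 ):6.2f} x" )
```

Note how, under this assumed cost model, the speed-up first peaks and then degrades, once the N-times paid spawning costs outgrow the benefits of the ever finer work-split.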

Tips for experimenting with the Costs-of-instantiation(s) :
We start with assuming the code is 100 % parallelisable ( having no pure-[SERIAL] part, which is never the case in reality, yet let's use it ).
- first, move the NCPUcores-slider in the simulator to anything above 64-cores
- next, move the Overhead-slider in the simulator to anything above a plain zero ( it expresses the relative add-on cost of spawning one of the NCPUcores processes, as a number of percent, relative to the [PARALLEL]-section part's number of instructions ). Mathematically "dense" work has many such "useful" instructions and may, supposing no other performance killers jump out of the box, afford to spend some reasonable amount of "add-on" costs to spawn some amount of concurrent- or parallel-operations ( the actual number depends only on the actual economy of costs, not on how many CPU-cores are present, and even less on our "wishes" or on scholastic or, even worse, copy/paste-"advice" ). On the contrary, mathematically "shallow" work almost always ends up with "speed-ups" << 1 ( i.e. immense slow-downs ), as there is almost no chance to justify the known add-on costs ( paid on thread/process-instantiations and on data SER/xfer/DES when moving params-in and results-back, the worse if among processes; a sketch measuring these SER/DES costs alone follows right after this list )
- next, move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case when the actual thread/process-spawning overhead costs ( a time lost ) are still not more than just <= 1 % of all the computing-related instructions that are next going to be performed for the "useful" part of the work, to be computed inside each such spawned process-instance. So even such a 1:100 proportion factor ( doing 100x more "useful" work than the CPU-time lost on arranging that many copies and on making the O/S-scheduler orchestrate their concurrent execution inside the available system Virtual-Memory ) already shows all the warnings visible in the graph of the progression of the Speed-up degradation - just play a bit with the Overhead-slider in the simulator, before touching the others...
- only now touch and move the p-slider in the simulator to anything less than 100 % ( so far we assumed no pure-[SERIAL] part of the problem execution, which was nice in theory, yet is never doable in practice, as even the program launch is a pure-[SERIAL] section, by design )
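As promised above, a minimal sketch of measuring the SER/DES costs alone ( numpy and the array size are illustrative assumptions ) :

```python
# A minimal sketch: how much CPU-time the SER/xfer/DES part alone burns,
# before any "useful" work even starts ( numpy and the ~80 MB array size
# are illustrative assumptions ).
import pickle, time
import numpy as np

params_in = np.random.rand( 10**7 )                  # ~80 MB of params to move

t0   = time.perf_counter()
blob = pickle.dumps( params_in, protocol = pickle.HIGHEST_PROTOCOL )   # SER
t1   = time.perf_counter()
back = pickle.loads( blob )                                            # DES
t2   = time.perf_counter()

print( f"SER : {( t1 - t0 ) * 1e3:8.2f} ms for {len( blob ) / 1e6:6.1f} MB" )
print( f"DES : {( t2 - t1 ) * 1e3:8.2f} ms" )
# the actual xfer among processes adds pipe / socket or shared-memory costs on top
```

The same price is paid again for moving the results back, and it multiplies with the number of processes fed this way.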
So,
besides straight errors,
besides performance anti-patterns,
there is a lot of technical reasoning behind ILP, SIMD-vectorisation and cache-line respecting tricks, which first squeeze out the maximum possible single-core performance the task can ever get ( a minimal vectorisation sketch follows below this list )
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform's constraints, as ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
no other magic wand available here.
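A minimal single-core sketch of the vectorisation point above ( numpy and the vector size are illustrative assumptions ) : the very same "useful" work, once as an interpreted loop, once as a SIMD-friendly, cache-line respecting vectorised call :

```python
# A minimal sketch ( numpy and the vector size are illustrative assumptions ):
# the same "useful" work, interpreted loop vs. SIMD-friendly vectorised call.
import time
import numpy as np

x = np.random.rand( 10**7 )

t0 = time.perf_counter()
s_loop = 0.0
for v in x:                                 # interpreted, item-by-item work
    s_loop += v * v
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
s_vec = float( np.dot( x, x ) )             # vectorised, cache-line friendly sweep
t_vec = time.perf_counter() - t0

print( f"loop       : {t_loop * 1e3:9.2f} ms" )
print( f"vectorised : {t_vec  * 1e3:9.2f} ms ( ~{t_loop / t_vec:5.0f} x faster )" )
```

Only once this single-core ceiling has been reached does it make sense to start paying for spawning any additional streams of code-execution.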
Details matter, always. The CPU / NUMA architecture details matter, and a lot. Check all the possibilities of the actual native architecture, as without respecting these details, the runtime performance will not reach the capabilities that are technically available.
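A minimal sketch of such a check, right from the code ( os.sched_getaffinity() is Linux-only; lscpu / numactl are standard Linux utilities and may not be installed everywhere ) :

```python
# A minimal sketch: inspect what the platform actually offers, before deciding
# how many processes are worth spawning ( Linux-only calls / tools are noted ).
import os, multiprocessing, subprocess

print( "CPU-cores reported     :", multiprocessing.cpu_count() )
print( "CPU-cores usable by us :", len( os.sched_getaffinity( 0 ) ) )   # Linux-only

for tool in ( ( "lscpu", ), ( "numactl", "--hardware" ) ):              # NUMA details
    try:
        print( subprocess.run( tool, capture_output = True, text = True ).stdout )
    except FileNotFoundError:
        pass                                   # the tool is not installed here
```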