
An obligatory assurance that I have read the many posts on the topic before posting this. I'm aware that multiprocessing entails a fixed cost, but to the best of my knowledge this doesn't seem to be the problem here.

I basically have a number of separate optimisation problems, and want to solve them in parallel. The following code is a simple example:

import psutil
import multiprocessing as mp
import time
from scipy.optimize import minimize
import numpy as np

pset = np.random.uniform(-10,10,500)

def func(x, p):
    return (x - p)**2

def object(p):                      # one independent optimisation problem
    def func2(x):
        return func(x, p)
    output = minimize(func2,
                      x0,
                      method='trust-constr')
    return output.x

# 1. Loop
tic = time.perf_counter()
out_list = []
x0 = 0

for p in pset:
    xstar = object(p)
    out_list.append(xstar)

#print(np.vstack(out_list))

toc = time.perf_counter()
print(f'Loop done in {toc - tic:0.4f} seconds')

# 2. Pool
n_cpu = psutil.cpu_count(logical = False)
if __name__ == '__main__':
    pool = mp.Pool(n_cpu)
    #results = pool.map_async(object, pset).get()
    results = pool.map(object,pset)
    pool.close()
    pool.join()
    #print(np.vstack(results))

toc2 = time.perf_counter()
print(f'Pool done in {toc2 - toc:0.4f} seconds')

As you can see, the 'pool' method takes longer, and increasingly so the more problems there are to solve (hence my conjecture that this isn't a fixed-cost issue). My actual optimisation problem is a lot more complicated than this: the loop takes a few minutes to solve, say, 3 problems, while 'pool' keeps running for a long, long time, at least 15 minutes before I decide to force-terminate it.

What could be the reason for this inferior performance? Is there a problem with using parallel computing for optimisation problems? What other tricks could I try to speed things up?

nutix
  • In general, multiprocessing can incur many costs, not necessarily fixed. There is also no guarantee that parallelizing your algorithm will improve performance. Just glancing at what you are doing, `scipy.optimize` may even be parallelized already, in which case, throwing more processes at it will almost certainly *slow things down*. – juanpa.arrivillaga Nov 01 '20 at 20:15
  • the work you shard out is very short. If possible find longer tasks to compensate for coordination overhead. – Jörg Beyer Nov 01 '20 at 20:20
  • @juanpa.arrivillaga Thanks. I don't fully understand what you meant by `scipy.optimize` already being parallelized. For what it's worth, with the loop version Task Manager reports about 20% of CPU used. With 'Pool', it's always close to 100%. And since all optim problems are independent, it seems intuitive that I should save time with more workers. – nutix Nov 01 '20 at 21:12

1 Answer


Q : "What could be the reason for this inferior performance?"

Welcome to the real world of how computing devices actually work.

Your code commits a few performance sins, so let's dig them up, OK?


The code makes 500 .map()-ed calls of the object() function and pays an immense upfront cost for that: spawning n copies of the whole Python session into replicated processes (this is the price of escaping the monopolistic GIL-lock re-[SERIAL]-isation that would otherwise appear - read other great posts on this subject, if not aware of the GIL details). Yet the actual work delegated to each of these expensive "remote" processes is just one .minimize() call, driven by a square of a difference. In other words, the full memory image of the main Python interpreter process, all data structures included, gets allocated and copied .cpu_count()-times (with swapping on top, if you head-bang into the physical RAM size or the O/S memory-manager ceilings). That is what happens on the Windows O/S; the costs are somewhat less devastating on Linux O/S-es.
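If you want to see just this one item on the bill, a minimal sketch is enough (this is not the poster's code; the noop helper and the pool size of 4 are purely illustrative): create a Pool, give it next to no work, and time it.

import multiprocessing as mp
import time

def noop(_):
    return None                              # no useful work at all

if __name__ == '__main__':
    tic = time.perf_counter()
    with mp.Pool(4) as pool:                 # spawn 4 full interpreter processes
        pool.map(noop, range(4))             # trivial tasks, so nearly all time is overhead
    toc = time.perf_counter()
    print(f'Pool spawn + trivial map + teardown: {toc - tic:0.4f} s')

Almost everything this prints is pure process-instantiation overhead, paid before a single useful instruction runs.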

So quite an expensive product of calling just an innocent pool.map() SLOC, isn't it?

Next come the SER/DES and communication costs of passing parameters "there" and results "back". Here that means a few kB of payload on the way there and a few B of results on the way back, so you happily do not feel much of this particular pain (it can hurt badly in other, less happy use-cases), yet you still pay it: an additional add-on overhead, once per each of the 500 .map()-ed calls.
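A rough way to put a number on this part is to pickle one parameter and one result yourself and extrapolate over the 500 calls (a sketch only; the repeat count is made up and the pipe-transfer time itself is not included):

import pickle
import time
import numpy as np

p = np.float64(3.14159)                  # one element of pset, going "there"
xstar = np.array([3.14159])              # one minimize() result, coming "back"

reps = 10_000
tic = time.perf_counter()
for _ in range(reps):
    pickle.loads(pickle.dumps(p))        # SER/DES of the argument
    pickle.loads(pickle.dumps(xstar))    # SER/DES of the result
toc = time.perf_counter()
print(f'~{(toc - tic) / reps * 1e6:0.2f} us of SER/DES per .map()-ed call')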

Next comes the worse part. Having requested .cpu_count()-many processes, each of them has to stand in the O/S process-scheduler queue to grab a CPU-core and execute for a scheduler-granted slice of time, before being forcibly moved out of the CPU-core so that some other O/S-assigned process can move in and execute (this is how process scheduling works). You can already smell the smoke: this adds yet another layer of add-on overhead, consumed by that (heavy) lifting, once many processes sit waiting in the queue for their respective turns.
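You cannot take the O/S scheduler out of the game, but you can at least make each visit to the queue worth more. A sketch, assuming the object() function and the pset array from the question, that hands each worker a whole batch of problems per dispatch instead of one:

if __name__ == '__main__':
    n_cpu = psutil.cpu_count(logical=False)
    with mp.Pool(n_cpu) as pool:
        # ~4 batches per worker; bigger chunks mean fewer dispatch round-trips
        chunk = max(1, len(pset) // (n_cpu * 4))
        results = pool.map(object, pset, chunksize=chunk)

This does not remove any of the costs above, it only amortises the per-call ones over larger pieces of work.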

Next comes the worst penalty: you easily, almost as a rule, start paying ~300x higher memory-access costs. Each time your process re-enters a CPU-core, it has lost the chance to reuse the fastest computing resource, the on-core L1 data cache: whatever fast-to-be-reused data sat there has been overwritten by whichever processes used that core while yours was waiting in the queue, leaving only LRU-cached data you never need. Your process then pays extra to re-fetch its own data from main RAM, at not less than ~100-380 [ns] per fetch, and only if the memory-I/O channels happen to be free and it does not have to wait for one. This will most probably happen on every O/S process-scheduler move-out/move-in cycle; in most cases you do not even land on the same CPU-core you were camped on last time, so there is no chance to speculate on any residuals "remaining" in that core's L1 data cache from your last round.

Last but not least, contemporary CPUs perform yet another level of process-to-core "optimisation", this time to reduce the risk of hitting the thermal ceiling. Processes are moved around even more often, so that they can run on colder CPU-cores while the "hot" cores cool down. That strategy works fine for light workloads, where a few computing-intensive processes can enjoy thermally-motivated hops from a hot core to a colder one; were a process left on the hot core, the hot silicon would force that core's frequency down to stay under the thermal ceiling, and with fewer CPU-clocks less computing would get done until the core cooled again (a sort of oxymoron for heavy-computing jobs, isn't it?). With only a few processes on a multicore CPU this thermal hopping may look attractive, showing high-GHz clocking and jumps from hot cores to cold ones, but - for obvious reasons - it stops working once your .map()-ed processes "cover" all CPU-cores (not to mention all the other O/S-orchestrated processes in the queue). The only possible result is that all the now-hot cores lower their frequency and work in slow motion, so as to withstand the thermal-ceiling limits.


Weird?

No. This is how contemporary silicon toys work. Speculative strategies work nicely, but only for a few, rather light-weight workloads. Beyond that you start suffering from the reality of all the constraints the laws of physics impose, constraints that were until then hidden behind over-hyped marketing slogans valid only for a vast excess of (cold) resources and light-weight computing / memory-I/O traffic patterns.

More reading on this, plus an analysis of the modern criticism of the original (overhead-naive) Amdahl's Law argument, is here.
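For a back-of-the-envelope feel of that overhead-strict reformulation, a tiny sketch (the numbers are invented; only the structure of the formula matters):

def speedup(p, N, o_setup, o_transfer):
    # overhead-strict Amdahl's Law: the add-on overheads are expressed as
    # fractions of the original single-process runtime
    return 1.0 / ((1.0 - p) + p / N + o_setup + o_transfer)

print(speedup(p=0.95, N=4, o_setup=0.0, o_transfer=0.0))   # naive Amdahl: ~3.5x
print(speedup(p=0.95, N=4, o_setup=0.6, o_transfer=0.3))   # with add-on overheads: ~0.84x, a slowdown

Once the add-on overhead terms grow comparable to the useful work, no number of workers N can rescue the speedup, which is exactly what the question observes.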

So, welcome to the reality of computing :o)

user3666197
    Thanks so much for this detailed explanation. Admittedly a lot flew over my head, but I now have a clearer idea of what the 'upfront' cost can come from (which I understand now isn't fixed - just upfront). – nutix Nov 06 '20 at 16:05
  • Always welcome, @nutix Feel free to click an Accept check mark right under the Answer's score to sign-off the "accept"-ance of the answer delivered :o) This is how StackOverflow works since ever :o) – user3666197 Nov 06 '20 at 18:45