I have seen other related questions (like this one) but none of them actually answers my question, so here goes:
I have an obviously embarrassingly parallel task to perform: my own hand-rolled version of GridSearch. In simple words, I have a set of parameters and want to evaluate my model on each of them. There is no dependence between those runs, so the code looks like this:
import multiprocessing

pool = multiprocessing.Pool(processes=4)    # one worker per parameter set
scores = pool.map(evaluator, permutations)  # blocks until all evaluations finish
where evaluator is a function that computes a score given a dict of parameters, and permutations is a list of such dictionaries (of length 4 in this case).
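For completeness, a stripped-down, self-contained version of the setup looks like this; the evaluator body and the parameter values below are just CPU-bound placeholders standing in for my real model:

import multiprocessing

def evaluator(params):
    # Placeholder for the real model evaluation: pure CPU-bound work that
    # takes roughly the same time for every parameter set.
    total = 0.0
    for i in range(10_000_000):
        total += (i % 7) * params["alpha"]
    return total

if __name__ == "__main__":
    # One dict per run; 4 parameter sets to match the 4 worker processes.
    permutations = [{"alpha": a} for a in (0.1, 0.5, 1.0, 2.0)]

    pool = multiprocessing.Pool(processes=4)
    scores = pool.map(evaluator, permutations)
    pool.close()
    pool.join()
    print(scores)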
Now my assumption is that using 4 processes (on an 8-core machine) should give me a 4x speedup (note that the evaluator takes the same amount of time regardless of the parameter set, so the load is perfectly balanced).
Instead, my timing yielded these results:
Using 4 processes, each evaluation takes 82 sec to complete, so the total time is 84 sec.
Using 1 process, each evaluation takes 43 sec to complete, so the total time is 170 sec.
So in the end I only get a 2x speedup from 4 cores. Why is each individual evaluation faster when there are fewer processes?
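In case the measurement itself matters, here is a self-contained way to reproduce the comparison (again with a placeholder evaluator rather than my real model), timing both each worker's own evaluation and the total wall time:

import multiprocessing
import time

def evaluator(params):
    # CPU-bound placeholder for the real model evaluation.
    total = 0.0
    for i in range(10_000_000):
        total += (i % 7) * params["alpha"]
    return total

def timed_evaluator(params):
    # Report each worker's own elapsed time alongside its score.
    start = time.perf_counter()
    score = evaluator(params)
    return score, time.perf_counter() - start

if __name__ == "__main__":
    permutations = [{"alpha": a} for a in (0.1, 0.5, 1.0, 2.0)]
    for n_procs in (1, 4):
        wall_start = time.perf_counter()
        with multiprocessing.Pool(processes=n_procs) as pool:
            results = pool.map(timed_evaluator, permutations)
        wall = time.perf_counter() - wall_start
        per_task = ", ".join("%.1fs" % t for _, t in results)
        print("%d process(es): per-task [%s], total wall %.1fs" % (n_procs, per_task, wall))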