Why is parallel programing through multiprocessing library taking longer than usual programming?

Question

EDIT: I changed the cubic function and defined instead the actual optimization problem in which I noticed this issues. I also changed the way chunks are generated, based on the comments received.

I am trying to parallelize some code in order to reduce the computation time, but executing the parallelized version of the code is taking longer than the non-parallelized version. I will show an easy example:

import multiprocessing as mp
import time
import numpy as np
from cvxpy import *
import functools
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data
y = boston.target

def lm_l_solver(x, y, lambda1):
    n = x.shape[0]
    m = x.shape[1]
    lambda_param = Parameter(sign="positive")
    beta_var = Variable(m)
    lasso_penalization = lambda_param * norm(beta_var, 1)
    lm_penalization = (1/n) * sum_squares(y - x * beta_var)
    objective = Minimize(lm_penalization + lasso_penalization)
    problem = Problem(objective)
    # Solve optimization problem
    beta_sol_matrix = np.zeros((len(lambda1), 1, m))
    for i in range(len(lambda1)):
        lambda_param.value = lambda1[i]
        problem.solve(solver=CVXOPT)
        beta_sol = np.asarray(np.row_stack([b.value for b in beta_var])).flatten()
        beta_sol_matrix[i, :] = beta_sol
    beta_sol_matrix[np.abs(beta_sol_matrix) < 1e-4] = 0
    # Generate response
    response = dict(solution=beta_sol_matrix, lambda1=lambda1)
    return response

if __name__ == '__main__':
    vector = np.arange(1, 100, 1)
    start_time = time.time()
    chunks = np.array_split(vector, mp.cpu_count())
    pool = mp.Pool(processes=mp.cpu_count())
    results = pool.map(functools.partial(lm_l_solver, x, y), chunks)
    pool.close()
    pool.join()
    end_time_1 = time.time()
    results2 = lm_l_solver(x, y, vector)
    end_time_2 = time.time()
    print('Parallel programming took {} seconds'.format(round(end_time_1-start_time, 2)))
    print('Non parallel programming took {} seconds'.format(round(end_time_2 - end_time_1, 2)))

The function lm_l_solver receives a data matrix x, a response vector y and a vector of possible lambda values, and solves a penalized linear model for each of the lambda values.

Executing this piece of code generates the following output:

Parallel programming took 5.28 seconds
Non parallel programming took 0.4 seconds

Why such a difference? The parallelized version of "lm_l_solver" took 13 times longer than the non-parallelized version. Am I doing something wrong here?

It's just a guess, but numpy is extremely efficient and is coded in C. Doing `cubic(vector)` is calling the function on the complete vector, which is extremely fast... The parallel version will have to spawn the process, do the chunk and then compute on each chunk the same vectorize cubic function. However, the difference still feels a bit important... — Mathieu, Jun 13 '18 at 08:54
I think that what you suggest may affect but cannot be the only answer. This cubic function is an easy example, but in the code in which I noticed this problem, the objective function to be parallelized works solving an optimization problem over a parameter, so numpy is not directly involved there. — Álvaro Méndez Civieta, Jun 13 '18 at 09:01
I am also doing optimization using the multiprocessing library, and I do not have the same behavior... Really weird :/ Could you try adding the number of processes to spawn in the `Pool`: `pool = mp.Pool(processes = mp.cpu_count())` but I think it already does spawn the correct number by default... — Mathieu, Jun 13 '18 at 09:04
Your `get_chunks()` function is flawed. Try this instead: `chunks = np.array_split(vector, mp.cpu_count())` — grepe, Jun 13 '18 at 09:26
I have modified the piece of code so now it shows the optimization problem I am trying to solve, and also includes the split-chunks you suggest @grepe. But as you can see, the problem persists. — Álvaro Méndez Civieta, Jun 13 '18 at 09:43
I tried your code but i put the `start_time` line after the pool was created and made `end_time` a global variable set from within the `cube()` function (so that the last call that finishes sets it). When I don't count the process pool initialization and closing, I'm getting slightly better times for the calculation itself than pure numpy, but not by much (even if I increase the size of the vector). — grepe, Jun 13 '18 at 10:14
How long overall is the optimization taking? There are many questions on here about "why is multiprocessing slower than single-process" and usually the answer is "because your task is too easy and the overhead of creating multiple processes outweighs the speed gain". — BrenBarn, Jun 16 '18 at 07:30
As mentioned above, there can be many causes for this problem. My first question is : what are the hardware configuration specification of your machine ? Is it the same PC and your are running this piece of code ? I am assuming this is the case. Now imaging your X and Y fit in your memory, in this case, your serial case will load both X and Y into your memory once and solve the system everytime with a different lambda. So there is no need to load the data everytime. — AdityaG, Jun 17 '18 at 18:41
Now consider the parallel execution : Since all the cores share the memory, when this code is run with lets say 6 cpus, each of them will try to load the X and Y into memory, so its 6 times the memory which may not fit into your physical memory. This will mean that, every time a cpu wants to solve with a lamda, it has to wait till the memory is free or access it from the hard disk. This will slowdown the whole process. In case if all the 6 * (sizeof(X) + sizeof(Y)) fits into your memory (Which I think is not the case) you can expect speedup. — AdityaG, Jun 17 '18 at 18:45
Also think of Cache misses which have a lot of influence on your performance. — AdityaG, Jun 17 '18 at 18:55
Ok I see your point, so would it be possible to modify the code in a way in which X and Y are loaded once and the parallelized code use the same X, Y all along the lambdas? Speaking about the hardware configurations, I am running this code on a windows machine, intel core I5 CPU 16GB ram memory. — Álvaro Méndez Civieta, Jun 18 '18 at 08:15
I'm failing to get your program to run. sign="positive" is not a valid kwarg for cvxpy's Parameter(). The docs tell me to use nonneg=True; Even with this, the solver complains about incompatible dimensions. Apart from that the dataset should easily fit into memory, as it's a few tens of kB (506*14*8=~57000) — Obay, Jun 18 '18 at 15:06
using partial is not a good idea I think. the shared variables creation as defined in https://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-multiprocessing will work I think. — AdityaG, Jun 18 '18 at 17:15
Which python version? Can you post your `requirements.txt`? That would help reproducing, thanks. (`sign` is unexpected keyword with Python3.6 and latest libraries) — Hugues Fontenelle, Jun 20 '18 at 08:10

Hugues Fontenelle · Answer 1 · 2018-06-20T09:42:06.323

It's a bit hard to reproduce exactly, as it depends on the configuration of your cluster. In my case I don't get much worse behaviour parallelising, whether I use my own 4-cores CPU or a small 24-cores server.

Anyway, the culprit is that your solver, CVXOPT, which uses BLAS, is already multithreaded. By trying to parallelise your code, you are competing with that linear algebra library. To prove my point, I forced BLAS to use only one thread. In that case, multiprocessing can show some advantage:

$ python solver.py
Parallel programming took 2.0 seconds
Non parallel programming took 2.73 seconds

$ OMP_NUM_THREADS=1 python solver.py
Parallel programming took 0.57 seconds
Non parallel programming took 2.73 seconds

Setting OMP_NUM_THREADS=1 basically turns off the OpenMP multi-threading, so each of your Python processes remains single-threaded; and BLAS also uses one thread.

For your application, you'll have to balance the number of threads (with OMP_NUM_THREADS) and the number of processes (in mp.Pool(processes=24) for example).

Reading references

Reproducibility

requirements.txt with Python 2.7.5:

numpy
cvxpy==0.4.11
sklearn
cvxopt

Chen A. · Accepted Answer · 2018-06-20T12:53:13.450

multiprocessing is used for CPU-bound functions. You can't really make performance comparison on tasks that lasts 0.5 seconds on a single process. Your testing scenario needs to be updated.

There is a lot of overhead when using multiprocessing compared to a single process. Python needs to spawn the new processes, handover the tasks, communicate the results back, etc.

Numpy actually executes external C code behind the scenes, which is optimized. That means it will execute very fast serially, on a single CPU.

The dispatch overhead dominates the actual computation time in your question.

You should increase your data input, so it takes Numpy about 1-2 minutes to complete on a single process, and then move to multiprocessing to make the comparison.

Besides that, you might get better results in multiprocessing if you would use imap_unordered instead of map, if the order is not important.

There's a good blogpost that makes some comparison between multithreading and multiprocessing, and it addresses Numpy operations too. I think you might find it helpful.

Why is parallel programing through multiprocessing library taking longer than usual programming?

2 Answers2