I have a problem where I need to solve thousands of independent nonnegative least squares problems using nnls in scipy. All problems are small, about 100x100 matrices. To speed things up I've tried to use the multiprocessing module in Python with the Pool class. I get about a factor-of-2 improvement if I set the number of threads in numpy to 1 and use multiprocessing, versus using multithreaded numpy with no multiprocessing. But the performance is very unpredictable. For instance, if I move sections of code into a separate function (to make it easier to read) or call pool.map inside a class method, performance can drop by 50%. So it seems like the multiprocessing module is too unreliable to be used this way.
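Roughly, my setup looks like the sketch below (the data here is a random placeholder and solve_one is just an illustrative name; I'm assuming an OpenBLAS/MKL numpy backend for the thread-limiting environment variables):

```python
import os
# Limit BLAS threading before importing numpy/scipy so the worker
# processes don't oversubscribe the CPU (assumes OpenBLAS/MKL backend).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from scipy.optimize import nnls
from multiprocessing import Pool


def solve_one(problem):
    """Solve a single small NNLS problem; `problem` is an (A, b) pair."""
    A, b = problem
    x, rnorm = nnls(A, b)
    return x


if __name__ == "__main__":
    # Placeholder data: thousands of independent ~100x100 problems.
    rng = np.random.default_rng(0)
    problems = [(rng.random((100, 100)), rng.random(100)) for _ in range(2000)]

    with Pool() as pool:
        solutions = pool.map(solve_one, problems, chunksize=50)
```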
Does anyone know what can cause this behaviour or know of a better alternative to multiprocessing?