
I have a strong background in numerical computation using Fortran and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelizing numeric tasks seems much more tedious than with OpenMP. I am often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the Python multiprocessing module for this and came up with this generic example:

#test cases    
#python parallel_python_example.py 1000 1000
#python parallel_python_example.py 10000 50

import sys
import numpy as np
import time
import multiprocessing
import operator

n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])

#class which contains large dataset and computationally heavy routine
class compute:
    def __init__(self,n_dim,n_vec):
        self.large_matrix=np.random.rand(n_dim,n_dim)#define large random matrix
        self.many_vectors=np.random.rand(n_vec,n_dim)#define many random vectors which are organized in a matrix
    def dot(self,a,b):# don't use numpy here, so this runs on a single core only
        return sum(p*q for p,q in zip(a,b))
    def __call__(self,ii):# use __call__ as computation such that it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix,self.many_vectors[ii,:])#compute product of one of the vectors and the matrix
        return self.dot(vector,vector)# return "length" of the result vector

#initialize data
comp = compute(n_dim,n_vec)

#single core
tt=time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time()-tt
print "Time:",time_single

#multi core
for prc in [1,2,4,10]:# the 10-process case is there to check that large_matrix is held in main memory only once
  tt=time.time()
  pool = multiprocessing.Pool(processes=prc)
  result = pool.map(comp,range(n_vec))
  pool.terminate()
  time_multi = time.time()-tt  
print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc,time_multi,time_single/time_multi)

I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:

andre@lot:python>python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using  1 processes. Time:   15.75869, Speedup:   0.65785
Time using  2 processes. Time:   11.62338, Speedup:   0.89189
Time using  4 processes. Time:   15.13109, Speedup:   0.68513
Time using 10 processes. Time:   31.31193, Speedup:   0.33108
andre@lot:python>python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using  1 processes. Time:    5.14456, Speedup:   0.95954
Time using  2 processes. Time:    2.81755, Speedup:   1.75201
Time using  4 processes. Time:    1.64475, Speedup:   3.00131
Time using 10 processes. Time:    1.60147, Speedup:   3.08242

My question is: am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e., don't parallelize within Python but rely entirely on numpy's optimizations)?

Andre
  • Numerical calculations are fine to run over threads. – Jakob Bowyer Jul 24 '13 at 08:36
  • That is what the second example shows, but in the first one (huge matrix with some multiplications) there seems to be an enormous overhead - which I can't remember seeing with OpenMP and Fortran – Andre Jul 24 '13 at 08:39
  • @JakobBowyer: Talking about the Python `threading` module? Then, no, CPU-bound applications won't gain from Python threading because of the GIL. Native threading, however, is awesome, but this must be supported by the underlying library. – Dr. Jan-Philip Gehrcke Jul 24 '13 at 09:51
  • @Jan-PhilipGehrcke: incorrect. Vectorized numpy computations that you should use anyway for performance *may release GIL* i.e., CPU-bound applications can gain performance even using `threading` module. – jfs Jul 24 '13 at 10:10
  • @J.F.Sebastian: I tried to cover this with "this must be supported by the underlying library", but this is quite a special case. I think it is important to make people aware of the fact that, generally spoken, Python's threading does not help when an application is CPU-bound. – Dr. Jan-Philip Gehrcke Jul 24 '13 at 13:11

3 Answers


While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key for great number-crunching performance in Python.

In principle, however, Python (plus third-party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance writing (much!) less code than you achieved before doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:

  • You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!

  • multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be programmed explicitly and happens mainly through pipes. Processes that talk a lot to each other spend most of their time talking rather than number crunching. Hence, multiprocessing is best used when the transmission time for input and output data is small compared to the computing time. There are also tricks: you can, for instance, make use of Linux's fork() behavior and share large amounts of (read-only!) memory among multiple multiprocessing processes without having to pass this data around through pipes; see the sketch after this list. You might want to have a look at https://stackoverflow.com/a/17786444/145400.

  • Cython has already been mentioned; you can use it in special situations to replace performance-critical parts of your Python program with compiled code.
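
To illustrate the fork() trick mentioned above (a minimal sketch, not the code from the question; the array size, process count, and row_norm helper are made up for the example): on Linux, anything created before the Pool is constructed is inherited by the worker processes via copy-on-write, so the large array is never pickled and pushed through pipes.

import multiprocessing
import numpy as np

# Created at module level *before* the pool forks, so every worker inherits
# these memory pages via copy-on-write instead of receiving a copy over a pipe.
big_array = np.random.rand(2000, 2000)  # size is arbitrary for this sketch

def row_norm(i):
    # Workers only read big_array; as long as nothing writes to it,
    # the physical memory stays shared among all processes.
    row = big_array[i, :]
    return np.dot(row, row)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    norms = pool.map(row_norm, range(big_array.shape[0]))
    pool.close()
    pool.join()

This relies on the fork start method (the default on Linux); as soon as a worker writes to the shared array, the touched pages are copied for that process only.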

I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP8 when writing Python code :-)) and (b) especially regarding number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on the communication overhead.

Generally speaking, you should always try to find a way, from within Python, to have compiled code do the heavy work for you. NumPy and SciPy provide great interfaces for that.
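
As a small illustration of that point (a sketch of the same matrix-vector computation as in the question, with sizes hard-coded for brevity; it is not meant as a benchmark): one vectorized numpy expression replaces both hand-written Python loops, and with an MKL- or OpenBLAS-backed numpy the matrix product may already use several threads without any explicit parallel code.

import numpy as np

# np.show_config() shows which BLAS/LAPACK the build links against; for
# MKL/OpenBLAS builds the thread count can usually be capped with the
# OMP_NUM_THREADS or MKL_NUM_THREADS environment variables.
n_dim, n_vec = 1000, 1000
large_matrix = np.random.rand(n_dim, n_dim)
many_vectors = np.random.rand(n_vec, n_dim)

# A single call into compiled BLAS code computes all n_vec matrix-vector products ...
products = many_vectors.dot(large_matrix)   # shape (n_vec, n_dim)
# ... and the squared "length" of each resulting vector is a vectorized reduction.
lengths = (products ** 2).sum(axis=1)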

Dr. Jan-Philip Gehrcke
  • Thanks for this detailed answer to my question! Since you also mention Cython, which I will definitely have a look into, and other points, I'm going to mark this as the answer to my question. – Andre Jul 24 '13 at 10:18

Number crunching with Python... You probably should learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as the backend.

hivert

From the test results you supplied, it appears that you ran your tests on a two core machine. I have one of those and ran your test code getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.

On my two-core machine, approximately 20% of the CPU is absorbed simply in keeping my environment going, so when I see a 1.8× speedup running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work the more cores the better, as this raises the percentage of the machine that is available to do your work.

The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.

Jonathan
  • Hm, seems odd to me; I have a 4-core system (i7) with 8 logical cores, so I was hoping to see a speedup of approx. 3.5. – Andre Jul 24 '13 at 12:05
  • @Andre You should use a system monitor to see how much of your system is going to other processes while your test runs. – Jonathan Jul 24 '13 at 12:43