
I have a conceptual question about Python's multiprocessing library:
Why do processes that are started (almost) simultaneously at least appear to execute serially rather than in parallel?

The task is to manage a universe of a large number of particles (a particle being a set of x/y/z coordinates and a mass) and perform various analyses on them while taking advantage of a multi-processor environment. For the example shown below I specifically want to calculate the center of mass of all particles.
Because the task explicitly asks for multiple processors, I didn't use the threading library, since the GIL constrains execution to one processor at a time.
Here's my code:

from multiprocessing import Process, Lock, Array, Value
from random import random
import math
from time import time

def exercise2(noOfParticles, noOfProcs):
    startingTime = time()
    particles = []
    processes = []
    centerCoords = Array('d',[0,0,0])
    totalMass = Value('d',0)
    lock = Lock()

    #create all particles
    for i in range(noOfParticles):
        p = Particle()
        particles.append(p)

    for i in range(noOfProcs):
        #determine the number of particles every process needs to analyse
        particlesPerProcess = math.ceil(noOfParticles / noOfProcs)
        #create noOfProcs Processes, each with a different set of particles        
        p = Process(target=processBatch, args=(
            particles[i*particlesPerProcess:(i+1)*particlesPerProcess],
            centerCoords, #handle to shared memory
            totalMass, #handle to shared memory
            lock, #handle to lock
            'batch'+str(i)), #also pass name of process for easier logging
            name='batch'+str(i))
        processes.append(p)
        print('created proc:',i)

    #start all processes
    for p in processes:
        p.start() #here, the program waits for the started process to terminate. why?

    #wait for all processes to finish
    for p in processes:
        p.join()

    #normalize the coordinates
    centerCoords[0] /= totalMass.value
    centerCoords[1] /= totalMass.value
    centerCoords[2] /= totalMass.value

    print(centerCoords[:])
    print('total time used', time() - startingTime, ' seconds')


class Particle():
    """a particle is a very simple physical object, having a set of x/y/z coordinates and a mass.
    All values are randomly set at initialization of the object"""

    def __init__(self):
        self.x = random() * 1000
        self.y = random() * 1000
        self.z = random() * 1000
        self.m = random() * 10

    def printProperties(self):
        attrs = vars(self)
        print ('\n'.join("%s: %s" % item for item in attrs.items()))

def processBatch(particles,centerCoords,totalMass,lock,name):
    """calculates the mass-weighted sum of all coordinates of all particles as well as the sum of all masses.
    Writes the results into the shared memory centerCoords and totalMass, using lock"""

    print(name,' started')
    mass = 0
    centerX = 0
    centerY = 0
    centerZ = 0

    for p in particles:
        centerX += p.m*p.x
        centerY += p.m*p.y
        centerZ += p.m*p.z
        mass += p.m

    with lock:
        centerCoords[0] += centerX
        centerCoords[1] += centerY
        centerCoords[2] += centerZ
        totalMass.value += mass

    print(name,' ended')

if __name__ == '__main__':
    exercise2(2**16,6)

Now I'd expect all processes to start at about the same time and execute in parallel. But looking at the program's output, it appears as if the processes were executing serially:

created proc: 0
created proc: 1
created proc: 2
created proc: 3
created proc: 4
created proc: 5
batch0  started
batch0  ended
batch1  started
batch1  ended
batch2  started
batch2  ended
batch3  started
batch3  ended
batch4  started
batch4  ended
batch5  started
batch5  ended
[499.72234074100135, 497.26586187539453, 498.9208784328791]
total time used 4.7220001220703125  seconds

Also, when stepping through the program in the Eclipse debugger, I can see that it always waits for one process to terminate before starting the next one, at the line marked with the comment ending in 'why?'. Of course, this might just be an artifact of the debugger, but the output of a normal run shows exactly the same picture.

  • Are those processes actually executing in parallel and I just can't see it because of some problem with how stdout is shared?
  • If the processes are executing serially: why? And how can I make them run in parallel?

Any help on understanding this is greatly appreciated.

I executed the above code from PyDev and from the command line, using Python 3.2.3 on a Windows 7 machine with a dual-core Intel processor.


Edit:
Judging only from the program's output, I misdiagnosed the problem: the processes are actually running in parallel, but the overhead of pickling large amounts of data and sending it to the subprocesses takes so long that it completely distorts the picture.
Moving the creation of the particles (i.e. the data) to the subprocesses so that they don't have to be pickled in the first place removed all problems and resulted in a useful, parallel execution of the program.
To solve the task, I will therefore have to keep the particles in shared memory so they don't have to be passed to subprocesses.
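
A minimal sketch of that change (a reconstruction for illustration, not the exact final code): processBatch creates its own particles and receives only particlesPerProcess instead of a slice of the particle list, so nothing large has to be pickled.

def processBatch(noOfParticles, centerCoords, totalMass, lock, name):
    """Creates its own particles locally instead of receiving them pickled from
    the parent process; only the four accumulated sums go back through the
    shared-memory objects, using lock."""
    print(name, ' started')
    mass = 0
    centerX = 0
    centerY = 0
    centerZ = 0

    for _ in range(noOfParticles):
        p = Particle()  # created in the worker, so no particle data crosses the process boundary
        centerX += p.m*p.x
        centerY += p.m*p.y
        centerZ += p.m*p.z
        mass += p.m

    with lock:
        centerCoords[0] += centerX
        centerCoords[1] += centerY
        centerCoords[2] += centerZ
        totalMass.value += mass

    print(name, ' ended')

In exercise2, the Process would then be created with args=(particlesPerProcess, centerCoords, totalMass, lock, 'batch'+str(i)), and the particle list in the parent disappears entirely.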


1 Answer


I ran your code on my system (Python 2.6.5) and it returned almost instantly with results, which makes me think that perhaps your task size is so small that the processes finish before the next can begin (note that starting a process is slower than spinning up a thread). I question the total time used 4.7220001220703125 seconds in your results, because that's about 40x longer than it took my system to run the same code. I scaled up the number of particles to 2**20, and I got the following results:

('created proc:', 0)
('created proc:', 1)
('created proc:', 2)
('created proc:', 3)
('created proc:', 4)
('created proc:', 5)
('batch0', ' started')
('batch1', ' started')
('batch2', ' started')
('batch3', ' started')
('batch4', ' started')
('batch5', ' started')
('batch0', ' ended')
('batch1', ' ended')
('batch2', ' ended')
('batch3', ' ended')
('batch5', ' ended')
('batch4', ' ended')
[500.12090773656854, 499.92759577086059, 499.97075039983588]
('total time used', 5.1031057834625244, ' seconds')

That's more in line with what I would expect. What do you get if you increase the task size?

  • Hi Brendan. Thanks a lot for your help. Your computer seems to be far more powerful than mine, as mine really works on this task size. Between two processes, there is a gap of about one second. When I add more particles, the behaviour is the same, but it takes 70 seconds to finish. The peculiar thing is that the gap is between one process ending and the next one starting and not between the 'start' and the 'end' message of the same process – Philip Schaffner Jun 04 '12 at 19:06
  • Could you please modify your code to print a timestamp at the beginning and end of each process? That is, add a third argument `time()` to the `print(...)` statements in the `processBatch` function. – Brendan Wood Jun 04 '12 at 19:13
  • That was a very good idea! `batch0 started at time 11.016999959945679 batch0 ended at time 11.111000061035156 batch1 started at time 11.046000003814697 batch1 ended at time 11.141000032424927 ...[and so on]...` I can see now that all processes are indeed running in parallel. The funny thing is that, while 'batch0 started' and 'batch0 ended' appeared almost simultaneously, after the output of 'batch0 ended at', for about 10 seconds, nothing happens. And only after that time does 'batch1 started' appear. This gap between the messages scales with the problem size – Philip Schaffner Jun 04 '12 at 19:19
  • Okay, so things are happening in parallel, but the print statements aren't getting flushed in chronological order and there's some strange blocking happening at some point. Try forcing the flush by adding `import sys` at the top of your code and `sys.stdout.flush()` after each print statement in `processBatch`. http://stackoverflow.com/questions/230751/how-to-flush-output-of-python-print – Brendan Wood Jun 04 '12 at 19:32
  • Thanks for the link. I already found this piece before and tried it, but sadly it doesn't change anything. I just can't figure out what is happening between one process ending and the next one starting that takes so much time... – Philip Schaffner Jun 04 '12 at 19:37
  • Two more suggestions: 1) Try using a 2.x version of Python for comparison purposes (works fine for me); 2) The one last suggestion I have would be to remove the lock that you're using to prevent race conditions. I know your code is technically wrong without it, but it would be interesting to see if that helps. If it does, you need to figure out a different way of getting around this (i.e., perhaps having each process with its own `centerCoords` variable and merging them in the main thread at the end). – Brendan Wood Jun 04 '12 at 19:39
  • What's taking so long is pickling your list of particles and sending it to the subprocesses. It seems this is way more inefficient on Windows than on Linux. – mata Jun 04 '12 at 19:43
  • @mata that's an interesting idea. Do you have a link with more information about that? – Brendan Wood Jun 04 '12 at 19:46
  • Thank you so much for your input. Mata's comment is absolutely correct: When I move the creation of the particles into the subprocesses so that they don't have to be pickled, the whole program terminates within 0.2 seconds. That was very enlightening, really appreciate the effort of you two. I will therefore accept this answer and see if I can find a way to also reward mata for this. – Philip Schaffner Jun 04 '12 at 19:54
  • Threads are lightweight processes on Linux and there is very little difference in startup times – Spaceghost Jun 05 '12 at 12:18
  • @Spaceghost I just ran a test and the difference is quite significant. The overhead of starting a `multiprocessing.Process` versus a `threading.Thread` is about 10x. Not a big deal if each thread/process has enough work to make the startup time insignificant, but something to think about if your threads/processes are short-lived. Here's the test script I used: http://pastebin.com/ZhHmPcfm (a rough sketch of such a comparison appears after these comments) – Brendan Wood Jun 05 '12 at 12:39
  • Brendan, careless wording on my part perhaps... creating a thread using the appropriate OS calls and switching between threads is not as expensive on Linux as on Windows, but you are using a library in an interpreted language. – Spaceghost Jun 05 '12 at 14:13
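
For reference, here is a rough sketch of a Process-versus-Thread startup comparison along the lines Brendan describes (an illustration only, not the linked pastebin script):

from multiprocessing import Process
from threading import Thread
from time import time

def noop():
    pass

def measureStartup(workerClass, count=100):
    """Starts and joins `count` workers that do nothing, so the measured time is
    dominated by creation and startup overhead rather than by actual work."""
    start = time()
    workers = [workerClass(target=noop) for _ in range(count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time() - start

if __name__ == '__main__':
    print('threads:  ', measureStartup(Thread))
    print('processes:', measureStartup(Process))

The exact ratio will vary between machines, but starting processes is noticeably slower than starting threads, particularly on Windows, where every child process spawns a fresh interpreter instead of being forked.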