0

I'm using Python to write an ideal gas simulator, and right now collision detection is the most intensive part of the program. At the moment, though, I'm only using one of my 8 cores (I'm using an i7 3770 @ 3.4 GHz).

After minimal googling I found the multiprocessing module for Python (2.7.4), and I've tried it. With a bit of thought I've realised the only thing I can really run in parallel is this part, where I loop through all the particles to detect collisions:

for ball in self.Objects:
    if not foo == ball:
        foo.CollideBall(ball, self.InternalTimestep)

Here foo is the particle that I'm testing against all the others. So I tried doing this:

for ball in self.Objects:
    if not foo == ball:
        p = multiprocessing.Process(target=foo.CollideBall, args=(ball, self.InternalTimestep))
        p.start()

Although the program does run a little faster, it's still only using 1.5 cores to their fullest extent; the rest are just idle, and it's not detecting any collisions either! I've read that if you create too many processes at once (more than the number of cores) then you get a backlog (this is a loop of 196 particles), so that might explain the lower speed than I was expecting, but it doesn't explain why I'm still not using all my cores!

Either way it's too slow! So is there a way I can create 8 processes, and only create a new one when there are fewer than 8 already running? Will that even solve my problem? And how do I use all of my cores, and why isn't this code doing that already?
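I'm guessing something like multiprocessing.Pool might be what I want. This sketch is just my guess at it (collide_task is a name I made up, and I'm assuming the particle objects can be pickled):

import multiprocessing

def collide_task(args):
    # Made-up module-level wrapper: on Python 2.7 a Pool can't be handed
    # a bound method like foo.CollideBall directly, because bound methods
    # can't be pickled.
    foo, ball, timestep = args
    return foo.CollideBall(ball, timestep)

pool = multiprocessing.Pool(processes=8)   # one worker per core
tasks = [(foo, ball, self.InternalTimestep)
         for ball in self.Objects if not foo == ball]
pool.map(collide_task, tasks)              # blocks until every task is done
pool.close()
pool.join()

Though I suspect any changes the workers make to foo wouldn't show up in the main process, since each worker gets its own copy of it. Maybe that's also why no collisions were detected in my version above?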

I only found out about multiprocessing in python yesterday, so I'm afraid any answers will have to be spelt out to me.

Thanks for any help though!

---EDIT---

In response to Carson, I tried adding p.join() directly after p.start() and it slowed the program right down. Instead of taking 0.2 seconds per cycle it's taking 24 seconds per cycle! I guess that's because join() makes the main process wait for each child to finish before the loop starts the next one, so they end up running one at a time.

jezza
    You *must* read about the GIL problem of Python if you want to speed up things using concurrency: http://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil – Alfe Nov 05 '13 at 09:31
  • I don't know about your code, but most of the times, a p.join() should be added. (http://docs.python.org/2.7/library/multiprocessing.html#multiprocessing.Process.join) – hellerve Nov 05 '13 at 09:34
  • @Alfe: the GIL only impacts threads (the threading module); the multiprocessing module circumvents this limitation by using processes (which are not subject to the GIL). – lucasg Nov 05 '13 at 09:35

2 Answers

4

As far as I understand, you test one particle against all others, and then perform that operation for each particle in turn. Based on this, I'd say your problem is that you're trying to spread the work across all cores before trying to optimize the algorithm itself.

Instead you could partition your particles so that you only check the ones that are close to each other. One possible means of doing so is a quad tree: see http://en.wikipedia.org/wiki/Quadtree.

In a second step you can parallelize everything. For quad trees you resolve the topmost level by hand and create a new process for each subtree. That way the processes are independent of each other and don't block. I'd expect a roughly quadratic speed-up from the quad tree itself (think of the square root of your current run time), which then enables a further linear speed-up (divide by the number of processes) through parallelization.

Sorry, I can't spell out a full implementation in Python.
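A bare-bones sketch of the partitioning idea, though, might look like this. The x/y attributes on particles, MAX_OBJECTS and leaf_groups are names I'm assuming here, and a real version also has to handle particles that straddle cell boundaries:

class QuadTree(object):
    """Point-region quad tree; bounds is (x, y, width, height)."""
    MAX_OBJECTS = 8   # max particles per leaf before it splits

    def __init__(self, bounds):
        self.bounds = bounds
        self.objects = []
        self.children = None   # becomes a list of four sub-quadrants

    def insert(self, particle):
        if self.children is not None:
            self._child_for(particle).insert(particle)
            return
        self.objects.append(particle)
        if len(self.objects) > self.MAX_OBJECTS:
            self._split()

    def _split(self):
        x, y, w, h = self.bounds
        hw, hh = w / 2.0, h / 2.0
        self.children = [QuadTree((x,      y,      hw, hh)),
                         QuadTree((x + hw, y,      hw, hh)),
                         QuadTree((x,      y + hh, hw, hh)),
                         QuadTree((x + hw, y + hh, hw, hh))]
        for p in self.objects:
            self._child_for(p).insert(p)
        self.objects = []

    def _child_for(self, p):
        x, y, w, h = self.bounds
        index = (1 if p.x >= x + w / 2.0 else 0) \
              + (2 if p.y >= y + h / 2.0 else 0)
        return self.children[index]

    def leaf_groups(self):
        # Every leaf's particle list is one independent group of
        # collision candidates.
        if self.children is None:
            return [self.objects]
        groups = []
        for child in self.children:
            groups.extend(child.leaf_groups())
        return groups

Each list returned by leaf_groups() is an independent group of collision candidates, so each group (or each top-level subtree) can be handed to its own process in the second step.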

No answer
  • Is that the only way I can use all cores? I have tried implementing quadtrees and haven't had much success – jezza Nov 05 '13 at 09:46
  • ok well I've now ironed out the bugs in a previous attempt to implement a quadtree, and it's sped it up a hell of a lot. But it's still only using one core, so what do I need to do now? – jezza Nov 05 '13 at 11:31
0

With a working quad tree, you could set up a thread pool (as a class) and define jobs (another class) that are allocated to individual threads (yet another class, ideally from a threading framework). In your case a job contains a list of quad tree nodes that have to be inspected. Initially, each top-level quad tree node (4 in 2D / 8 in 3D) resides in its own job.

So you can have up to 4 (respectively 8) threads, each of which inspects an independent subtree of the quad tree. If you need more threads to fully use your machine's processing power, you can have threads hand part of their jobs back to the thread pool when they encounter many deep subtrees.

For this, I'd use a BFS (breadth-first search) over the job's list of quad tree nodes. If the list gets longer than expected, I'd put part of it back into the thread pool. Some maths/statistics/stochastics knowledge helps in finding a good parameterization for what length is to be expected.
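A minimal sketch of that job idea, using a multiprocessing pool instead of hand-written ThreadPool/Job classes (as the GIL comment under the question explains, CPython threads won't occupy several cores anyway). inspect_job and the node attributes mirror the quad tree sketch from my first answer, the put-back heuristic is left out, and tree is assumed to be an already-built quad tree:

from multiprocessing import Pool

def inspect_job(nodes):
    # BFS over the job's list of quad tree nodes, collecting the leaf
    # cells whose particle lists then have to be collision-tested.
    pending = list(nodes)
    leaves = []
    while pending:
        node = pending.pop(0)           # breadth first: take from the front
        if node.children is not None:
            pending.extend(node.children)
        else:
            leaves.append(node.objects)
    return leaves

# One job per top-level quadrant, so the jobs never block each other:
pool = Pool(processes=4)                # 4 in 2D, 8 for an octree in 3D
jobs = [[child] for child in tree.children]
leaf_groups = pool.map(inspect_job, jobs)
pool.close()
pool.join()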

I've also written a quad tree implementation that parameterizes itself according to the expected number of objects, given the "world" size, and by calculating the average object size.

Search for the open source project d-collide. Although it's in C++, there should be some useful sample code. But please mind its licensing, which doesn't ask for much, as it's BSD-style.

I added this as a second answer because the first one was about optimizing your code to achieve your implied goal: a better run time (albeit via better efficiency).

This second answer is about achieving your written goal: stronger parallelization. However, while the quad tree is what enables this second step, don't expect the second speed-up to be as large as the first. Especially when it comes to many objects, nothing beats an optimized algorithm. And don't lose yourself in micro-optimizations: see the runtime discussion in Cancelling a Task is throwing an exception

No answer
  • I just noticed a related question whose answer removes the need for self-implemented ThreadPool and Job classes. Obviously there's something like multiprocessing.Queue for that, see: http://stackoverflow.com/questions/2359253/solving-embarassingly-parallel-problems-using-python-multiprocessing?rq=1 – No answer Nov 05 '13 at 17:09
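For reference, the Queue-based pattern from that linked question looks roughly like this (worker, inspect_node and top_level_nodes are hypothetical names; a None sentinel per worker is one common way to shut the workers down):

from multiprocessing import Process, Queue

def worker(tasks, results):
    # Pull quad tree nodes off the task queue until the None sentinel
    # arrives; inspect_node is a stand-in for the actual collision tests.
    for node in iter(tasks.get, None):
        results.put(inspect_node(node))

if __name__ == '__main__':
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results))
               for _ in range(8)]
    for w in workers:
        w.start()
    for node in top_level_nodes:        # e.g. the quad tree's quadrants
        tasks.put(node)
    for w in workers:
        tasks.put(None)                 # one sentinel per worker
    collisions = [results.get() for _ in top_level_nodes]
    for w in workers:
        w.join()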