
Consider two C++ classes. The first one is a branch class:

class Branch{
  map<string,double> properties;
};

i.e. a branch object is characterised only by its properties, which are stored in a map. Each property has a name and is associated with a double value. The second is a tree class, composed of many branches:

class Tree{
  vector<Branch*> tree;

  void addBranch(int index);    // adds a branch to the tree vector at index
  void removeBranch(int index); // removes the branch at index and its descendants
  double getProperty(int index, string name); // gets the value of property `name` of the branch at index
  void addProperty(int index, string name, double value);
  void setProperty(int index, string name, double value);
};
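For illustration, a minimal pure-Python stand-in with the same interface (hypothetical; the real `PyTree` comes from the Cython wrapper described below, and this sketch omits the descendant-removal logic) could look like:

```python
# Hypothetical pure-Python stand-in for the wrapped Tree class,
# just to make the snippets below runnable without the C++ library.
class PyTree:
    def __init__(self):
        self.branches = {}  # index -> {property name: value}

    def addBranch(self, index):
        self.branches[index] = {}

    def removeBranch(self, index):
        self.branches.pop(index, None)  # descendants not handled in this sketch

    def addProperty(self, index, name, value):
        self.branches[index][name] = value

    def getProperty(self, index, name):
        return self.branches[index][name]

    def setProperty(self, index, name, value):
        self.branches[index][name] = value
```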

Now assume that the tree class is wrapped using Cython. Then, in Python, we can manipulate a PyTree object: add and remove branches, and manipulate the properties of every branch. Consider the following Python program:

tree = PyTree()
for i in range(TotalTime):
    k = random.random()
    if k > 0.1:
        tree.addBranch(random_index) # it is not important how we get the index
        tree.addProperty(random_index, 'prop1', 1)
        tree.addProperty(random_index, 'prop2', 1)
    k = random.random()
    if k > 0.9:
        tree.removeBranch(random_index)
    for j in range(NumberOfBranches): # it's not important how we get the number of branches
        operation1(j, 'prop1') # assume these functions were defined
        operation2(j, 'prop2')

In this program I add and remove branches randomly. Each branch has two properties, prop1 and prop2. There is an operation1 function performing an operation involving the getProperty and setProperty methods on 'prop1', and an operation2 function doing the same on 'prop2'.
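As a purely illustrative sketch (the real operations are defined elsewhere and call the wrapped C++ methods), an operation might read a property, apply a simple growth increment, and write it back; here a plain dict stands in for one branch's property map:

```python
# Hypothetical sketch of an operation: read a property, apply a simple
# growth step, write it back. `props` mimics one branch's property map.
def operation1(props, name, rate=0.1):
    value = props[name]               # getProperty
    props[name] = value * (1 + rate)  # setProperty

branch = {"prop1": 1.0, "prop2": 1.0}
operation1(branch, "prop1")
```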

What I want is to have one processor (or thread) performing each operation. Since the program is continuously calling an external C++ library, should I use threading instead of multiprocessing?

How should I implement the parallelization? I tried to do it taking inspiration from https://www.quantstart.com/articles/parallelising-python-with-threading-and-multiprocessing, but with either threading or multiprocessing I get a slower program...

bengo
  • Are `operationX` Python functions, Cython functions, C++ functions or something else? And is it OK to start `operation1` for `j+1` while `operation2` for `j` is still going (and vice versa)? – DavidW Mar 13 '17 at 19:06
  • @DavidW `operationX` are Python functions using wrapped C++ methods and performing a simple calculation. Starting `operation1` for `j+1` while `operation2` for `j` is still going could be considered. The order in which we traverse the tree is not important at all. In fact it would be even best if the calculations for each branch were performed in parallel! Imagine that `prop1/2` are length and radius of a branch, and `operation1/2` axial and radial growth mechanisms. The growth of each branch should be calculated in parallel if we seek the maximum biological relevance. – bengo Mar 14 '17 at 09:32
  • @DavidW Of course, properties are not completely independent (for ex: the ratio length/radius might be inside some range), and neither are the branches: the growth of one branch influences the growth of the others (by putting in the shade other branches for example). – bengo Mar 14 '17 at 09:35

1 Answer


I'd recommend using threading: since the main work is being done in your wrapped C++ functions, you should be able to release the GIL and have everything work in parallel reliably.

A good general rule is to create new threads as few times as possible (thread creation can be a somewhat slow operation) and then feed them data until you're done. You say in the comments that you aren't at all concerned about the order the operations are performed in (good!). With that in mind, I'm going to suggest feeding lambda functions containing the operations into a Queue and having the threads pick them off and run them.

The following code then goes inside your `for i in range(TotalTime):` loop, in place of the `for j in range(NumberOfBranches):` loop:

from queue import Queue
import threading

q = Queue()

def thread_func():
    """A function for each thread to run.

    It waits to get an item off the queue and then runs that item.
    None is used to indicate that we're done. If we get None, we
    put it back on the Queue to ensure every thread terminates."""
    while True:
        f = q.get()
        if f is None:
            q.put(None)
            return
        f()

# make and start the threads.
# "no_threads" should be something similar to the number of cores you have.
# 4-8 might be a good number to try?
threads = [threading.Thread(target=thread_func) for n in range(no_threads)]
[t.start() for t in threads]

# put the required operations on the Queue
for j in range(NumberOfBranches):
    # note the awkward default-argument syntax to
    # ensure we capture j: http://stackoverflow.com/a/7514158/4657412
    q.put(lambda j=j: operation1(j, "prop1"))
    q.put(lambda j=j: operation2(j, "prop2"))

q.put(None) # to terminate

# wait for threads to finish
[t.join() for t in threads]

In order to allow the threads to actually work in parallel you'll need to ensure that the GIL is released inside your Cython wrapper:

def operation1(a, b):
    # note: cplusplus_operation1 must be declared `nogil` in its Cython
    # definition and be called with C types for this to compile
    with nogil:
        cplusplus_operation1(a, b)

I have two concerns about using multiprocessing:

  1. It isn't hugely portable (it works differently on Windows).
  2. If operation1/2 modifies the C++ data, then you may find the modified data isn't shared across processes without special effort on your part (which would defeat the point of doing the operation!).
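The second concern can be sketched with hypothetical data and only the standard library: each process gets its own copy of the data, so a child's modification is invisible to the parent.

```python
import multiprocessing as mp

data = {"prop1": 1.0}

def modify(d):
    d["prop1"] = 2.0  # changes only the child process's copy

if __name__ == "__main__":
    p = mp.Process(target=modify, args=(data,))
    p.start()
    p.join()
    # The parent's dict is unchanged: each process has its own memory.
    print(data["prop1"])
```

Sharing such modifications back would need something like `multiprocessing.Manager` or explicit inter-process communication, which adds exactly the overhead the parallelization is meant to avoid.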
DavidW
  • Thanks for your answer. I will try what you suggest. I also tried to parallelize by dividing the N branches between 8 cores. When the number of branches is large (~10⁶) and `operation1/2` is costly (ex: matrix inversion), `multiprocessing` goes much faster than `threading`. However, I'm interested in performing lots of very simple calculations (property variations will be mostly driven by simple differential equations) and someone advised me to use GPU parallelism with CUDA. What do you think? – bengo Mar 15 '17 at 10:36
  • My belief is that the main differences in speed between `multiprocessing` and `threading` are 1) communication between processes is slower in `multiprocessing` 2) `multiprocessing` is faster if you're using Python since `threading` only works well when you can release the GIL (e.g. when calling C++). – DavidW Mar 15 '17 at 10:45
  • It sounds like CUDA might be a good option. `numba` for Python is a moderately simple way to use it, so perhaps give that a look? Transferring data to/from the GPU is always somewhat slow if you need to do it a lot. Most non-specialist GPUs are much better at `float` than `double`, so think about what precision you need. – DavidW Mar 15 '17 at 10:48
  • Finally I get better results using CUDA with numba package. However, I thank you for the answer since it shed light on how to use multithreading with Python. – bengo Mar 17 '17 at 10:13
  • why should number of threads be similar to the number of cores I have? – bengo Mar 17 '17 at 14:35
  • Each core can run one thread at once. Any fewer threads and you won't be using all the available resources; there's also a time cost to switching threads, so you don't want too many more either. I think some CPUs have [a feature where they can keep two threads running per core efficiently](https://en.wikipedia.org/wiki/Hyper-threading), with some gains from quickly switching if one thread hits a bottleneck. Therefore, having the number of threads be 1-2 times the number of cores is probably a good place to be. – DavidW Mar 17 '17 at 14:46
  • I wouldn't worry about making it match exactly, but I'd guess that starting hundreds or thousands of threads will make things noticeably worse. – DavidW Mar 17 '17 at 14:49
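The heuristic from these comments can be sketched as follows (the cap of 16 is purely illustrative, not a recommendation from the answer):

```python
import os

# One to two threads per core, with an illustrative cap to avoid
# starting hundreds of threads on machines with many cores.
cores = os.cpu_count() or 1  # cpu_count() can return None
no_threads = min(2 * cores, 16)
print(no_threads)
```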