
I have a program that is currently quite slow and CPU-heavy, and I'm confused about how to parallelise it.

My Problem:

The algorithm works like this (with random numbers):

import numpy as np
import scipy.spatial as spatial
from collections import Counter

# Particle dictionary
num_parts = int(1e4)
particles = {'coords': np.random.rand(num_parts, 3),
             'flags': np.random.randint(0, 4, size=num_parts)}

# Object dictionary
num_objects = int(1e3)
objects_dict = {'x': [np.random.rand(1)
                      for i in range(0, num_objects)],
                'y': [np.random.rand(1)
                      for i in range(0, num_objects)],
                'z': [np.random.rand(1)
                      for i in range(0, num_objects)],
                'prop_to_calc': [np.array([])
                                 for i in range(0, num_objects)]}

# Build KDTree for particles
tree = spatial.cKDTree(particles['coords'])

# Loop through objects to calculate 'prop_to_calc' based
# on nearby particles flags within radius, r.
r = 0.1
for i in range(0, num_objects):

    # Find nearby indices of particles.
    indices = tree.query_ball_point([objects_dict['x'][i][0],
                                     objects_dict['y'][i][0],
                                     objects_dict['z'][i][0]],
                                    r)

    # Extract flags of nearby particles.
    flags_array = particles['flags'][indices]

    # Find the most common flag and store it in 'prop_to_calc'
    c = Counter(flags_array)
    objects_dict['prop_to_calc'] = np.append(objects_dict['prop_to_calc'],
                                             c.most_common(1)[0][0])

There are two data sets, particles and objects_dict. I want to calculate objects_dict['prop_to_calc'] by searching for nearby particles within a radius r and finding their most common flag. This is done with cKDTree and query_ball_point.

For these numbers the timing is 10 loops, best of 3: 55.8 ms per loop in IPython 4.2.0.

However, I want num_parts=1e6 and num_objects=1e5, which results in a serious slowdown.

My question:

As it is CPU-heavy, I want to try to parallelise it to get some speedup.

I have looked at both multiprocessing and multithreading. However, the docs confuse me quite a lot, and I'm not sure how to apply the examples to this problem.

Specifically, I'm concerned with how I share both dictionaries between processes and write into objects_dict at the end.

Thanks for the help in advance.

j.arthur5
  • If you want speed, try out the python-compiler Cython or just write your programs in C(++). – cmdLP Jun 06 '17 at 11:56
  • I have thought about Cython. However, I wanted to first see whether it is possible to use parallel computing for my example. – j.arthur5 Jun 06 '17 at 12:59

1 Answer


Try to split the work into chunks and give each thread its own local lists and dictionaries, then union them at the end. Import threading, define a class CalculationThread that subclasses threading.Thread, write the parallel code in its run method, and start each thread with thread = CalculationThread() followed by thread.start().

The final code for this is:

import threading

class CalculationThread(threading.Thread):
    def __init__(self, index_range):
        # initialise the base Thread class
        threading.Thread.__init__(self)
        self.index_range = index_range
        # use this private copy of the dictionary inside the thread
        self.objects_dict = dict(objects_dict)

    def run(self):
        for i in self.index_range:
            pass  # your per-object code from the question goes here

threads = []
threadCount = 8

lastThreadStart = num_objects
tLen = num_objects // threadCount
for i in reversed(range(threadCount)):
    threadStart = i * tLen
    t = CalculationThread(range(threadStart, lastThreadStart))
    t.start()
    threads.append(t)
    lastThreadStart = threadStart

# wait for all to finish
for t in threads: t.join()

# code to union all objects_dicts comes here
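One possible way to do that union (a sketch only, assuming run() appends each result to the thread's own self.objects_dict['prop_to_calc'] exactly as in the question's loop) is to concatenate the per-thread arrays. The threads were created from the last chunk down to the first, so the list is reversed to restore object order:

import numpy as np

# Assumes each thread's run() built up its own
# self.objects_dict['prop_to_calc'] array for its index range.
# Threads were created from the highest range down to the lowest,
# so reverse the list to put the results back in ascending object order.
objects_dict['prop_to_calc'] = np.concatenate(
    [t.objects_dict['prop_to_calc'] for t in reversed(threads)])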
cmdLP
  • This will not speed-up the processing one iota, in fact splitting the execution to threads will actually **slow down** the processing as now not only does Python have to calculate your results, but it also has to deal with GIL and context switches. – zwer Jun 06 '17 at 11:51
  • You're making no sense. Pretty much every standalone Python interpreter (CPython, PyPy...) treats threads and processes differently - threads execute within the same process and are walled off from each other merely by GIL - this allows them to share memory, but having GIL and thread scheduling in place actually slows down the execution as the interpreter has to do extra work. It's useful when your code does a lot of waiting for external events (I/O for example) but for calculations it will **always** execute slower than running everything in a single thread. – zwer Jun 06 '17 at 12:03
  • I think if you put the code into a function in a string and exec it, the bytecode is generated separately per thread, so it should not be the same bytecode and then it should not be affected by the GIL. – cmdLP Jun 06 '17 at 12:12
  • You think wrong. Unless you launch sub-processes (in which case why not go with the `multiprocessing` module in the first place) for each code snippet, it will all run in the same process. And if you launch different processes, you won't be able to share memory (and, btw., that's precisely what the `multiprocessing` module does) – zwer Jun 06 '17 at 12:48
  • @zwer if I understand correctly you can share read-only memory between a Pool of processes, as shown in the [docs](https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes). Can you share memory that each process can write to and combine at the end? – j.arthur5 Jun 06 '17 at 13:05
  • @j.arthur5 - it's not sharing the memory - what's hidden away is that `multiprocessing.Process` actually pickles all multiprocessing-unaware arguments. You can pass a `multiprocessing.Lock` around as it just carries an identifier to recreate itself at the end (essentially to re-bind to the underlying OS _semaphore_) but your data doesn't have that luxury - it will be pickled and your only way to 'share' it is to be aware of that (check [this](https://stackoverflow.com/a/44186168/7553525) as an example). You can use `multiprocessing.Manager` to pass data around if you need it bi-directional. – zwer Jun 06 '17 at 14:44
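Following up on the multiprocessing discussion in the comments above, here is a minimal, untested sketch of a Pool-based alternative. It assumes the setup from the question (particles, objects_dict, tree and r) has already been executed at module level, that a fork-based start method is available so each worker inherits read-only copies of the tree and the particle flags rather than receiving them through pickling, and most_common_flag is a hypothetical helper name:

import multiprocessing as mp
import numpy as np
from collections import Counter

def most_common_flag(point):
    # relies on the module-level tree, particles and r from the question;
    # with a fork start method each worker gets its own read-only copy
    indices = tree.query_ball_point(point, r)
    return Counter(particles['flags'][indices]).most_common(1)[0][0]

if __name__ == '__main__':
    # stack the object coordinates into a (num_objects, 3) array
    points = np.column_stack([np.concatenate(objects_dict['x']),
                              np.concatenate(objects_dict['y']),
                              np.concatenate(objects_dict['z'])])

    pool = mp.Pool(processes=4)
    # each worker returns one flag; results come back in object order
    results = pool.map(most_common_flag, points)
    pool.close()
    pool.join()

    objects_dict['prop_to_calc'] = np.array(results)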