
Is there a way the code below can be parallelized? I looked into Cython's prange, but couldn't figure out how it works. Does prange parallelize the internal loops across different cores? For the code below, how can I parallelize it?

# NumPy and the cython module are needed for np.asarray and the decorator;
# draw_topic is assumed to be defined elsewhere in this module.
cimport cython
import numpy as np

@cython.boundscheck(False)
def gs_iterate_once(double[:,:] doc_topic,
                    double[:,:] topic_word,
                    double[:] topic_distribution,
                    double[:] topic_probabilities,
                    unsigned int[:,:] doc_word_topic,
                    int num_topics):
  cdef unsigned int doc_id
  cdef unsigned int word_id
  cdef unsigned int topic_id
  cdef unsigned int new_topic
  cdef Py_ssize_t i
  cdef int j
  for i in xrange(doc_word_topic.shape[0]):
    doc_id = doc_word_topic[i, 0]
    word_id = doc_word_topic[i, 1]
    topic_id = doc_word_topic[i, 2]
    # Remove the current assignment from the counts
    doc_topic[doc_id, topic_id] -= 1
    topic_word[topic_id, word_id] -= 1
    topic_distribution[topic_id] -= 1
    # Unnormalized sampling probability for each topic
    for j in xrange(num_topics):
      topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]

    new_topic = draw_topic(np.asarray(topic_probabilities))
    # Add the new assignment back into the counts
    doc_topic[doc_id, new_topic] += 1
    topic_word[new_topic, word_id] += 1
    topic_distribution[new_topic] += 1
    # Set the new topic
    doc_word_topic[i, 2] = new_topic

2 Answers


prange uses OpenMP, which is indeed shared-memory parallelism. So, on a single computer it will create threads that run on the different cores available, all with access to the same pool of memory.
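
As a point of reference, here is a minimal sketch of a prange loop (my own illustration, not code from the question; the name scale_inplace is made up). Each iteration touches only element i and the body needs no Python objects, so the work can be split across OpenMP threads without the GIL.

    cimport cython
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def scale_inplace(double[:] x, double factor):
        # Each iteration writes only x[i]; the iterations are fully independent,
        # so prange can split the index range across OpenMP threads/cores.
        # (The extension module must be compiled with OpenMP, e.g. -fopenmp.)
        cdef Py_ssize_t i
        for i in prange(x.shape[0], nogil=True):
            x[i] = x[i] * factor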

For the routine that you show, the first step is to understand what part can be parallelized. Typically, if the loop over index i reads and writes only element i of the data (and not, say, element i-1 or i+1), the problem is parallelizable. That is not the case here, so you need to find a way to make the computation more independent.
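
For contrast, here is a hypothetical loop that cannot be handed to prange unchanged, because each iteration reads the element written by the previous one:

    def running_sum(double[:] x):
        # Loop-carried dependency: iteration i needs the value iteration i-1
        # just wrote into x[i-1], so the iteration order matters and a naive
        # prange over i would give wrong results.
        cdef Py_ssize_t i
        for i in range(1, x.shape[0]):
            x[i] = x[i] + x[i - 1]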

Actually finding the specific parallel pattern is beyond the scope of an SO answer, but I'll mention a few tips:

  1. What is inside the prange must be fully cythonized: Python calls are not possible in a plain nogil thread. (Addition suggested by @DavidW: Python calls are possible when they are part of a with gil block; see the sketch after this list.)
  2. A common piece of advice: once your code has been made independent of the loop ordering, check whether your results stay the same when running the index from n-1 down to 0 instead of from 0 to n-1.
  3. A few commented and illustrative examples:
     • https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html
     • Cython prange slower for 4 threads then with range
     • http://nealhughes.net/parallelcomp2/
     • http://www.perrygeo.com/parallelizing-numpy-array-loops-with-cython-and-mpi.html
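
To illustrate tip 1, here is a sketch (my own example, not code from the question) of a prange body that is mostly C-level work, with an occasional Python call wrapped in a with gil block as @DavidW suggests. The threads serialize while inside that block, so it should stay small.

    cimport cython
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def double_and_report(double[:] scores):
        cdef Py_ssize_t i
        for i in prange(scores.shape[0], nogil=True):
            scores[i] = scores[i] * 2.0          # pure C work, runs in parallel
            if scores[i] > 1e6:
                with gil:                        # temporarily re-acquire the GIL
                    print("large value at index", i)
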
Pierre de Buyl
  • Re point 1 - can I suggest an addition: Python calls are allowed if you put them within a `with gil:` block. Provided this is a small part of the loop, the performance cost isn't too bad. – DavidW May 05 '17 at 09:31

@PierredeBuyl's answer gives a good outline of what prange does and how to use it.

What follows is a few more specific comments relating to your code:

  • You can't parallelize the outer loop, because of lines like:

    doc_topic[doc_id, topic_id] -= 1

    and the similar ones for the other variables and for the += 1 updates. These modify variables that are shared between all the loop iterations, and running the iterations concurrently will give inconsistent results.

  • A similar problem exists with topic_probabilities[j] = ... if you're parallelizing the outer loop.

  • You could easily parallelize the inner loop for j in xrange(num_topics): - this only modifies data that depends on the index j, so there's no issue with the threads fighting to modify the same elements (see the sketch below). However, there's a performance cost each time you start a multithreaded region, so you usually try to parallelize the outer loop instead to avoid this; depending on the size of the arrays, you may not gain much.
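
To make that last point concrete, here is a sketch of the inner loop pulled out into its own prange-based helper (the function name compute_topic_probabilities and the cdivision directive are my additions; the loop body itself mirrors the question). Whether this pays off depends on num_topics, since the thread team is started once per word occurrence.

    cimport cython
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.cdivision(True)   # skip the Python zero-division check inside the nogil loop
    def compute_topic_probabilities(double[:,:] doc_topic,
                                    double[:,:] topic_word,
                                    double[:] topic_distribution,
                                    double[:] topic_probabilities,
                                    unsigned int doc_id,
                                    unsigned int word_id,
                                    int num_topics):
        # Each iteration writes only topic_probabilities[j], so the threads
        # never touch the same element and no synchronization is needed.
        cdef int j
        for j in prange(num_topics, nogil=True):
            topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]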

DavidW