
I just wrote a trivial program to test how Cython's prange performs. Here is the code:

from cython.parallel import prange
import numpy as np

def func(int r, int c):
  cdef:
    # allocate an r x c array and view it through a typed memoryview
    double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
    double total = 0
    int i, j

  # parallel outer loop; 'total' becomes an OpenMP reduction variable
  # because it is only updated in-place inside prange
  for i in prange(r, nogil=True, schedule='static', chunksize=1):
    for j in range(c):
      total += a[i,j]

  return total
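
(Note: prange only runs in parallel if the extension is compiled with OpenMP enabled. A minimal setup.py sketch for that, with a placeholder module name and GCC-style flags; Apple's stock clang needs extra steps for OpenMP:)

from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "prange_test",                      # placeholder module/file name
    ["prange_test.pyx"],
    extra_compile_args=["-fopenmp"],    # enable OpenMP at compile time
    extra_link_args=["-fopenmp"],       # and at link time
)

setup(ext_modules=cythonize([ext]))

(Built with python setup.py build_ext --inplace.)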

On a MacBook Pro with OMP_NUM_THREADS=3, the above code takes almost 18 sec for (r, c) = (10000, 100000); with a single thread it takes about 21 sec.

Why is there so little performance boost? Am I using prange correctly?

  • Turn off bounds checking (`boundscheck`) first, experiment with different schedule types and chunksizes, and see what you get. Have a look at [this](http://stackoverflow.com/questions/33193851/cython-prange-slower-for-4-threads-then-with-range/33204449#33204449) question. You are performing a sum reduction, and compilers do a great job of optimizing such loops in serial mode; with `OpenMP` pragmas, the resulting assembly code may not be as optimal in this case. – romeric Dec 24 '15 at 15:14
  • @romeric, I tried those; the results are no better at all. – avocado Dec 25 '15 at 01:25

1 Answer


Have you timed how long it takes just to allocate `a`? A 10000 x 100000 float64 array takes up 8 GB of memory.

a = np.ones((10000, 100000), np.double)

takes over six seconds on my laptop with 16 GB of RAM. If you don't have 8 GB free, you'll hit swap and it will take a lot longer. Since `func` spends almost all of its time just allocating `a`, parallelizing the outer for loop can therefore only gain you a small fractional improvement on the total runtime.
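
You can check this directly in an IPython session of your own (the array alone needs ~8 GB of free RAM):

In [1]: import numpy as np

In [2]: %time a = np.ones((10000, 100000), np.double)   # ~6 s on the laptop above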

To demonstrate this, I have modified your function to accept `a` as an input. In `tmp.pyx`:

#cython: boundscheck=False, wraparound=False, initializedcheck=False

from cython.parallel cimport prange

def serial(double[:, :] a):
    """Sum all elements of a in a single thread."""
    cdef:
        double total = 0
        int i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

def parallel(double[:, :] a):
    """Sum all elements of a across OpenMP threads."""
    cdef:
        double total = 0
        int i, j
    # 'total' is only updated in-place inside prange, so Cython
    # treats it as an OpenMP reduction variable
    for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

For example:

In [1]: import numpy as np

In [2]: import tmp

In [3]: r, c = 10000, 100000

In [4]: a = np.random.randn(r, c)   # this takes ~6.75 sec

In [5]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop

In [6]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop

Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but the time saved is still only a small fraction of the time taken to allocate `a`.

The lesson here is to always profile your code to understand where it spends most of its time before you dive into optimizations.
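
For instance, a crude manual profile of what `func` does, splitting the allocation from the summation (this sketch assumes you have ~8 GB of RAM free):

import time
import numpy as np

r, c = 10000, 100000

t0 = time.perf_counter()
a = np.arange(r * c, dtype=np.double).reshape(r, c)  # allocation dominates: ~6-7 s above
t1 = time.perf_counter()

total = a.sum()                                      # single-threaded sum: ~1 s above
t2 = time.perf_counter()

print(f"allocate: {t1 - t0:.2f} s, sum: {t2 - t1:.2f} s")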


* You could do a little better by passing larger chunks of `a` to each worker thread, e.g. by increasing chunksize or using schedule='guided'.
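
For example, the loop header in parallel would then read (one possible variant; not benchmarked here):

for i in prange(a.shape[0], nogil=True, schedule='guided'):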

– ali_m