
I'm trying to improve the performance of some metric computations with Cython's prange. Here is my code:

def shausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n  = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity
    XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    for i in range(n):
        hdist[i] = _hausdorff(XA, XB[i])
    return hdist

def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n  = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity (EDITED)
    cdef float64_t[:,:,::1] XC = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])
    return hdist

Basically, in each iteration the hausdorff metric is computed between XA and each XB[i]. Here is the signature of the _hausdorff function:

cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    ...

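The body of _hausdorff is omitted for brevity. A minimal sketch in the same style (not my exact implementation, and assuming float64_t is the ctypedef that comes from cimporting numpy) would be:

from libc.math cimport sqrt, INFINITY
from numpy cimport float64_t  # assumption: float64_t ctypedef from numpy.pxd

cdef inline float64_t _directed(float64_t[:,::1] X, float64_t[:,::1] Y) nogil:
    # largest squared distance from a point (row) in X to its nearest point in Y
    cdef:
        Py_ssize_t i, j, k
        Py_ssize_t nx = X.shape[0], ny = Y.shape[0], d = X.shape[1]
        float64_t cmax = 0.0, cmin, dist, diff
    for i in range(nx):
        cmin = INFINITY
        for j in range(ny):
            dist = 0.0
            for k in range(d):
                diff = X[i, k] - Y[j, k]
                dist += diff * diff
            if dist < cmin:
                cmin = dist
        if cmin > cmax:
            cmax = cmin
    return cmax

cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil:
    # symmetric Hausdorff distance: the larger of the two directed distances
    cdef float64_t d_ab = _directed(XA, XB)
    cdef float64_t d_ba = _directed(XB, XA)
    return sqrt(d_ab if d_ab > d_ba else d_ba)
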
My problem is that the sequential shausdorff and the parallel phausdorff have the same timings. Furthermore, it seems that phausdorff is not creating any threads at all.

So my question is: what is wrong with my code, and how can I fix it to get threading working?

Here is my setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext

ext_modules=[
    Extension("custom_metric",
              ["custom_metric.pyx"],
              libraries=["m"],
              extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ],
              extra_link_args=['-fopenmp']
              ) 
]

setup( 
  name = "custom_metric",
  cmdclass = {"build_ext": build_ext},
  ext_modules = ext_modules
) 
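
(For reference, the extension defined by this setup.py can be built in place with `python setup.py build_ext --inplace`.)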

EDIT 1: Here is a link to the html generated by cython -a: custom_metric.html

EDIT 2: Here is an example of how to call the corresponding functions (you need to compile the Cython file first):

import custom_metric as cm
import numpy as np

XA = np.random.random((9000, 210))
XB = np.random.random((1000, 210, 9))

#timing 'parallel' version
%timeit cm.phausdorff(XA, XB)

#timing sequential version
%timeit cm.shausdorff(XA, XB)
mavillan
  • Have you tried printing the equivalent to `omp_get_thread_num()` within the loop-body of the `prange`. See http://cython.readthedocs.io/en/latest/src/userguide/parallelism.html – Harald Aug 19 '16 at 08:09
  • Could be that `XB` is a Python object? Run `cython -a custom_metric.pyx` with annotation. – cgohlke Aug 19 '16 at 08:54
  • Is there any change if `phausdorff` is decorated with `@cython.boundscheck(False)` and `@cython.wraparound(False)`? – J.J. Hakala Aug 19 '16 at 09:05
  • @cgohlke I have changed that, explicitly defining it as a memory view, and there is no change. @J.J. Hakala the directives `#cython: boundscheck=False` and `#cython: wraparound=False` are included at the beginning of the file. – mavillan Aug 19 '16 at 15:06
  • @Harald when I try to run that, Cython says: `Converting to Python object not allowed without gil` – mavillan Aug 19 '16 at 15:16
  • @mavillan Are you using OSX? The information I can see on the internet suggests that the version of clang they supply doesn't support OpenMP. See http://stackoverflow.com/questions/33668323/clang-omp-in-xcode-under-el-capitan possibly, but I'm not using OSX myself so can't confirm... – DavidW Aug 22 '16 at 20:20
  • @DavidW no, I'm using Ubuntu 16.04 with Anaconda environment. – mavillan Aug 22 '16 at 20:22
  • @mavillan, can you provide a small example on how to call `shausdorff` and `phausdorff`? – Harald Aug 24 '16 at 09:46
  • @Harald I've added an example on how to use the functions. – mavillan Aug 27 '16 at 02:41
  • Have you considered using a greater `chunksize`? Depending on the algorithm, you can get very poor data locality using a `chunksize` of `1`. – smateo Aug 27 '16 at 09:35

1 Answer


I think the parallelization is working, but the extra overhead of parallelization is eating up the time it would otherwise save. If I try with differently sized arrays, I do begin to see a speed-up in the parallel version:

XA = np.random.random((900, 2100))
XB = np.random.random((100, 2100, 90))

Here the parallel version takes ~2/3 of the time of the serial version for me, which certainly isn't the 1/4 you'd expect, but does at least show some benefit.
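
If you want to confirm that threads really are being created (as suggested in the comments), one approach that avoids the "Converting to Python object not allowed without gil" error is to record the thread id of each iteration into an array instead of printing it. A minimal sketch, assuming the same build setup as your module:

import numpy as np
from cython.parallel import parallel, prange, threadid

def check_threads(Py_ssize_t n):
    cdef:
        Py_ssize_t i
        int[::1] tid = np.zeros(n, dtype=np.intc)
    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            tid[i] = threadid()  # plain C int, so no GIL is needed
    # more than one distinct value in the result means OpenMP threads were created
    return np.asarray(tid)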


One improvement I can offer is to replace the code that fixes contiguity:

XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)]) 

with

XB = np.ascontiguousarray(np.transpose(XB,[2,0,1]))

This speeds up both the parallel and non-parallel functions fairly significantly (by a factor of about 2 with the arrays you originally gave). It does make it slightly more obvious that you're being slowed down by overhead in the prange: the serial version is actually faster for the arrays in your example.
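
If you want to check that the two contiguity fixes produce exactly the same array, a quick NumPy-only sanity check (using the shapes from your example) is:

import numpy as np

XB = np.random.random((1000, 210, 9))
n = XB.shape[2]

via_list = np.asanyarray([np.ascontiguousarray(XB[:, :, i]) for i in range(n)])
via_transpose = np.ascontiguousarray(np.transpose(XB, [2, 0, 1]))

assert via_transpose.flags['C_CONTIGUOUS']
assert np.array_equal(via_list, via_transpose)  # same values, shape (9, 1000, 210)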

DavidW
  • (Posted as community wiki since this doesn't present a solution so I wanted to remove it from contention for the bounty) – DavidW Aug 27 '16 at 08:20