Consider two ways of calculating random numbers, one in one thread and one multithread using cython prange with openmp:
def rnd_test(long size1):
cdef long i
for i in range(size1):
rand()
return 1
and
def rnd_test_par(long size1):
cdef long i
with nogil, parallel():
for i in prange(size1, schedule='static'):
rand()
return 1
Function rnd_test is first compiled with the following setup.py
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = 'Hello world app',
ext_modules = cythonize("cython_test.pyx"),
)
rnd_test(100_000_000) runs in 0.7s.
Then, rnd_test_par is compiled with the following setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
ext_modules = [
Extension(
"cython_test_openmp",
["cython_test_openmp.pyx"],
extra_compile_args=["-O3", '-fopenmp'],
extra_link_args=['-fopenmp'],
)
]
setup(
name='hello-parallel-world',
ext_modules=cythonize(ext_modules),
)
rnd_test_par(100_000_000) runs in 10s!!!
Similar results are obtained using cython within ipython:
%%cython
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand
def rnd_test(long size1):
cdef long i
for i in range(size1):
rand()
return 1
%%timeit
rnd_test(100_000_000)
1 loop, best of 3: 1.5 s per loop
and
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand
def rnd_test_par(long size1):
cdef long i
with nogil, parallel():
for i in prange(size1, schedule='static'):
rand()
return 1
%%timeit
rnd_test_par(100_000_000)
1 loop, best of 3: 8.42 s per loop
What am I doing wrong? I am completely new to cython, this is my second time using it. I had a good experience last time so I decided to use for a project with monte-carlo simulation (hence the use of rand).
Is this expected? Having read all the documentation, I think prange should work well in an embarrassingly parallel case like this. I don't understand why this is failing to speed up the loop and even making it so much slower.
Some additional information:
- I am running python 3.6, cython 0.26.
- gcc version is "gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
- CPU usage confirms the parallel version is actually using many cores (90% vs 25% of the serial case)
I appreciate any help you can provide. I tried first with numba and it did speed up the calculation but it has other problems that make me want to avoid it. I'd like Cython to work in this case.
Thanks!!!