Prange slowing down Cython loop

Question

Consider two ways of generating random numbers, one single-threaded and one multithreaded using Cython's prange with OpenMP:

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1

and

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1

The function rnd_test is first compiled with the following setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(
  name = 'Hello world app',
  ext_modules = cythonize("cython_test.pyx"),
)

rnd_test(100_000_000) runs in 0.7s.

Then rnd_test_par is compiled with the following setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "cython_test_openmp",
        ["cython_test_openmp.pyx"],
        extra_compile_args=["-O3", '-fopenmp'],
        extra_link_args=['-fopenmp'],
    )

]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)

rnd_test_par(100_000_000) runs in 10s!!!

Similar results are obtained using cython within ipython:

%%cython
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1

%%timeit
rnd_test(100_000_000)

1 loop, best of 3: 1.5 s per loop

and

%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1

%%timeit
rnd_test_par(100_000_000)

1 loop, best of 3: 8.42 s per loop

What am I doing wrong? I am completely new to Cython; this is my second time using it. I had a good experience last time, so I decided to use it for a project with Monte Carlo simulation (hence the use of rand).

Is this expected? Having read all the documentation, I think prange should work well in an embarrassingly parallel case like this. I don't understand why this is failing to speed up the loop and even making it so much slower.

Some additional information:

I am running python 3.6, cython 0.26.
gcc version is "gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
CPU usage confirms the parallel version is actually using many cores (90% vs 25% of the serial case)

I appreciate any help you can provide. I tried first with numba and it did speed up the calculation but it has other problems that make me want to avoid it. I'd like Cython to work in this case.

Thanks!!!

`rand` probably isn't thread safe (and if it is, it will have expensive thread synchronization internally). See https://stackoverflow.com/questions/27824959/thread-safe-random-number-generation-with-cython/ and https://stackoverflow.com/questions/40976880/canonical-way-to-generate-random-numbers-in-cython. Essentially, your issues could be down to using `rand` rather than `prange` (and even if not, it's probably a bad idea). — DavidW, Sep 16 '17 at 19:23
Thanks @DavidW, I think you are right. I tried with sin() instead of rand() and the parallel version is faster, as expected. I'll post a new solution once I have it, probably based on your C++ example in the second link. — Luk17, Sep 16 '17 at 21:48
You should probably post that solution as an answer rather than an edit. thread_id will always be 0, 1, 2, 3... so you'll always get the same "random" numbers each time you run it. You might do something like time+thread_id — DavidW, Sep 17 '17 at 08:07
Thanks again! I am interested in setting up a distributed Monte Carlo framework, and this has helped me realize the particular challenge of parallel random number generators. — Luk17, Sep 17 '17 at 13:59

Luk17 · Accepted Answer · 2017-09-17T14:01:23.677

With DavidW's useful feedback and links, I have a multithreaded solution for random number generation. However, the time savings over a single-threaded (vectorized) NumPy solution are not that massive: the NumPy approach generates 100 million numbers (5GB in memory) in 1.2s, versus 0.7s for the multithreaded approach. Given the increased complexity (using C++ libraries, for example), I wonder if it's worth it. Maybe I will leave the random number generation single-threaded and work on parallelizing the calculations that follow this step. The exercise is, however, very useful for understanding the problems of random number generators. Ultimately, I'd like to have a framework that could work in a distributed environment, and I can see now that the challenge would be even larger for the random number generator, because generators carry a state that cannot be ignored.

%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
import cython
cimport numpy as np
import numpy as np
from cython.parallel cimport parallel, prange, threadid
cimport openmp

cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass mt19937:
        mt19937() # we need to define this constructor to stack allocate classes in Cython
        mt19937(unsigned int seed) # not worrying about matching the exact int type for seed

    cdef cppclass uniform_real_distribution[T]:
        uniform_real_distribution()
        uniform_real_distribution(T a, T b)
        T operator()(mt19937 gen) # ignore the possibility of using other classes for "gen"

@cython.boundscheck(False)
@cython.wraparound(False)        
def test_rnd_par(long size):
    cdef:
        mt19937 gen
        uniform_real_distribution[double] dist = uniform_real_distribution[double](0.0,1.0)
        narr = np.empty(size, dtype=np.dtype("double"))
        double [:] narr_view = narr
        long i

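    with nogil, parallel():
        # gen is assigned inside the parallel block, so Cython treats it as thread-private.
        # Caveat (see DavidW's comment): the thread number is always 0, 1, 2, ...,
        # so every run produces the same numbers; mix in e.g. the time for varying output.
        gen = mt19937(openmp.omp_get_thread_num())
        for i in prange(size, schedule='static'):
            narr_view[i] = dist(gen)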
    return narr

score 1 · Answer 2 · answered Sep 17 '17 at 19:22

I would like to note two things that might be worth your consideration:

A: If you look at the implementation of rand() in glibc, you will see that using rand() in a multi-threaded program leads to unspecified behavior: the sequence of produced numbers is always the same (assuming the same seed), but you cannot say which number will go to which thread because of possible race conditions. There is only one state, shared between all threads, and it needs to be protected by a lock, otherwise even worse things could happen:

long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}

From this code a possible workaround becomes clear, if we are not allowed to use C++11: every thread could have its own seed and we could use the rand_r() function.

This lock is the reason you cannot see any speed-up with the original version.
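
The difference between one locked, shared state and one private state per thread can be sketched in plain Python. This is only an illustration under stated assumptions, not the glibc or Cython code: `random.Random` stands in for a generator whose state the caller owns (as with rand_r()), the names `SharedLockedRng`, `draw_shared`, `draw_private` and `per_thread_draws` are hypothetical, and consecutive integer seeds are a simplification.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedLockedRng:
    """One generator state for all threads, guarded by a lock (the rand() situation)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._lock = threading.Lock()

    def next(self):
        # Every call from every thread has to wait for this single lock.
        with self._lock:
            return self._rng.random()


def draw_shared(rng, n):
    """Draw n numbers from the shared, locked generator."""
    return [rng.next() for _ in range(n)]


def draw_private(seed, n):
    """Draw n numbers from a generator owned by the calling worker alone."""
    rng = random.Random(seed)  # private state: no lock needed
    return [rng.random() for _ in range(n)]


def per_thread_draws(base_seed, n_threads, n_per_thread):
    """Give each worker its own seed (hypothetical scheme: base_seed + worker index)."""
    seeds = [base_seed + t for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda s: draw_private(s, n_per_thread), seeds))
```

With private states, the numbers each worker receives depend only on its own seed, so the result is reproducible regardless of how the threads are scheduled. In pure Python the GIL prevents a real speed-up; the sketch is only meant to show the structure.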

B: Why don't you see more speed-up with your C++11 solution? You produce 5GB of data and write it to memory, which makes it a fairly memory-bound task. If a single thread is working, the memory bandwidth is enough to transport the created data and the bottleneck is the calculation of the next random number. With two threads there is twice as much data, but no more memory bandwidth. So at some number of threads the memory bandwidth becomes the bottleneck, and adding more threads/cores no longer gives any speed-up.

So is there no gain in parallelizing the random number generation? The problem is not the random number generation but the amount of data written to memory: if each random number is consumed by the thread that created it, without being stored in RAM, parallelizing is a much better solution than producing the numbers in a single thread and distributing them:

You don't have to write these numbers to RAM.
You don't have to read these numbers from RAM.
You calculate them faster than with a single thread.
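
As a rough sketch of what "consumed by the same thread" could look like, here is a small Monte Carlo estimate of pi in plain Python. It is an illustration under assumptions, not your framework: the function names `count_hits` and `estimate_pi` and the seeding scheme (base seed plus worker index) are hypothetical. Each worker draws its numbers, uses them immediately, and hands back a single integer, so no array of random numbers is ever written to or read from RAM.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, n_samples):
    """Draw n_samples points in the unit square; count those inside the quarter circle."""
    rng = random.Random(seed)  # private generator for this worker
    hits = 0
    for _ in range(n_samples):
        x = rng.random()
        y = rng.random()
        # The two numbers are consumed right here and never stored.
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_workers, n_per_worker, base_seed=0):
    """Combine the per-worker hit counts into one estimate of pi."""
    seeds = [base_seed + w for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_hits = pool.map(lambda s: count_hits(s, n_per_worker), seeds)
        total_hits = sum(partial_hits)
    return 4.0 * total_hits / (n_workers * n_per_worker)
```

In pure Python the GIL keeps the threads from running the loop in parallel, so this only shows the data flow; the same shape inside a prange loop with a per-thread generator would be where the actual speed-up comes from.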

Thanks @ead! I didn't know what the C functions were doing internally, so your point A is great to know. Your point B suggests an interesting direction for my project. I was planning to set up a Monte Carlo framework with a pipeline structure: basic rand >> random variable >> function of random variable, inspired by sklearn's transformer interface. But you point to a limitation of this structure: by creating all simulations for each stage before moving to the next, I'd be limited by memory bandwidth in a way that integrating all calculations per simulation would not be (or less so). — Luk17, Sep 17 '17 at 20:38
