Q : "How to use prange
in cython?" . . . . + ( an Epilogue on True-[PARALLEL]
True-randomness ... )
Short version : best in those and only those places, where it yields performance gains.
Longer version :
Your problem starts not with avoiding a GIL-lock ownership, but with the Physics and with the performance losses from near-anti-pattern computing, irrespective of all the powers the cython-isation may have ever enabled.

The code as-is attempts to apply a 2D-kernel operator over the whole 2D-domain of the { -1 | +1 }-spin-field[N,M], best in some fast and smart manner.
The actual result is INCONGRUENT with the PHYSICAL ISING FIELD, because the technique of "destructively" self-rewriting the actual state of field[n_,m] right "during" the current generation of the [PAR][SEQ]-organised coverage of the 2D-domain of field[:,:] sequentially modifies the state of field[i,j] while that very generation of current spin values is still being evaluated, which obviously does not happen in the real world of the recognised Laws of Physics. Computers are ignorant of these rules; we, humans, should prefer not to be. ( A congruent sweep reads the previous generation and writes all the flips into a separate buffer, as sketched below. )
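( A minimal placeholder sketch of that read-old / write-new principle, not the asker's code; the beta and J values and the helper name one_congruent_generation() are illustrative assumptions only: )

import numpy as np

def one_congruent_generation( field_prev, beta = 0.4, J = 1.0,          # beta, J values are assumptions
                              rng = np.random.default_rng() ):
    """Placeholder sketch: every site decision reads ONLY field_prev,
       so no site ever sees a neighbour already flipped within this very generation."""
    N, M       = field_prev.shape
    field_next = field_prev.copy()                                       # a separate write-buffer
    for n in range( N ):
        for m in range( M ):
            dE = ( 2 * J * field_prev[n, m]
                         * ( field_prev[( n - 1 ) % N, m]
                           + field_prev[( n + 1 ) % N, m]
                           + field_prev[n, ( m - 1 ) % M]
                           + field_prev[n, ( m + 1 ) % M] ) )
            if dE <= 0 or np.exp( -dE * beta ) > rng.random():
                field_next[n, m] *= -1                                   # the flip lands only in the NEXT generation
    return field_next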
Next, the prange'd attempt calls a cdef-ed cy_spin_flip() ( M * N / 2 )-times, in a way that might've been easy to code, yet which is immensely inefficient, if not an outright performance anti-pattern, when ever run this way.
If one benchmarks the cost of invoking about 1E6 calls to a cy_spin_flip() repaired so as to become congruent with the Laws of Physics, one straight away sees that the per-call overheads start to matter, the more so when they are issued in a prange-d fashion ( the isolated, un-coordinated, memory-layout-agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost of going for a naive prange, instead of attempting some vectorised / block-optimised, memory-I/O-smarter matrix / kernel processing ( a micro-benchmark sketch of the per-call effect follows ).
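( A hedged micro-benchmark sketch of that per-call-overhead claim; the trivial_flip() helper, the 1000 x 1000 domain and the single-repetition timing are illustrative assumptions only: )

import numpy as np, timeit

N = M = 1000                                       # ~1E6 sites, an assumed problem size
field = np.random.randint( 2, size = ( N, M ), dtype = np.int8 ) * 2 - 1

def trivial_flip( f, n, m ):                       # the per-site work is tiny,
    f[n, m] *= -1                                  #     the per-call overhead is not

per_call = timeit.timeit( lambda: [ trivial_flip( field, n, m )
                                    for n in range( N )
                                    for m in range( M ) ],
                          number = 1 )
block    = timeit.timeit( lambda: np.negative( field, out = field ),
                          number = 1 )
print( f"~1E6 calls: {per_call:.3f} [s]   one block-op: {block:.6f} [s]" )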
Vectorised code using a 2D-kernel convolution :
A fast-sketched, vectorised code, using a trick proposed by a Master of Vectorisation @Divakar, can produce one step per ~ 3k3 [us] ( without CPU-architecture tuning or further tweaking ) on a spin_2Dstate[200,200] :
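( For completeness, a minimal setup sketch that the snippets below rely on; the concrete N, M, beta, J values are assumptions, not prescribed here: )

import numpy as np
from scipy import signal        # signal.convolve2d() does the 2D-kernel work below

N, M = 200, 200                 # an assumed 2D-domain size, matching the ~ 3k3 [us] note above
beta = 0.4                      # an assumed inverse temperature
J    = 1.0                      # an assumed coupling constant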
The initial state is :
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
                            [ 1, 0, 1 ],
                            [ 0, 1, 0 ]
                            ],
                          dtype = np.int8   # [PERF] to be field-tested,
                          )                 #        some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger 2D-domains ~ [ > 200, > 200 ] users will observe growing costs due to the useless amount of memory-I/O spent on 8-B-rich transfers of principally binary { -1 | +1 } spin-information, which could be stored even more compactly as a { 0 | 1 }-bitmap.
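A quick sketch of why the data type matters for the memory-I/O ( byte counts per the NumPy .nbytes attribute ):

import numpy as np

big = ( 1000, 1000 )
print( np.ones( big, dtype = np.int64 ).nbytes )                 # 8 000 000 [B] moved per full pass
print( np.ones( big, dtype = np.int8  ).nbytes )                 # 1 000 000 [B] ~ 8x less memory-I/O
print( np.packbits( np.ones( big, dtype = np.uint8 ) ).nbytes )  #   125 000 [B] as a { 0 | 1 }-bitmap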
Next, instead of double-looping a call on each field[:,:]-cell, rather block-update the full 2D-domain in one step; the helper becomes :
# T[:,:] * sum( of the 4 neighbouring spins )
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
                                                          spin_2Dkernel,
                                                          boundary = 'wrap',
                                                          mode     = 'same'
                                                          )[:,:]
Because of the Physics inside the spin-kernel properties, this helper array will consist of only { -4 | -2 | 0 | +2 | +4 } values.
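A one-line sanity check of that claim ( assuming the setup and the helper-update above were run ):

print( np.unique( spin_2Dconv ) )                                # spin * sum-of-4-neighbour-spins
assert set( np.unique( spin_2Dconv ).tolist() ) <= { -4, -2, 0, 2, 4 }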
A simplified, fast vector code :
def aVectorisedSpinUpdateSTEPrandom( S           = spin_2Dstate,
                                     C           = spin_2Dconv,
                                     K           = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
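A hedged usage sketch ( the aClk timer used in the measurements below is not defined in this excerpt, so time.perf_counter_ns() stands in for it here ):

from time import perf_counter_ns

for _ in range( 3 ):                               # a few full-domain update steps, reported in [us]
    t0 = perf_counter_ns()
    aVectorisedSpinUpdateSTEPrandom( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -2 * beta * J )
    print( ( perf_counter_ns() - t0 ) // 1000, "[us]" )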
For cases where the Physics does not recognise a uniform probability of a spin-flip happening across the whole 2D-domain at the same value, replace the scalar produced by np.random.rand() with a 2D-field-of-( individualised † )-probabilities delivered by np.random.rand( N, M )[:,:]; this will add some costs, up to some 7k3 ~ 9k3 [us] per spin-update step :
def aVectorisedSpinUpdateSTEPrand2D( S           = spin_2Dstate,
                                     C           = spin_2Dconv,
                                     K           = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
wide-screen commented as-was source :

# ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
#                                                                        https://www.desmos.com/calculator/bgz9t3s3nm
@cython.boundscheck( False )                                           # https://www.desmos.com/calculator/ttz3r735qy
@cython.wraparound(  False )                                           # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field,                             # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
                   float            beta                               # beta: a float-factor
                   ):                                                  #
    cdef int N = field.shape[0]                                        # const
    cdef int M = field.shape[1]                                        # const
    cdef int offset = np.random.randint( 0, 2 )      #_GIL-lock        # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an initial RANDOM-flip-MODE-choice-{0|1}
    cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
    cdef int m, n, i, j                                                #                                                  idxs{ (E) | [O] }
    #                                                                  #
    for     m in prange( M, nogil = True ):                            # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
        i = m % 2                                                      #       ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
        for j in range( n_update.shape[0] ) :                          # [SEQ]                      j over ........ ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
        #   n = n_update[j]                                            #                            n = n_update[j] |||||||||
        #   cy_spin_flip( field, ( n + i ) % N, m % M, beta )          #                                            |||||||||
        #   |||||                                                      # INCONGRUENT with PHYSICAL FIELD ISING     |||||||||
        #   vvvvv                                                      # self-rewriting field[n_,m] "during" the current generation of [PAR][SEQ]-organised coverage of the 2D-field[:,:]
            pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
        #                                                              #                                            |||||||||
    return np.array( field, dtype = np.int64 )                         #                                            ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field,                             # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
                   int              n,                                 # const int
                   int              m,                                 # const int
                   float            beta = 0.4,                        # const float ? is a pure positive scalar or can also be negative ?
                   float            J    = 1.0                         # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
                   ):
    cdef int   N = field.shape[0]                                      # const int ?   [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
    cdef int   M = field.shape[1]                                      # const int ?   [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
    cdef float dE = ( 2 * J * field[ n,            m            ]      # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
                       *( field[ ( n - 1 ) % N,    m            ]      #        |      (const)                                       vvvv------------aSureSpinFLIP
                        + field[ ( n + 1 ) % N,    m            ]      # [?]-T[n,m]-[?]  sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? )    |::::|
                        + field[   n,            ( m - 1 ) % M  ]      #        |                                                    := {-8J |-4J | 0 | 4J | 8J }
                        + field[   n,            ( m + 1 ) % M  ]      # [?]    a T-dependent choice           |__if_+T__| |__if_-T__| FLIP @random-scaled by 2*J*beta
                          )                                            #        |                     |        ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY )
                      )                                                #        |                     |
    if dE <= 0 :                                                       #        |                     |
        field[ n, m ] *= -1                                            # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy)
    elif ( np.exp( -dE * beta )                                        #        |                     |   [PERF]: with a minusBETA, one MUL uop SAVED * M * N
         > np.random.rand()    #__________|_____________|__________GIL-lock    [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS ???
           ):                                                          #        |                     |   Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
        field[ n, m ] *= -1                                            # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
        #                                                              #               in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The Physics:
The beta-controlled ( given a const J ) model of the spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 }, which are the only cases ~ 2 * J * the spin_2Dkernel-convolution can ever produce across the whole 2D-domain of the current spin_2Dstate, is available here : https://www.desmos.com/calculator/bgz9t3s3nm. One may live-experiment with beta to see the lowering threshold for either of the possible positive outputs { +4 | +8 }, as the np.exp( -dE * beta )-term ( np.exp( C[:,:] * minus2betaJ ) in the vectorised form above ) is strongly controlled by beta : the larger the beta, the lower the probability that a randomly drawn number, warranted to come from the semi-closed range [0, 1), will not dominate the np.exp()-result, i.e. the less likely an energetically unfavourable flip becomes.
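A small numeric sketch of those thresholds, using the dE = 2 * J * ( spin * sum-of-neighbours ) convention from cy_spin_flip() above ( beta = 0.4, J = 1.0 are assumed values ):

import numpy as np

beta, J = 0.4, 1.0                                          # assumed values
for s in ( -4, -2, 0, +2, +4 ):                             # the only possible ( spin * neighbour-sum ) cases
    p_flip = min( 1.0, np.exp( -2 * J * beta * s ) )        # Metropolis acceptance for dE = 2 * J * s
    print( f"sum = {s:+d}   dE = {2 * J * s:+.0f}   P(flip) = {p_flip:.4f}" )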
† An Epilogue on a Post-Festum Remark :
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."
– Angelo C 7 hours ago
Thanks for all the remarks & details on the method and your choices. The "most-(densely)-aggressive" spin updates over a pair of non-"intervening" sub-lattices require a more careful choice of the strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance ( a technical issue on its own being how to maintain the FSA-state, if one resorts to a naive, central PRNG-source ).
You either design your process to be truly randomness-based ( using some of the available sources of indeed non-deterministic entropy ), or you are willing to subordinate it to a policy of allowing repeatable experiments ( for re-inspection & re-validation of scientific computing ), for which you have one more duty - a duty of Configuration Management of such a scientific experiment ( to record / setup / distribute / manage the initial "seeding" of all the PRNG-s that the scientific computing experiment is configured to use ).
Here, given that nature warrants the spins to be mutually independent across the 2D-domain of the field[:,:], the direction of the time-arrow ought to be the only direction in which such ( deterministic ) PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1). As a side-effect of that, they will cause no problems for a parallelisation of the individual evolutions of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way of doing this ( be warned, if not already aware of it, that not all "COTS" PRNG-s have all these properties "built-in" ).
That means, either of the spins will remain fair & congruent with the Laws of Physics if and only if it sources its spin-flip decision threshold from its "own" ( thus congruently autonomous, so as to retain the uniformity of the distribution of outputs ) PRNG-instance ( not a problem, but care is needed not to forget to implement it right & to run it efficiently ), as sketched below.
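A sketch of one way to get such per-worker, reproducibly seeded yet independently evolving PRNG instances in NumPy ( an illustration of the principle, not a claim about the asker's code ): np.random.SeedSequence records one master seed for the Configuration Management duty and spawns statistically independent child streams, one per parallel worker:

import numpy as np

master   = np.random.SeedSequence( 20200607 )      # ONE recorded seed == a repeatable experiment
children = master.spawn( 8 )                       # 8 independent child sequences, e.g. one per worker
rngs     = [ np.random.default_rng( c ) for c in children ]

# each worker draws its own uniform [0,1) spin-flip thresholds, never touching any shared PRNG state
thresholds_per_worker = [ rng.random( size = ( 25, 200 ) ) for rng in rngs ]   # 8 x 25 rows of a 200 x 200 field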
For cases that need to operate an indeed non-deterministic PRNG, the source of truly non-deterministic entropy may become a performance bottleneck, if one tries to use it beyond its performance ceiling. A fight for nature-like entropy is a challenging task in a domain of ( no matter how large, yet still ) Finite-State-Automata, isn't it?