Q : "How to use prange
in cython?" . . . . + ( an Epilogue on True-[PARALLEL]
True-randomness ... )
Short version : best in those and only those places, where it yields performance gains.
Longer version :
Your problem starts not with avoiding a GIL-lock ownership, but with the Physics and with the performance losses from near-anti-pattern computing, irrespective of all the powers the cython-isation may have ever enabled.

The code as-is attempts to apply a 2D-kernel operator over the whole 2D-domain of the { -1 | +1 }-spin-field[N,M], best in some fast and smart manner.
The actual result is INCONGRUENT with the PHYSICAL ISING FIELD, because the technique of "destructively" self-rewriting the actual state of field[n_,m] right "during" the current generation of the [PAR][SEQ]-organised coverage of the 2D-domain of field[:,:] sequentially modifies the state of field[i,j] while that very generation of current spin values is still being evaluated, which obviously does not happen in the real world of the recognised Laws of Physics. Computers are ignorant of these rules; we, humans, should prefer not to be. ( A congruent sweep reads the previous generation and writes all the flips into a separate buffer, as sketched below. )
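( A minimal placeholder sketch of that read-old / write-new principle, not the asker's code; the beta and J values and the helper name one_congruent_generation() are illustrative assumptions only: )

import numpy as np

def one_congruent_generation( field_prev, beta = 0.4, J = 1.0,          # beta, J values are assumptions
                              rng = np.random.default_rng() ):
    """Placeholder sketch: every site decision reads ONLY field_prev,
       so no site ever sees a neighbour already flipped within this very generation."""
    N, M       = field_prev.shape
    field_next = field_prev.copy()                                       # a separate write-buffer
    for n in range( N ):
        for m in range( M ):
            dE = ( 2 * J * field_prev[n, m]
                         * ( field_prev[( n - 1 ) % N, m]
                           + field_prev[( n + 1 ) % N, m]
                           + field_prev[n, ( m - 1 ) % M]
                           + field_prev[n, ( m + 1 ) % M] ) )
            if dE <= 0 or np.exp( -dE * beta ) > rng.random():
                field_next[n, m] *= -1                                   # the flip lands only in the NEXT generation
    return field_next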
Next, the prange'd attempt calls a cdef-ed cy_spin_flip() ( M * N / 2 )-times, in a way that might've been easy to code, yet which is immensely inefficient, if not an outright performance anti-pattern, when ever run this way.
If one benchmarks the cost of invoking about 1E6 calls to a cy_spin_flip() repaired so as to become congruent with the Laws of Physics, one straight away sees that the per-call overheads start to matter, the more so when they are issued in a prange-d fashion ( the isolated, un-coordinated, memory-layout-agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost of going for a naive prange, instead of attempting some vectorised / block-optimised, memory-I/O-smarter matrix / kernel processing ( a micro-benchmark sketch of the per-call effect follows ).
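( A hedged micro-benchmark sketch of that per-call-overhead claim; the trivial_flip() helper, the 1000 x 1000 domain and the single-repetition timing are illustrative assumptions only: )

import numpy as np, timeit

N = M = 1000                                       # ~1E6 sites, an assumed problem size
field = np.random.randint( 2, size = ( N, M ), dtype = np.int8 ) * 2 - 1

def trivial_flip( f, n, m ):                       # the per-site work is tiny,
    f[n, m] *= -1                                  #     the per-call overhead is not

per_call = timeit.timeit( lambda: [ trivial_flip( field, n, m )
                                    for n in range( N )
                                    for m in range( M ) ],
                          number = 1 )
block    = timeit.timeit( lambda: np.negative( field, out = field ),
                          number = 1 )
print( f"~1E6 calls: {per_call:.3f} [s]   one block-op: {block:.6f} [s]" )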
Vectorised code using a 2D-kernel convolution :
A fast-sketched, vectorised code, using a trick proposed by a Master of Vectorisation @Divakar, can produce one step per ~ 3k3 [us] ( without CPU-architecture tuning or further tweaking ) on a spin_2Dstate[200,200] :
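( For completeness, a minimal setup sketch that the snippets below rely on; the concrete N, M, beta, J values are assumptions, not prescribed here: )

import numpy as np
from scipy import signal        # signal.convolve2d() does the 2D-kernel work below

N, M = 200, 200                 # an assumed 2D-domain size, matching the ~ 3k3 [us] note above
beta = 0.4                      # an assumed inverse temperature
J    = 1.0                      # an assumed coupling constant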
The initial state is :
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
                            [ 1, 0, 1 ],
                            [ 0, 1, 0 ]
                            ],
                          dtype = np.int8   # [PERF] to be field-tested,
                          )                 #        some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger 2D-domains ~ [ > 200, > 200 ] users will observe growing costs due to the useless amount of memory-I/O spent on 8-B-rich transfers of principally binary { -1 | +1 } spin-information, which could be stored even more compactly as a { 0 | 1 }-bitmap.
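A quick sketch of why the data type matters for the memory-I/O ( byte counts per the NumPy .nbytes attribute ):

import numpy as np

big = ( 1000, 1000 )
print( np.ones( big, dtype = np.int64 ).nbytes )                 # 8 000 000 [B] moved per full pass
print( np.ones( big, dtype = np.int8  ).nbytes )                 # 1 000 000 [B] ~ 8x less memory-I/O
print( np.packbits( np.ones( big, dtype = np.uint8 ) ).nbytes )  #   125 000 [B] as a { 0 | 1 }-bitmap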
Next, instead of double-looping a call on each field[:,:]-cell, rather block-update the full 2D-domain in one step; the helper becomes :
# T[:,:] * sum( of the 4 neighbouring spins )
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
                                                          spin_2Dkernel,
                                                          boundary = 'wrap',
                                                          mode     = 'same'
                                                          )[:,:]
Because of the Physics inside the spin-kernel properties, this helper array will consist of only { -4 | -2 | 0 | +2 | +4 } values.
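A one-line sanity check of that claim ( assuming the setup and the helper-update above were run ):

print( np.unique( spin_2Dconv ) )                                # spin * sum-of-4-neighbour-spins
assert set( np.unique( spin_2Dconv ).tolist() ) <= { -4, -2, 0, 2, 4 }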
A simplified, fast vector code :
def aVectorisedSpinUpdateSTEPrandom( S           = spin_2Dstate,
                                     C           = spin_2Dconv,
                                     K           = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
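A hedged usage sketch ( the aClk timer used in the measurements below is not defined in this excerpt, so time.perf_counter_ns() stands in for it here ):

from time import perf_counter_ns

for _ in range( 3 ):                               # a few full-domain update steps, reported in [us]
    t0 = perf_counter_ns()
    aVectorisedSpinUpdateSTEPrandom( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -2 * beta * J )
    print( ( perf_counter_ns() - t0 ) // 1000, "[us]" )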
For cases where the Physics does not recognise a uniform probability of a spin-flip happening across the whole 2D-domain at the same value, replace the scalar produced by np.random.rand() with a 2D-field-of-( individualised † )-probabilities delivered by np.random.rand( N, M )[:,:]; this will add some costs, up to some 7k3 ~ 9k3 [us] per spin-update step :
def aVectorisedSpinUpdateSTEPrand2D( S           = spin_2Dstate,
                                     C           = spin_2Dconv,
                                     K           = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
wide-screen commented as-was source :

# ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
#                                                                        https://www.desmos.com/calculator/bgz9t3s3nm
@cython.boundscheck( False )                                           # https://www.desmos.com/calculator/ttz3r735qy
@cython.wraparound(  False )                                           # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field,                             # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
                   float            beta                               # beta: a float-factor
                   ):                                                  #
    cdef int N = field.shape[0]                                        # const
    cdef int M = field.shape[1]                                        # const
    cdef int offset = np.random.randint( 0, 2 )      #_GIL-lock        # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an initial RANDOM-flip-MODE-choice-{0|1}
    cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
    cdef int m, n, i, j                                                #                                                  idxs{ (E) | [O] }
    #                                                                  #
    for     m in prange( M, nogil = True ):                            # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
        i = m % 2                                                      #       ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
        for j in range( n_update.shape[0] ) :                          # [SEQ]                      j over ........ ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
        #   n = n_update[j]                                            #                            n = n_update[j] |||||||||
        #   cy_spin_flip( field, ( n + i ) % N, m % M, beta )          #                                            |||||||||
        #   |||||                                                      # INCONGRUENT with PHYSICAL FIELD ISING     |||||||||
        #   vvvvv                                                      # self-rewriting field[n_,m] "during" the current generation of [PAR][SEQ]-organised coverage of the 2D-field[:,:]
            pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
        #                                                              #                                            |||||||||
    return np.array( field, dtype = np.int64 )                         #                                            ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field,                             # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
                   int              n,                                 # const int
                   int              m,                                 # const int
                   float            beta = 0.4,                        # const float ? is a pure positive scalar or can also be negative ?
                   float            J    = 1.0                         # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
                   ):
    cdef int   N = field.shape[0]                                      # const int ?   [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
    cdef int   M = field.shape[1]                                      # const int ?   [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
    cdef float dE = ( 2 * J * field[ n,            m            ]      # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
                       *( field[ ( n - 1 ) % N,    m            ]      #        |      (const)                                       vvvv------------aSureSpinFLIP
                        + field[ ( n + 1 ) % N,    m            ]      # [?]-T[n,m]-[?]  sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? )    |::::|
                        + field[   n,            ( m - 1 ) % M  ]      #        |                                                    := {-8J |-4J | 0 | 4J | 8J }
                        + field[   n,            ( m + 1 ) % M  ]      # [?]    a T-dependent choice           |__if_+T__| |__if_-T__| FLIP @random-scaled by 2*J*beta
                          )                                            #        |                     |        ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY )
                      )                                                #        |                     |
    if dE <= 0 :                                                       #        |                     |
        field[ n, m ] *= -1                                            # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy)
    elif ( np.exp( -dE * beta )                                        #        |                     |   [PERF]: with a minusBETA, one MUL uop SAVED * M * N
         > np.random.rand()    #__________|_____________|__________GIL-lock    [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS ???
           ):                                                          #        |                     |   Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
        field[ n, m ] *= -1                                            # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
        #                                                              #               in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The Physics:
The beta-controlled ( given a const J ) model of the spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 }, which are the only cases ~ 2 * J * the spin_2Dkernel-convolution can ever produce across the whole 2D-domain of the current spin_2Dstate, is available here : https://www.desmos.com/calculator/bgz9t3s3nm. One may live-experiment with beta to see the lowering threshold for either of the possible positive outputs { +4 | +8 }, as the np.exp( -dE * beta )-term ( np.exp( C[:,:] * minus2betaJ ) in the vectorised form above ) is strongly controlled by beta : the larger the beta, the lower the probability that a randomly drawn number, warranted to come from the semi-closed range [0, 1), will not dominate the np.exp()-result, i.e. the less likely an energetically unfavourable flip becomes.
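A small numeric sketch of those thresholds, using the dE = 2 * J * ( spin * sum-of-neighbours ) convention from cy_spin_flip() above ( beta = 0.4, J = 1.0 are assumed values ):

import numpy as np

beta, J = 0.4, 1.0                                          # assumed values
for s in ( -4, -2, 0, +2, +4 ):                             # the only possible ( spin * neighbour-sum ) cases
    p_flip = min( 1.0, np.exp( -2 * J * beta * s ) )        # Metropolis acceptance for dE = 2 * J * s
    print( f"sum = {s:+d}   dE = {2 * J * s:+.0f}   P(flip) = {p_flip:.4f}" )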
† An Epilogue on a Post-Festum Remark :
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."
– Angelo C 7 hours ago
Thanks for all the remarks & details on the method and your choices. The "most-(densely)-aggressive" spin updates over a pair of non-"intervening" sub-lattices require a more careful choice of the strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance ( a technical issue on its own being how to maintain the FSA-state, if one resorts to a naive, central PRNG-source ).
You either design your process to be truly randomness-based ( using some of the available sources of indeed non-deterministic entropy ), or you are willing to subordinate it to a policy of allowing repeatable experiments ( for re-inspection & re-validation of scientific computing ), for which you have one more duty - a duty of Configuration Management of such a scientific experiment ( to record / setup / distribute / manage the initial "seeding" of all the PRNG-s that the scientific computing experiment is configured to use ).
Here, given that nature warrants the spins to be mutually independent across the 2D-domain of the field[:,:], the direction of the time-arrow ought to be the only direction in which such ( deterministic ) PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1). As a side-effect of that, they will cause no problems for a parallelisation of the individual evolutions of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way of doing this ( be warned, if not already aware of it, that not all "COTS" PRNG-s have all these properties "built-in" ).
That means, either of the spins will remain fair & congruent with the Laws of Physics if and only if it sources its spin-flip decision threshold from its "own" ( thus congruently autonomous, so as to retain the uniformity of the distribution of outputs ) PRNG-instance ( not a problem, but care is needed not to forget to implement it right & to run it efficiently ), as sketched below.
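A sketch of one way to get such per-worker, reproducibly seeded yet independently evolving PRNG instances in NumPy ( an illustration of the principle, not a claim about the asker's code ): np.random.SeedSequence records one master seed for the Configuration Management duty and spawns statistically independent child streams, one per parallel worker:

import numpy as np

master   = np.random.SeedSequence( 20200607 )      # ONE recorded seed == a repeatable experiment
children = master.spawn( 8 )                       # 8 independent child sequences, e.g. one per worker
rngs     = [ np.random.default_rng( c ) for c in children ]

# each worker draws its own uniform [0,1) spin-flip thresholds, never touching any shared PRNG state
thresholds_per_worker = [ rng.random( size = ( 25, 200 ) ) for rng in rngs ]   # 8 x 25 rows of a 200 x 200 field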
For cases that need to operate an indeed non-deterministic PRNG, the source of truly non-deterministic entropy may become a performance bottleneck, if one tries to use it beyond its performance ceiling. A fight for nature-like entropy is a challenging task in a domain of ( no matter how large, yet still ) Finite-State-Automata, isn't it?