
This is my first post. I am trying to write an Ising model simulation using a Monte Carlo algorithm. In order to optimise the code I am currently trying to use Cython, but for now the execution times of the Python and Cython programs are identical. If anyone has an idea for either improving the Cython implementation or making the program run faster without Cython, please let me know.

Thanks

PS: here is the code

#cython: language_level=3
import numpy as np
import time
cimport numpy as np




cpdef gen_neighbors_2d(int a):
    cdef int n = a * a
    cdef np.ndarray neighbors = np.zeros((n, 4), dtype=np.int32)

    cdef int i, j, index
    for i in range(a):
        for j in range(a):
            index = i * a + j
            neighbors[index, 0] = i * a + (j - 1) % a # left
            neighbors[index, 1] = i * a + (j + 1) % a # right
            neighbors[index, 2] = ((i - 1) % a) * a + j # up
            neighbors[index, 3] = ((i + 1) % a) * a + j # down

    return neighbors

cpdef int MH_single_flip(np.ndarray[int, ndim=2] neighbors_list, double T, int iterations, int size):
    cdef np.ndarray[int, ndim=1] spins = np.random.choice([-1, 1], size=size).astype(np.intc)  # cast so the dtype matches the declared C int buffer

    cdef int step, n, spin, delta_E
    for step in range(iterations):
        n = np.random.randint(0, size)
        spin = spins[n]

        # energy change from flipping spin n: sum over its four neighbours
        delta_E = 2 * spin * np.add.reduce(spins[neighbors_list[n]])

        # Metropolis criterion: always accept if delta_E < 0, otherwise accept with probability exp(-delta_E/T)
        if delta_E < 0 or -T*np.log(np.random.random(1)[0]) > delta_E:
            spins[n] = -spin

    return spins




neighbors_list = gen_neighbors_2d(100)

T_list = np.linspace(1, 4, 1000)
size = neighbors_list.shape[0]
t1 = time.time()
for T in T_list:
    blank_var = MH_single_flip(neighbors_list, T, 5000, size)
t2 = time.time()
print(t2-t1)

At first I tried to run the Cython code without any variable definitions, but it only made the code run 5 percent faster. None of the upgrades I added after that improved performance by more than 1 percent. Any help is welcome!
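
For reference, here is a minimal build script for the snippet above, assuming it is saved as `ising.pyx` (the file and module names are illustrative, not from the thread):

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

# build in place with: python setup.py build_ext --inplace
setup(ext_modules=cythonize([
    Extension("ising", ["ising.pyx"], include_dirs=[numpy.get_include()])
]))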

  • Do you know what is taking all the time? If you don't know where the time is being used, then you are just shooting in the dark. – Tim Roberts Mar 19 '23 at 18:19
  • You're making 5,000,000 calls to `np.add.reduce` and 10,000,000 calls to `random`. Those are already in C and won't be sped up by Cython. I'm not convinced your `neighbors_list` is a net win here, but I haven't tried an alternative. – Tim Roberts Mar 19 '23 at 18:32
  • It's worth reading https://stackoverflow.com/questions/70821944/what-parts-of-a-numpy-heavy-function-can-i-accelerate-with-cython for a bunch of generic hints (some of which probably apply here) – DavidW Mar 19 '23 at 19:43
  • The line `delta_E = 2 * spin * np.add.reduce(spins[neighbors_list[n]])` is taking the most time. I think you are right about why Cython does not speed it up significantly; I had totally forgotten that numpy and random were already in C, thank you! I'll try to optimise differently and replace the neighbor-map system. – Martin Baptiste Mar 21 '23 at 21:53
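
Pinpointing hot spots like the one in the last comment can be done with the standard-library profiler; a minimal sketch (the temperature value 2.0 is arbitrary, and note that Cython functions only expose their internals to Python profilers when compiled with the `# cython: profile=True` directive):

import cProfile

# run one call of the sampler and sort the report by cumulative time
cProfile.run("MH_single_flip(neighbors_list, 2.0, 5000, size)", sort="cumtime")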

2 Answers


This is not a complete answer, but the time gets cut nearly in half if I create the neighbors list as a list of tuples instead of a numpy array and use the built-in sum over a generator expression instead of a call to np.add.reduce. The benefits of numpy show up with large arrays; here you pay a lot of overhead for a list of 4 items.

import numpy as np
import time

def gen_neighbors_2d(a):
    n = a * a
    neighbors = []

    for i in range(a):
        for j in range(a):
            index = i * a + j
            neighbors.append((
                i * a + (j - 1) % a, # left
                i * a + (j + 1) % a, # right
                ((i - 1) % a) * a + j, # up
                ((i + 1) % a) * a + j # down
            ))

    return neighbors

def MH_single_flip(neighbors_list, T, iterations, size):
    spins = np.random.choice([-1, 1], size=size)

    for _ in range(iterations):
        n = np.random.randint(0, size)
        spin = spins[n]

        delta_E = 2 * spin * sum(spins[j] for j in neighbors_list[n])

        if delta_E < 0 or -T*np.log(np.random.random(1)[0]) > delta_E:
            spins[n] = -1*spin

    return spins

neighbors_list = gen_neighbors_2d(100)

T_list = np.linspace(1, 4, 100)
size = len(neighbors_list)
t1 = time.time()
for T in T_list:
    blank_var = MH_single_flip(neighbors_list, T, 5000, size)
t2 = time.time()
print(t2-t1)
Tim Roberts
  • Hello, thank you for your answer! I tried your code and it indeed improves the run time. I'm starting to suspect that the neighbor map may not be that efficient an idea. I had originally written code that determined the neighbors within the for loop using `S[(i+1)%L, j] + S[i, (j+1)%L] + S[(i-1)%L, j] + S[i, (j-1)%L]`. That implementation was much faster, but a researcher insisted that a neighbor map was the way to go. I'll post an update when I see him again. – Martin Baptiste Mar 21 '23 at 21:49
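
For illustration, the inline-neighbor variant described in this comment might look like the following plain-Python sketch; the 2D lattice `S`, the side length `L`, and the function name are assumptions, not code from the thread:

import numpy as np

def mh_single_flip_inline(L, T, iterations):
    rng = np.random.default_rng()
    # 2D lattice of +/-1 spins; neighbors are computed on the fly with modular arithmetic
    S = rng.choice(np.array([-1, 1]), size=(L, L))
    for _ in range(iterations):
        i = rng.integers(L)
        j = rng.integers(L)
        # sum of the four nearest neighbors under periodic boundary conditions
        nn_sum = S[(i+1) % L, j] + S[i, (j+1) % L] + S[(i-1) % L, j] + S[i, (j-1) % L]
        delta_E = 2 * S[i, j] * nn_sum
        # Metropolis criterion: accept if delta_E < 0 or with probability exp(-delta_E/T)
        if delta_E < 0 or rng.random() < np.exp(-delta_E / T):
            S[i, j] = -S[i, j]
    return S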

Let's time your implementation as a benchmark reference:

%timeit MH_single_flip(neighbors_list, T, 5000, size)

>>> 15 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here are some hints, in no particular order (you can also find some of them in DavidW's excellent answer linked in the comments):

  • Calling numpy functions like np.random.randint and np.random.random with scalar arguments inside a loop is pretty inefficient, as function calls in Python have noticeable overhead. Since numpy ufuncs call C code under the hood, most of the time is spent checking the function arguments at the Python level rather than on the calculation itself. You should always try to do as much (vectorized) calculation as possible within a single numpy function call. That way, you benefit from the fast calculations under the hood and don't need to worry much about the Python overhead.

  • Similarly, np.add.reduce only computes the sum of four integers, so it's dominated by the Python overhead. Here, a plain C function is a reasonable and faster alternative.

  • You should prefer typed memoryviews to the old np.ndarray buffer syntax. The syntax is cleaner, and they are slightly faster.

  • You can trade safety for speed by disabling the index checks on the numpy arrays / memoryviews via the Cython directive boundscheck(False).

Here's a code snippet:

%%cython

import numpy as np
from cython import wraparound, boundscheck, cdivision, initializedcheck

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
@cdivision(True)
def gen_neighbors_2d(int a):
    cdef int n = a * a
    cdef int[:, ::1] neighbors = np.zeros((n, 4), dtype=np.int32)
    cdef int i, j, index
    for i in range(a):
        for j in range(a):
            index = i * a + j
            # (j + a - 1) and (i + a - 1) keep the left operand non-negative, so the
            # C-style % enabled by cdivision(True) still wraps the periodic boundary correctly
            neighbors[index, 0] = i * a + (j + a - 1) % a # left
            neighbors[index, 1] = i * a + (j + 1) % a # right
            neighbors[index, 2] = ((i + a - 1) % a) * a + j # up
            neighbors[index, 3] = ((i + 1) % a) * a + j # down
    return np.asarray(neighbors)

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
cdef int calculate_delta_e(int spin, long[::1] spins, int[::1] neighbors_list):
    return 2 * spin * (spins[neighbors_list[0]] + spins[neighbors_list[1]] + spins[neighbors_list[2]] + spins[neighbors_list[3]])

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
def MH_single_flip(int[:, ::1] neighbors_list, double T, int iterations, int size):
    cdef long[::1] spins = np.random.choice([-1, 1], size=size)
    cdef int[::1] nns = np.random.randint(0, size, iterations, dtype=np.int32)
    cdef double[::1] log_rand_nums = np.log(np.random.random(iterations))
    cdef int i, delta_E, n, spin
    for i in range(iterations):
        n = nns[i]
        spin = spins[n]
        delta_E = calculate_delta_e(spin, spins, neighbors_list[n])
        if delta_E < 0 or -T*log_rand_nums[i] > delta_E:
            spins[n] = -1*spin
    return np.asarray(spins)

On my machine, this is roughly 70 times faster than your version:

%timeit MH_single_flip(neighbors_list, T, 5000, size)

>>> 213 µs ± 527 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
joni
  • Hello, thank you for your answer. Indeed your code runs faster. It was a really good idea to generate the random numbers and do those calculations outside of the loop. I think I will keep these ideas, but I believe that gen_neighbors_2d may not be such a good idea, as the code seems to be slower when using it instead of calculating the neighbors inside the loop with `S[(i+1)%L, j] + S[i, (j+1)%L] + S[(i-1)%L, j] + S[i, (j-1)%L]`. – Martin Baptiste Mar 28 '23 at 13:48