Here is how I would rewrite the MATLAB code as equivalent Python code (beware that indices start at 1 in MATLAB and at 0 in Python; since I cannot know how this should be adapted without more context, I went with the simplest approach):
import numpy as np
def func(f_arr, nech, n):
    """For each start index i, average the squared differences between
    f_arr[i] and every later sample up to index n - 1."""
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        base = f_arr[i]
        # Accumulate squared increments for every lag reachable from i.
        for lag in range(n - i):
            sf2[i] += (f_arr[i + lag] - base) ** 2
            count[i] += 1
        sf2[i] /= count[i]
    return sf2
Note that count[i] += 1
is useless, since the final value of count[i]
is known in advance — in fact the whole count
array is unnecessary, e.g.:
import numpy as np
def func2(f_arr, nech, n):
    """Same computation as func, but dividing by the known term count
    n - i instead of tallying it in a separate array."""
    sf2 = np.zeros(nech)
    for i in range(nech):
        acc = 0.0
        for j in range(n - i):
            acc += (f_arr[i + j] - f_arr[i]) ** 2
        # The inner loop runs exactly n - i times, so that is the divisor.
        sf2[i] = acc / (n - i)
    return sf2
Speed-up
This is a textbook case for a Numba speed-up. It is as easy as adding/using a Numba @njit
decorator:
import numba as nb
# JIT-compile the two pure-Python implementations defined above;
# the wrapped functions compute the same results at machine-code speed.
func_nb = nb.njit(func)
func2_nb = nb.njit(func2)
Now, func
, func_nb
, func2
and func2_nb
all perform the same computation:
# Sanity check: the pure-Python and Numba-compiled variants all agree.
nech = n = 6
f_arr = np.arange(n)
print(func(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
print(func_nb(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
print(func2(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
print(func2_nb(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
If you really need to stick to Cython, here is an implementation based on func2
:
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
import cython as cy
cpdef func2_cy(f_arr, nech, n):
    # Python-facing wrapper: allocates the output, coerces the input to
    # float64, and dispatches to the typed C-level loop below.
    sf2 = np.zeros(nech)
    # np.float_ was removed in NumPy 2.0; np.float64 is the portable spelling.
    _func2_cy(f_arr.astype(np.float64), sf2, nech, n)
    return sf2
cdef _func2_cy(double[:] f_arr, double[:] sf2, cy.int nech, cy.int n):
    # Typed memoryview loops: no Python interaction, runs at C speed.
    # The inner loop starts at 1 because the j == 0 term is always zero.
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] = sf2[i] + (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] = sf2[i] / (n - i)
This is significantly more complicated to write compared to Numba, but achieves similar performances.
The trick is to have a function _func2_cy
which operates with little to no interaction with Python (read: it goes at C speed).
Again the result is the same as func2
:
# The Cython build reproduces the reference output as well.
nech = n = 6
f_arr = np.arange(n)
print(func2_cy(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
Timings
With some little toy benchmarking we get a feeling of the speed-ups, including a comparable function written as you did, and the vectorized solution proposed in Andras Deak's very fine answer (but fixing the indices to match the above):
def func_OP(f, nech, n):
    """The question's original loop, kept deliberately un-optimized for a
    fair benchmark; only the index range is corrected."""
    sf2 = np.zeros(n)
    count = np.zeros(n)
    for i in range(nech):
        # <-- indexing fixed: iterate only over the n - i valid lags
        for j in np.arange(n - i):
            sf2[i] += np.power((f[i + j] - f[i]), 2)
            count[i] += 1
        sf2[i] = sf2[i] / count[i]
    return sf2
func_OP_nb = nb.njit(func_OP)  # Numba-compiled variant of the OP's loop.
def func_pdisty(f, nech, n):
    """Vectorized version (from Andras Deak's answer, indices fixed):
    all pairwise squared differences at once via pdist, then summed per
    starting index with np.add.at."""
    # Local import: `scipy` is not imported anywhere at the top of this file,
    # so the bare `scipy.spatial.distance` reference would raise NameError.
    from scipy.spatial.distance import pdist
    res = np.zeros(nech)
    # Condensed distance matrix: entries ordered (0,1), (0,2), ..., (n-2,n-1).
    dists = pdist(f[:, None], metric='sqeuclidean')
    # Number of pairs starting at each index i: n - 1 - i.
    counts = np.arange(n - 1, n - nech - 1, -1)
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    # Divide by n - i (the loop versions include the zero j == 0 term).
    res /= (counts + 1)
    return res
# func_OP and the vectorized func_pdisty also match the reference output.
nech = n = 6
f_arr = np.arange(n)
print(func_OP(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
print(func_pdisty(f_arr, nech, n))
# [9.16666667 6. 3.5 1.66666667 0.5 0. ]
# Benchmarks at increasing sizes (timings recorded on the author's machine).
nech = n = 1000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 1.5 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 567 ms per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 352 ms per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.87 ms per loop
%timeit func_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.7 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 1000 loops, best of 5: 768 µs per loop
%timeit func_pdisty(f_arr, nech, n)
# 10 loops, best of 5: 44.5 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 1000 loops, best of 5: 1 ms per loop
nech = n = 2000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 6.01 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 2.3 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 1.42 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 100 loops, best of 5: 7.31 ms per loop
%timeit func_nb(f_arr, nech, n)
# 100 loops, best of 5: 6.82 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 3.05 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 344 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 3.95 ms per loop
nech = n = 4000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 24.3 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 9.27 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 5.71 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 10 loops, best of 5: 29 ms per loop
%timeit func_nb(f_arr, nech, n)
# 10 loops, best of 5: 27.3 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 12.2 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 706 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 15.9 ms per loop
...and with the input sizes you provided:
# Benchmark at the question's original input sizes.
nech = 24 * 30 * 24
n = 70299
f_arr = np.random.random(n)
%timeit -n1 -r1 func_OP(f_arr, nech, n)
# 1 loop, best of 1: 1h 4min 50s per loop
%timeit -n1 -r1 func(f_arr, nech, n)
# 1 loop, best of 1: 21min 14s per loop
%timeit -n1 -r1 func2(f_arr, nech, n) # only one run / loop
# 1 loop, best of 1: 13min 9s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1 loop, best of 5: 4.74 s per loop
%timeit func_nb(f_arr, nech, n)
# 1 loop, best of 5: 4 s per loop
%timeit func2_nb(f_arr, nech, n)
# 1 loop, best of 5: 1.62 s per loop
# %timeit func_pdisty(f_arr, nech, n)
# -- MEMORY ERROR --
%timeit func2_cy(f_arr, nech, n)
# 1 loop, best of 5: 2.2 s per loop