
I've recently been developing a Python script that implements a specific mathematical function, shown below:

$$ y_j = \sum_{k=0}^{L} \frac{\exp\left(-\frac{1}{v}\sum_{i=-k}^{k}(k-|i|)\,x_{j+i}\right) - \exp\left(-\frac{1}{v}\sum_{i=-k}^{k}(k+1-|i|)\,x_{j+i}\right)}{\sum_{i=-k}^{k} x_{j+i}} $$

where indexation is periodic and 1 <= j <= n. The function is relatively complex and is inspired by a previous question. The main purpose of the code is to evaluate the mathematical function for large arrays x (of size 5000 or more). Here is the current implementation of this function in Python:

import numpy as np
import sys, os

def compute_y(x, L, v):
    n = len(x)
    y = np.zeros(n)

    for k in range(L+1):
        # Generate the indices for the window around each j, with periodic wrapping
        indices = np.arange(-k, k+1)

        # Compute the weights
        weights_1 = k - np.abs(indices)
        weights_2 = k + 1 - np.abs(indices)
        weights_d = np.ones(2*k+1)

        # For each j, take the elements in the window around j and multiply by the weights
        x_matrix = np.take(x, (np.arange(n)[:, None] + indices) % n, mode='wrap')
        
        exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)
        exp_2 = np.exp(-np.sum(weights_2[None, :] * x_matrix, axis=1)/v)
        denom = np.sum(weights_d[None, :] * x_matrix, axis=1)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/denom

    return y

# Test the function
x = np.random.rand(5000)
L = len(x)//2
v = 1.4
y = compute_y(x, L, v)

The issue I'm currently facing is that this code, although functional and vectorized, is significantly slower than desired, particularly for large arrays. I believe the primary source of the slowness is the for loop over k, which generates the indices, computes the weights, and evaluates the sums and exponentials.

I am therefore looking for guidance on how to speed up this code, specifically for arrays of size 5000 or larger. I am particularly interested in methods that take advantage of NumPy vectorization to expedite the computations.

Any help would be much appreciated. Thank you in advance!

sam wolfe
  • In your given equation, where does the `j` come in? How does `j` relate to `y_j` and `term`? – jared Jul 18 '23 at 19:12
  • @jared `j` is an index of the vector `term`. Because everything is vectorized, `j` does not appear explicitly in the code. I will change `term` to `y` to make it more clear. – sam wolfe Jul 18 '23 at 19:17
  • Is there evidence that the evaluation of the function the bottle-neck of the whole optimization process? Often it is better to focus on the optimizer. I.e. have the optimizer do fewer iterations and (obviously) to provide exact gradients (finite differences is a source of problems including an excessive number of function evaluations). – Erwin Kalvelagen Jul 18 '23 at 19:17
  • @ErwinKalvelagen I am not sure. Clearly, a data-dependent choice for `L` drastically increases speed, to a fairly reasonable estimate (farther indexes contribute less, so they can potentially be ignored). However, from a pure performance standpoint, I wonder whether broadcasting or stride tricks could help, for example. The triangular-shaped data could also be used, but I am not too familiar with this. – sam wolfe Jul 18 '23 at 19:25
  • What's the precedence here? In the equation image, v looks like it's part of the exponential. In the code, dividing by v is done after exponentiation. – Nick ODell Jul 18 '23 at 19:37
  • @NickODell Good catch! That was a typo, it should be in the exponent. I will fix it. – sam wolfe Jul 18 '23 at 19:40
  • What's the purpose of `weights_d`? It appears to be just multiplying by 1. It seems like you could simplify that line to `denom = np.sum(x_matrix, axis=1)`. Not a huge difference either way - only 3% faster. – Nick ODell Jul 18 '23 at 19:42
  • How fast are you looking to get this code? Also, `x_sums = np.sum(x_matrix, axis=1)` is never used. – jared Jul 18 '23 at 19:49
  • @jared Hard to tell, just wondering if there is any improvement I could make. Thanks for pointing that out, fixed it. – sam wolfe Jul 18 '23 at 19:51
  • Using a rewritten version of your function and numba, I got it down from 800s on my machine to 170s (for size 5000). – jared Jul 18 '23 at 20:01
  • Is `L` always `len(x)//2`? Or at least, can we assume that `L` is smaller than `len(x)`? – chrslg Jul 18 '23 at 20:14
  • @chrslg considering `L` smaller increases speed, but I am looking for a `L`-independent solution. Having said that, that particular choice for `L` is to guarantee that all points contribute only once to the value at a specific index. – sam wolfe Jul 18 '23 at 20:20

3 Answers

5

Convolutions

The most important thing to notice, I think, is that this is more-or-less a convolution:

exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)

If you look at the values of x_matrix, they are the values of x at offsets -k, ..., -1, 0, 1, ..., k around each position, and they're being multiplied by a set of weights. That's a convolution.

Why do we care? This is helpful because somebody has already made fast libraries for doing convolutions.

The core idea here is to replace np.take() with three convolutions, and avoid creating the x_matrix array.

import scipy.signal

def compute_y_convolution(x, L, v):
    n = len(x)
    y = np.zeros(n)
    x3 = np.tile(x, 3)

    for k in range(L+1):
        # Generate the indices for the window around each j, with periodic wrapping
        indices = np.arange(-k, k+1)

        # Compute the weights
        weights_1 = k - np.abs(indices)
        weights_2 = k + 1 - np.abs(indices)
        weights_d = np.ones(2*k + 1)

        conv = scipy.signal.convolve(x3, weights_d, mode='same')[n:-n]
        exp_1 = np.exp(-(scipy.signal.convolve(x3, weights_1, mode='same')[n:-n])/v)
        exp_2 = np.exp(-(scipy.signal.convolve(x3, weights_2, mode='same')[n:-n])/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/conv

    return y

(Why does this create 3 copies of the x array before doing the convolution? At the edges of the array we want the window to wrap around and pick up elements from the other end, but scipy.signal.convolve has no periodic boundary mode and would treat everything beyond the edges as zero. Tiling the array three times and keeping the middle block gives the same result as a periodic convolution.)
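
To see the wrap-around trick concretely, here is a minimal sketch (using a box kernel of ones, as with weights_d, and scipy.ndimage.uniform_filter with mode='wrap' as the periodic reference): the tiled convolution matches a genuinely periodic filter.

import numpy as np
import scipy.ndimage
import scipy.signal

# A 3x-tiled copy, keeping only the middle block, should equal a
# periodic ("wrap") moving sum over the original array.
x = np.arange(5.0)
w = np.ones(3)
n = len(x)
tiled = scipy.signal.convolve(np.tile(x, 3), w, mode='same')[n:-n]
periodic = scipy.ndimage.uniform_filter(x, size=3, mode='wrap') * 3
print(np.allclose(tiled, periodic))  # True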

The convolution version works pretty well: it achieves a 141x speedup on a 5000 element array. (718 seconds vs. 5.06 seconds)

Re-using convolutions

Convolutions are pretty expensive, and we end up doing three of them every loop in the previous example. Can we do better?

Let's print out the weights used by the convolution each loop:

k=0
weights_1=array([0])
weights_2=array([1])
weights_d=array([1.])
k=1
weights_1=array([0, 1, 0])
weights_2=array([1, 2, 1])
weights_d=array([1., 1., 1.])

We can notice three things:

  • The weights in the denominator are all one, which is a uniform filter response. Scipy has a specialized function which is faster for uniform filters.
  • The value of weights_2 is equivalent to the value of weights_1 plus the uniform filter.
  • The value of weights_1 is equivalent to the value of weights_2 from the previous loop (padded with a zero at each end, which doesn't change the convolution).
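
A quick check along these lines confirms the last two identities for the first few k (trimming the zero padding on weights_1):

import numpy as np

prev_w2 = None
for k in range(4):
    i = np.arange(-k, k + 1)
    w1 = k - np.abs(i)
    w2 = k + 1 - np.abs(i)
    assert np.array_equal(w2, w1 + 1)             # weights_2 = weights_1 + uniform kernel
    if prev_w2 is not None:
        assert np.array_equal(w1[1:-1], prev_w2)  # weights_1 = previous weights_2
    prev_w2 = w2
print("identities hold")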

Using those observations, we can go from 3 to 1 convolutions.

import scipy.ndimage

def compute_y_reuse(x, L, v):
    n = len(x)
    y = np.zeros(n)

    last_exp_2_raw = np.zeros(n)
    for k in range(L+1):
        uniform = scipy.ndimage.uniform_filter(x, size=2*k + 1, mode='wrap') * (2*k + 1)
        exp_1_raw = last_exp_2_raw
        exp_1 = np.exp(-exp_1_raw/v)
        exp_2_raw = exp_1_raw + uniform
        exp_2 = np.exp(-exp_2_raw/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/uniform
        
        last_exp_2_raw = exp_2_raw

    return y

This achieves a 1550x speedup versus the original on a 5000 element array. (718 seconds vs. 0.462 seconds)

Removing last convolution

I looked into this further, and tried to remove the last convolution. Essentially, the idea is that the previous loop calculated the sum of the N closest elements, and the next loop needs the sum of the N+2 closest elements, so we can just add the two new elements at the edges of the window instead of recomputing the whole sum.

I tried to use np.roll() for this, but found it was slower than uniform_filter(), because it must copy the array. Eventually I found this thread, which let me figure out how to solve this.

Also, since exp_1_raw is the same as the previous iteration's exp_2_raw, we can re-use the np.exp() call by saving the output from that iteration.

def fast_roll_add(dst, src, shift):
    # In-place equivalent of dst += np.roll(src, shift), without allocating
    # a rolled copy of src.
    dst[shift:] += src[:-shift]
    dst[:shift] += src[-shift:]


def compute_y_noconv(x, L, v):
    n = len(x)
    y = np.zeros(n)

    last_exp_2_raw = np.zeros(n)
    last_exp_2 = np.ones(n)
    uniform = x.copy()
    for k in range(L+1):
        if k != 0:
            fast_roll_add(uniform, x, k)
            fast_roll_add(uniform, x, -k)
        exp_1_raw = last_exp_2_raw
        exp_1 = last_exp_2
        exp_2_raw = exp_1_raw + uniform / v
        exp_2 = np.exp(-exp_2_raw)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2) / uniform
        
        last_exp_2_raw = exp_2_raw
        last_exp_2 = exp_2
    return y

This achieves a 3100x speedup versus the original on a 5000 element array. (718 seconds vs. 0.225 seconds) It also no longer requires scipy as a dependency.
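
As a sanity check, something along these lines (assuming the question's original compute_y and the compute_y_noconv above are defined in the same session) should print True on a smaller test array:

# Compare the original implementation with the convolution-free version.
rng = np.random.default_rng(0)
x = rng.random(200)
L = len(x) // 2
v = 1.4
print(np.allclose(compute_y(x, L, v), compute_y_noconv(x, L, v)))  # expect True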

Nick ODell
  • To frame it as convolutions seems indeed like a good idea. Great answer, this really boosts the calculation. I am not too familiar with some of the techniques you mention here, but I will get back to you. I have plenty to read, thanks! – sam wolfe Jul 19 '23 at 13:06
  • @samwolfe I took another look at this today, and improved the final result by 30% by re-using the `np.exp()` call from the previous loop for exp_1. – Nick ODell Jul 20 '23 at 16:42
  • The improvements are astonishing. I am more than happy with this answer because it gives me great tools for further performance boosts in other problems. I'm also interested in an optimization problem involving this same expression under non-negative constraints. If I may ask, given your expertise and if you have the time, any insights into [this other question](https://stackoverflow.com/questions/76637914/optimizing-with-non-negative-constraints) would be helpful! One of the answers provided there already hints at specific attempts, but the issue has been, similarly, speed. Thank you anyway! – sam wolfe Jul 21 '23 at 16:18
4

My solution is much slower than @Nick ODell's, but it is still faster than the original code, and it's simpler and easier to read.

I pretty much just went the naive route, wrote the function using nested loops, and then used numba to speed it up. Update: Thanks to @Nin17 for suggesting I use njit(parallel=True) on compute_y as well and use nb.prange instead of range.

import numpy as np
import numba as nb

@nb.njit
def compute_yj(j, x, L, invv, n):
    res = 0
    for k in range(L+1):
        t1 = 0
        t2 = 0
        d = 0
        for i in range(-k, k+1):
            xijn = x[(j+i)%n]
            t1 += (k - np.abs(i))*xijn*invv
            t2 += (k + 1 - np.abs(i))*xijn*invv
            d += xijn
        res += (np.exp(-t1) - np.exp(-t2))/d
        
    return res

@nb.njit(parallel=True)
def compute_y(x, L, v):
    n = x.shape[0]
    y = np.zeros(n)
    invv = 1/v
        
    for j in nb.prange(n):
        y[j] = compute_yj(j, x, L, invv, n)
        
    return y

Testing:

# run a lightweight call for the compilation
x = np.random.rand(2)
L = len(x)//2
v = 1.4
compute_y(x, L, v)

# actual call
x = np.random.rand(5000)
L = len(x)//2
y = compute_y(x, L, v)

It runs in about 1.5 seconds for 1000 values and about 43 seconds for 5000 (compared to ~8 seconds and ~800 seconds for the original code). Those timings include the compilation time triggered by the small size-2 warm-up call.

jared
  • If you define the functions independently and njit both and change the ```for j in range(n)``` to ```for j in nb.prange(n)``` you can get a pretty significant speed up – Nin17 Jul 18 '23 at 21:45
  • @Nin17 I tried doing that but didn't see any significant improvement, though I might have done something wrong. `compute_y` is only called once, so unless they are calling it multiple times, the compilation of the first call will slow things down. That said, I have only started playing with numba in the past few days, so I don't fully know what I am doing with it yet. – jared Jul 18 '23 at 21:53
  • you get the speed up from being able to run the calls to ```compute_yj``` in parallel https://numba.pydata.org/numba-doc/latest/user/parallel.html – Nin17 Jul 18 '23 at 21:59
  • @Nin17 Ah, I forgot the `parallel=True` argument. That said, the speed-up will still only be realized if `compute_y` is called multiple times, since the first call will be slow. Am I misunderstanding? – jared Jul 18 '23 at 22:05
  • @Nin17 But I did test it by first doing a small call (size 2) and then doing my actual call, which makes it much faster. I will update my post to include that. – jared Jul 18 '23 at 22:06
  • With ```x.size==1000``` the first call is basically the same at ~990 ms, but for subsequent calls the parallel version is ~0.4 s faster: 85 ms vs 520 ms. – Nin17 Jul 18 '23 at 22:13
  • @Nin17 Right, so subsequent calls are faster, as expected. In my updated code you can see that I have numba compile on a small example so the larger one can run faster. In the end, the code is fast with that small compilation than my original code without `njit` on the `compute_y` function. Thanks for the suggestion. – jared Jul 18 '23 at 22:19
  • Thank you! `numba` seems extremely useful. I will keep this in mind for future problems. – sam wolfe Jul 21 '23 at 16:20
2

On my system (a small laptop), your original code with the rand() array size changed from 5000 to 1000 runs in 9 seconds. Changing it to use multiprocessing reduced that to 5 seconds. I bet your machine has more cores and so will be even faster.

The code changes are:

def compute_y(x, L, v, k):
    # Same as the original compute_y, but for a single k (no loop over k);
    # weights_d is omitted since it is all ones.
    n = len(x)
    indices = np.arange(-k, k+1)
    weights_1 = k - np.abs(indices)
    weights_2 = k + 1 - np.abs(indices)
    x_matrix = np.take(x, (np.arange(n)[:, None] + indices) % n, mode='wrap')
    exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)
    exp_2 = np.exp(-np.sum(weights_2[None, :] * x_matrix, axis=1)/v)
    denom = np.sum(x_matrix, axis=1)
    return (exp_1 - exp_2)/denom

def run_one(k):
    return compute_y(x, L, v, k)

if __name__ == '__main__':
    import concurrent.futures
    pool = concurrent.futures.ProcessPoolExecutor()
    y = sum(pool.map(run_one, range(L+1), chunksize=10))

For what it's worth, the line with np.take() seems to consume a lot of the time. If you want to optimize further I'd try attacking that.
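
For example, one way to check that is a rough sketch with the standard-library profiler (it assumes the question's original three-argument compute_y is defined in the same script):

import cProfile

import numpy as np

# Profile one full evaluation and sort by cumulative time; the np.take call
# and the exp/sum lines will show up in the per-function breakdown.
x = np.random.rand(1000)
L = len(x) // 2
v = 1.4
cProfile.run('compute_y(x, L, v)', sort='cumtime')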

John Zwinck
  • This is quite interesting. How does `chunksize` affect speed? – sam wolfe Jul 18 '23 at 20:24
  • @samwolfe: `chunksize` is empirically determined, 10 makes it a bit faster (like 10%) by not having to communicate between processes as much (vs the default of 1). I tried other values like 50 and 100 but they didn't help on my machine. On a faster machine it might be better to use slightly larger chunksize. – John Zwinck Jul 18 '23 at 20:25
  • I see. Just a brief question: `run_one` takes only `k`, assuming `x`, `L` and `v` are defined elsewhere. Could I make the definition of `run_one` general, for any `x`, `L` and `v`, and still map it over `k`? That is, I want the input to be general, but the `pool.map` to be applied only over `k`. Might be easy to solve, but I can't think of a way. – sam wolfe Jul 18 '23 at 21:00
  • @samwolfe: Try something like `pool.map(compute_y, [(x, L, v, k) for k in range(L+1)], chunksize=10)`. – John Zwinck Jul 20 '23 at 20:13