Convolutions
The most important thing to notice, I think, is that this is more-or-less a convolution:
exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)
If you look at the values of x_matrix, each row holds the values of x at positions -1, 0, 1, ... around each x value, and that gets multiplied by a set of weights. That's a convolution.
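To make that concrete, here is a small sketch (mine, not from the original code) that rebuilds a wrapped x_matrix and checks that the weighted row-sum matches a convolution with the same weights; the tiling used to get periodic boundaries is explained further down.

import numpy as np
import scipy.signal

# Minimal check that the weighted row-sum over x_matrix is a convolution.
# Rebuilding x_matrix with modular indexing here is my own reconstruction.
rng = np.random.default_rng(0)
n, k = 16, 2
x = rng.random(n)

indices = np.arange(-k, k + 1)    # window offsets
weights_1 = k - np.abs(indices)   # same construction as weights_1 above

# Row j holds x at offsets j-k .. j+k, wrapping around the ends
x_matrix = x[(np.arange(n)[:, None] + indices[None, :]) % n]

row_sum = np.sum(weights_1[None, :] * x_matrix, axis=1)

# The same quantity as a convolution with periodic boundaries (via tiling)
conv = scipy.signal.convolve(np.tile(x, 3), weights_1, mode='same')[n:-n]

print(np.allclose(row_sum, conv))  # True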
Why do we care? This is helpful because somebody has already made fast libraries for doing convolutions.
The core idea here is to replace np.take() with three convolutions, and avoid creating the x_matrix array.
import scipy.signal

def compute_y_convolution(x, L, v):
    n = len(x)
    y = np.zeros(n)
    x3 = np.tile(x, 3)

    for k in range(L+1):
        # Generate the indices for the window around each j, with periodic wrapping
        indices = np.arange(-k, k+1)

        # Compute the weights
        weights_1 = k - np.abs(indices)
        weights_2 = k + 1 - np.abs(indices)
        weights_d = np.ones(2*k + 1)

        conv = scipy.signal.convolve(x3, weights_d, mode='same')[n:-n]
        exp_1 = np.exp(-(scipy.signal.convolve(x3, weights_1, mode='same')[n:-n])/v)
        exp_2 = np.exp(-(scipy.signal.convolve(x3, weights_2, mode='same')[n:-n])/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/conv

    return y
(Why does this create 3 copies of the x array before doing the convolution? At the edges of the array, we want the window to wrap around and access elements at the other end, but scipy.signal.convolve just treats those parts of the convolution as zero. Tiling the array three times and slicing out the middle copy with [n:-n] emulates the periodic boundary.)
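As a tiny illustration (my own toy numbers, not from the post), here is the difference between the zero-padded and the wrapped result for a moving sum:

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.ones(3)
n = len(x)

# Edges treated as zero: the first and last entries lose a neighbour
print(scipy.signal.convolve(x, w, mode='same'))                    # [3. 6. 9. 7.]

# Tiling and slicing out the middle copy gives the wrapped sums
print(scipy.signal.convolve(np.tile(x, 3), w, mode='same')[n:-n])  # [7. 6. 9. 8.]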
This works pretty well, and it achieves a 141x speedup on a 5000 element array. (718 seconds vs. 5.06 seconds)
Re-using convolutions
Convolutions are pretty expensive, and we end up doing three of them every loop in the previous example. Can we do better?
Let's print out the weights used by the convolution each loop:
k=0
weights_1=array([0])
weights_2=array([1])
weights_d=array([1.])
k=1
weights_1=array([0, 1, 0])
weights_2=array([1, 2, 1])
weights_d=array([1., 1., 1.])
We can notice three things:
- The weights in the denominator are all one, which is just a uniform filter. Scipy has a specialized function for uniform filters, scipy.ndimage.uniform_filter, which is faster (see the quick check after this list).
- The value of weights_2 is equivalent to the value of weights_1 plus the uniform filter.
- The value of weights_1 is equivalent to the value of weights_2 from the previous loop iteration.
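Here is the quick check of the first observation (a sketch with made-up inputs, not from the post): multiplying the output of scipy.ndimage.uniform_filter by the window size reproduces the wrapped ones-kernel convolution.

import scipy.ndimage

rng = np.random.default_rng(1)
x = rng.random(8)
n, k = len(x), 2

ones_conv = scipy.signal.convolve(np.tile(x, 3), np.ones(2*k + 1), mode='same')[n:-n]
uniform = scipy.ndimage.uniform_filter(x, size=2*k + 1, mode='wrap') * (2*k + 1)

print(np.allclose(ones_conv, uniform))  # True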
Using those observations, we can go from three convolutions per loop down to one.
import scipy.ndimage

def compute_y_reuse(x, L, v):
    n = len(x)
    y = np.zeros(n)
    last_exp_2_raw = np.zeros(n)

    for k in range(L+1):
        # Sum over the size-(2k+1) window with periodic wrapping; this replaces
        # the convolution with weights_d
        uniform = scipy.ndimage.uniform_filter(x, size=2*k + 1, mode='wrap') * (2*k + 1)

        # weights_1 this iteration equals weights_2 from the previous iteration
        exp_1_raw = last_exp_2_raw
        exp_1 = np.exp(-exp_1_raw/v)

        # weights_2 equals weights_1 plus the uniform filter
        exp_2_raw = exp_1_raw + uniform
        exp_2 = np.exp(-exp_2_raw/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/uniform

        last_exp_2_raw = exp_2_raw

    return y
This achieves a 1550x speedup versus the original on a 5000 element array. (718 seconds vs. 0.462 seconds)
Removing the last convolution
I looked into this further and tried to remove the last convolution. Essentially, the idea is that in the previous iteration we calculated the sum of the N closest elements, and in the next iteration we calculate the sum of the N+2 closest elements, so we only need to add the two elements at the very edge of the window.
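Here is the idea in a few lines of throwaway code (my own example with arbitrary values, using np.roll purely for clarity):

rng = np.random.default_rng(2)
x = rng.random(10)
k = 3

# Periodic sum over the 2k-1 closest elements, built from scratch
window_sum = sum(np.roll(x, s) for s in range(-(k - 1), k))

# Growing it to the 2k+1 closest elements only needs the two edge elements
window_sum += np.roll(x, k) + np.roll(x, -k)

# Same as building the 2k+1 window from scratch
full = sum(np.roll(x, s) for s in range(-k, k + 1))
print(np.allclose(window_sum, full))  # True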
I tried to use np.roll() for this, but found it was slower than uniform_filter(), because it must copy the array. Eventually I found this thread, which let me figure out how to solve this.
Also, since exp_1_raw is the same as the previous iteration's exp_2_raw, we can re-use the np.exp() call by saving the output from that iteration.
def fast_roll_add(dst, src, shift):
    # Equivalent to dst += np.roll(src, shift), but without allocating a copy
    dst[shift:] += src[:-shift]
    dst[:shift] += src[-shift:]

def compute_y_noconv(x, L, v):
    n = len(x)
    y = np.zeros(n)
    last_exp_2_raw = np.zeros(n)
    last_exp_2 = np.ones(n)
    uniform = x.copy()

    for k in range(L+1):
        # Grow the window sum by adding the two elements entering at the edges
        if k != 0:
            fast_roll_add(uniform, x, k)
            fast_roll_add(uniform, x, -k)

        # Re-use both the raw sum and the np.exp() output from the previous iteration
        exp_1_raw = last_exp_2_raw
        exp_1 = last_exp_2

        exp_2_raw = exp_1_raw + uniform / v
        exp_2 = np.exp(-exp_2_raw)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2) / uniform

        last_exp_2_raw = exp_2_raw
        last_exp_2 = exp_2

    return y
This achieves a 3100x speedup versus the original on a 5000 element array. (718 seconds vs. 0.225 seconds) It also no longer requires scipy as a dependency.
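As a final sanity check (my own, with arbitrary choices of L and v), all three versions should agree up to floating-point error:

rng = np.random.default_rng(3)
x = rng.random(100)
L, v = 10, 1.0

y_conv = compute_y_convolution(x, L, v)
y_reuse = compute_y_reuse(x, L, v)
y_noconv = compute_y_noconv(x, L, v)

print(np.allclose(y_conv, y_reuse), np.allclose(y_reuse, y_noconv))  # True True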