Well you are doing a lot of optimizations in your answer post
. I would like to add few more (mostly tweaks). I would build upon the winner from the answer post, which seems to be numexpr
based on.
Tweak #1
First off, np.sum(X ** 2, axis = -1)
could be optimized with np.einsum
. Though this part isn't the biggest overhead, but optimization of any sort won't hurt. So, that summation could be expressed as -
X_norm = np.einsum('ij,ij->i',X,X)
Tweak #2
Secondly, we could leverage Scipy supported blas functions and if allowed use single-precision dtype for noticeable performance improvement over its double precision one. Hence, np.dot(X, X.T)
could be computed with SciPy's sgemm
like so -
sgemm(alpha=1.0, a=X, b=X, trans_b=True)
Few more tweaks on rearranging the negative sign with gamma
lets us feed more to sgemm
. Also, we would push in gamma
into the alpha
term.
Tweaked implementations
Thus, with these two optimizations, we would have two more variants (if I could put it that way) of the numexpr
method, listed below -
from scipy.linalg.blas import sgemm
def app1(X, gamma, var):
X_norm = -np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(g * (A + B + 2 * C))', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : np.dot(X, X.T),\
'g' : gamma,\
'v' : var\
})
def app2(X, gamma, var):
X_norm = -gamma*np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(A + B + C)', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
'g' : gamma,\
'v' : var\
})
Runtime test
Numexpr based one from your answer post -
def app0(X, gamma, var):
X_norm = np.sum(X ** 2, axis = -1)
return ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : np.dot(X, X.T),
'g' : gamma,
'v' : var
})
Timings and verification -
In [165]: # Setup
...: X = np.random.randn(10000, 512)
...: gamma = 0.01
...: var = 5.0
In [166]: %timeit app0(X, gamma, var)
...: %timeit app1(X, gamma, var)
...: %timeit app2(X, gamma, var)
1 loop, best of 3: 1.25 s per loop
1 loop, best of 3: 1.24 s per loop
1 loop, best of 3: 973 ms per loop
In [167]: np.allclose(app0(X, gamma, var), app1(X, gamma, var))
Out[167]: True
In [168]: np.allclose(app0(X, gamma, var), app2(X, gamma, var))
Out[168]: True