
I have a k×n matrix X and a k×k matrix A. For each column of X, I'd like to calculate the scalar

X[:, i].T.dot(A).dot(X[:, i])

(or, mathematically, Xi' * A * Xi).

Currently, I have a for loop:

out = np.empty((n,))
for i in range(n):
    out[i] = X[:, i].T.dot(A).dot(X[:, i])

but since n is large, I'd like to do this faster if possible (i.e. using some NumPy functions instead of a loop).

nneonneo

3 Answers


This seems to do it nicely:

    (X.T.dot(A) * X.T).sum(axis=1)

Edit: this is a little faster:

    np.einsum('...i,...i->...', X.T.dot(A), X.T)

Both work better if X and A are Fortran contiguous.
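Both one-liners can be checked against the question's loop on small inputs (the sizes below are arbitrary, chosen just for the check):

```python
import numpy as np

# Arbitrary small sizes for a correctness check.
k, n = 10, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((k, n))
A = rng.standard_normal((k, k))

# Baseline: the original per-column loop.
loop = np.array([X[:, i].dot(A).dot(X[:, i]) for i in range(n)])

# Vectorized: row r of X.T.dot(A) times row r of X.T, summed.
fast = (X.T.dot(A) * X.T).sum(axis=1)
ein = np.einsum('...i,...i->...', X.T.dot(A), X.T)

assert np.allclose(loop, fast)
assert np.allclose(loop, ein)
```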

IanH
  • Appears to *handily* beat my original code: for `n=10000, k=10`, my code is 76.2ms, the new code is *1.64ms*. Nice! – nneonneo Aug 30 '13 at 22:32

You can use numpy.einsum:

np.einsum('ji,jk,ki->i', x, A, x)

This will get the same result. Let's see if it is much faster:

[perfplot benchmark: timings of the loop, einsum, and dot kernels on a log-log scale]

Looks like dot is still the fastest option, particularly because it uses threaded BLAS, as opposed to einsum, which runs on one core.

import perfplot
import numpy as np


def setup(n):
    k = n
    x = np.random.random((k, n))
    A = np.random.random((k, k))
    return x, A


def loop(data):
    x, A = data
    n = x.shape[1]
    out = np.empty(n)
    for i in range(n):
        out[i] = x[:, i].T.dot(A).dot(x[:, i])
    return out


def einsum(data):
    x, A = data
    return np.einsum('ji,jk,ki->i', x, A, x)


def dot(data):
    x, A = data
    return (x.T.dot(A)*x.T).sum(axis=1)


perfplot.show(
    setup=setup,
    kernels=[loop, einsum, dot],
    n_range=[2**k for k in range(10)],
    logx=True,
    logy=True,
    xlabel='n, k'
    )
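For reference, the subscript string `'ji,jk,ki->i'` can be sanity-checked against the question's loop on small inputs (sizes here are arbitrary):

```python
import numpy as np

# Arbitrary small sizes; only the correctness of the index string matters.
k, n = 4, 6
rng = np.random.default_rng(42)
x = rng.random((k, n))
A = rng.random((k, k))

# einsum sums over j and k: sum_jk x[j,i] * A[j,k] * x[k,i] for each i.
via_einsum = np.einsum('ji,jk,ki->i', x, A, x)
via_loop = np.array([x[:, i].dot(A).dot(x[:, i]) for i in range(n)])

assert np.allclose(via_einsum, via_loop)
```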
Saullo G. P. Castro
  • This is considerably slower for large dimension on modern processors due to its inability to use a threaded BLAS. – Daniel Aug 30 '13 at 22:19
  • @Ophion good point, but I believe it will still be faster than the Python `for` loop... something worth checking – Saullo G. P. Castro Aug 30 '13 at 22:23
  • Python `for` loop or Cython/NumPy `for` loop does not matter; the time really isn't in the loop. – Daniel Aug 30 '13 at 22:24
  • I don't have threaded BLAS (though I should obviously get it at some point). For `n=10000`, this outperforms my original code (76.2ms vs. 1.48ms). – nneonneo Aug 30 '13 at 22:33
  • @nneonneo I'm finishing a test here, mine is BLAS with 4 cores... let's see the results... – Saullo G. P. Castro Aug 30 '13 at 22:36
  • @Ophion... you are right... I've updated the answer with some comparisons against an optimized BLAS, and for high `n`, the `numpy.einsum` is much slower – Saullo G. P. Castro Aug 30 '13 at 22:40
  • You should create the arrays outside the function to avoid the impact of `random`. – nneonneo Aug 30 '13 at 22:41
  • @nneonneo but that would not change the results significantly – Saullo G. P. Castro Aug 30 '13 at 22:42
  • Just to confirm some results, I use MKL and it is definitely slower on my machine too. – IanH Aug 30 '13 at 22:43
  • Hm, you may be right. Thanks for the `einsum` link; it's nice to know what it can do. Too bad it is not the fastest solution. (Very surprising that it is faster than @IanH's solution for `n=10000, k=10` on one core, though) – nneonneo Aug 30 '13 at 22:43
  • @nneonneo [check this question](http://stackoverflow.com/q/18365073/832621) where it is shown many cases where `einsum` outperforms... it can be very useful and faster in some cases – Saullo G. P. Castro Aug 30 '13 at 22:46
  • @nneonneo brings up a good point. When all is said and done, which version is faster will probably depend on the sizes of the arrays and the setup of the system to be used. – IanH Aug 30 '13 at 22:47
  • @IanH I think we can roughly say that when ATLAS scales to more than 1 core, your solution will be faster; otherwise the `einsum` can be faster... – Saullo G. P. Castro Aug 30 '13 at 22:49
  • @nneonneo `np.dot(x,x)` scales at roughly `n^2.8` while `np.einsum("ij,jk",x,x)` is naive and scales at `n^3`. `np.dot` will always be faster per core for large arrays. – Daniel Aug 30 '13 at 22:53
  • @SaulloG.P.Castro, Almost 10 years later, is there a better way to do it? – Royi Feb 25 '23 at 18:05

You can't do it faster unless you parallelize the whole thing: one thread per column. You'll still use loops; you can't get away from that.

Map-reduce is a nice way to look at this problem: the map step multiplies, the reduce step sums.
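In NumPy, that map/reduce view can be sketched without an explicit loop (the framing below is an illustration of this answer's idea, not code from it; sizes are arbitrary):

```python
import numpy as np

# Arbitrary small sizes for illustration.
k, n = 5, 8
rng = np.random.default_rng(1)
X = rng.standard_normal((k, n))
A = rng.standard_normal((k, k))

# Map: elementwise products X[j,i] * (A @ X)[j,i].
mapped = X * A.dot(X)
# Reduce: sum each column, giving one scalar per column of X.
out = mapped.sum(axis=0)

# Matches the question's per-column quadratic form.
loop = np.array([X[:, i].dot(A).dot(X[:, i]) for i in range(n)])
assert np.allclose(out, loop)
```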

duffymo
  • Of course I can't get faster from a complexity standpoint, but avoiding Python loops (in favour of NumPy constructs) usually provides a speedup simply by avoiding slower Python code. – nneonneo Aug 30 '13 at 21:45