
I have code that needs to perform many matrix multiplications. The code operates on 2D matrices of arbitrary dimension n, which in principle could be very large, making the program very slow. So far, to perform the multiplications, I have always used np.dot, as in the following example:

import numpy as np

def getV(csi, e, e2, k):
    ktrans = k.transpose()
    v = np.dot(csi, ktrans)
    v = np.dot(v, e)
    v = np.dot(v, k)
    v = np.dot(v, csi)
    v = np.dot(v, ktrans)
    e2trans = e2.transpose()
    v = np.dot(v, e2trans)
    v = np.dot(v, k)
    traceV = 2*v.trace()
    return traceV

where the output should be twice the trace of the product:

csi*ktrans*e*k*csi*ktrans*e2trans*k

(they are all matrices multiplied together). I am sure there is a faster way to compute such a long product, possibly in a single pass. Can someone explain how? I have tried, but it seems that np.dot always takes exactly two matrices per call.

johnhenry
  • This looks equivalent to `v = csi.dot(k.T).dot(e).dot(k).dot(csi).dot(k.T).dot(e2.T).dot(k)`. Can you verify? – wflynny Mar 19 '15 at 15:27
  • Also, are you looking for Numpy optimization for matrix multiplication or trying to build your own? Either way, I would look into QR decomposition or using `numpy.einsum`. – wflynny Mar 19 '15 at 15:28
  • Roughly what are the dimensions of these arrays? – hpaulj Mar 19 '15 at 15:40
  • The [forthcoming version of numpy (1.10.0)](https://github.com/numpy/numpy/blob/master/doc/release/1.10.0-notes.rst) will have [`np.linalg.multi_dot`](https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py#L2188-L2264), which will optimally compute chained dot products of multiple matrices. – ali_m Mar 19 '15 at 15:41
  • I suspect that chaining `np.dot` calls will probably be at least as fast as `np.einsum` (`np.dot` can use specialized BLAS routines for matrix-matrix multiplication). The order in which you evaluate the dot products can have a large impact on efficiency (see the example in the `multi_dot` docs that I linked above). You basically want to avoid evaluation orders that generate large intermediate arrays. – ali_m Mar 19 '15 at 16:02
  • The iteration space for an `einsum` will be prohibitively large, e.g. `n**8`. Still, the expression might reveal some optimization steps. – hpaulj Mar 19 '15 at 16:13
  • @hpaulj typically some matrices of dimension around 1024*1024 or 2048*2048 – johnhenry Mar 19 '15 at 16:14
  • @ali_m how can I get numpy 1.10.0? – johnhenry Mar 19 '15 at 16:14
  • You can download the source for the development version from the [numpy github repository](https://github.com/numpy/numpy). The official instructions for building it from source are [here](http://www.scipy.org/scipylib/building/index.html#building); I've also written a guide [here](http://stackoverflow.com/a/14391693/1461210). Bear in mind that the dev version is potentially buggy and unstable. – ali_m Mar 19 '15 at 16:28
  • For arrays of size ~1000x2000, I get that `np.einsum('ij,jk->ij', a, b)` takes 0.863s where `a.dot(b)` takes 1.16s (100 iterations, repeated 3 times). However, I'm not sure I want to construct the einsum string for the OP's multiplication. – wflynny Mar 19 '15 at 16:32
  • @wflynny `a.dot(b)` is equivalent to `np.einsum('ij,jk->ik', a, b)`, not `'ij,jk->ij'`. For `(1000, 1000)` random float64 matrices, `a.dot(b)` is about 16x faster than the equivalent `np.einsum` call on my machine. – ali_m Mar 19 '15 at 16:44
  • `2*np.einsum('ij,kj,kl,lm,mn,on,po,pi', csi, k, e, k, csi, k, e2, k)` is the naive `einsum` version. But it quickly hits `iterator size` problems. – hpaulj Mar 19 '15 at 18:25
  • @ali_m. It was a typo. Didn't copy and paste from ipython. – wflynny Mar 19 '15 at 18:34
  • I just copied the development `master/numpy/linalg/linalg.py` file, and tried a version of `getV` using `multi_dot`. When all of the arrays have the same size, it does not help. – hpaulj Mar 20 '15 at 07:24
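The `multi_dot` approach mentioned in the comments can be sketched as follows (requires NumPy 1.10 or later; the small random matrices here are stand-ins for the question's arrays):

```python
import numpy as np

# Small square matrices standing in for the question's arrays.
rng = np.random.RandomState(0)
n = 8
csi, e, e2, k = (rng.rand(n, n) for _ in range(4))

# multi_dot chooses the cheapest parenthesization of the whole chain
# automatically, instead of evaluating strictly left to right.
chain = [csi, k.T, e, k, csi, k.T, e2.T, k]
v_multi = np.linalg.multi_dot(chain)

# Equivalent explicit chain of pairwise dot products.
v_pair = csi.dot(k.T).dot(e).dot(k).dot(csi).dot(k.T).dot(e2.T).dot(k)

assert np.allclose(v_multi, v_pair)
traceV = 2 * v_multi.trace()
```

Note, as hpaulj's last comment observes, that when all matrices are square and of the same size, every parenthesization costs the same, so `multi_dot` gives no speedup in that case.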

1 Answer


Because of the properties of the trace, this computation can be rewritten as follows, reducing the number of matrix multiplications from 7 to 4:

def getV(csi, k, e, e2):
    # By cyclic invariance of the trace, trace(csi*k.T*e*k*csi*k.T*e2.T*k)
    # equals trace(temp*e*temp*e2.T) with temp = k*csi*k.T.
    temp = k.dot(csi).dot(k.T)
    # trace(M.dot(e2.T)) == (M * e2).sum(), which replaces the final
    # matrix product with a cheap elementwise multiply-and-sum.
    trace_ = (temp.dot(e).dot(temp) * e2).sum()
    return 2 * trace_

Depending on your current setup, you could also try installing a different BLAS library or computing this on the graphics card instead of the CPU.
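A quick sanity check of the rewrite against the question's chained-`dot` version, on small random matrices (the names mirror the question):

```python
import numpy as np

def getV_orig(csi, e, e2, k):
    # The question's 7-multiplication version.
    v = csi.dot(k.T).dot(e).dot(k).dot(csi).dot(k.T).dot(e2.T).dot(k)
    return 2 * v.trace()

def getV(csi, k, e, e2):
    # The 4-multiplication rewrite from this answer.
    temp = k.dot(csi).dot(k.T)
    return 2 * (temp.dot(e).dot(temp) * e2).sum()

rng = np.random.RandomState(42)
csi, e, e2, k = (rng.rand(6, 6) for _ in range(4))
assert np.isclose(getV_orig(csi, e, e2, k), getV(csi, k, e, e2))
```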

  • My time savings for this are about 4.5/7, since the `*` still counts a big calculation. I see how the symmetry of the trace lets you replace the final `dot`. I haven't worked out yet how you can reorder the `k` and `k.T`. – hpaulj Mar 19 '15 at 20:36
  • @hpaulj, the Wikipedia page mentions "the trace is invariant under cyclic permutations", i.e. `trace(ABCD) == trace(DABC)`. This makes it possible to move the multiplication by `k` that was last to the front. –  Mar 20 '15 at 03:24
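The cyclic invariance referred to in the comment above can be checked numerically with a minimal sketch on random matrices:

```python
import numpy as np

rng = np.random.RandomState(1)
A, B, C, D = (rng.rand(5, 5) for _ in range(4))

# trace(ABCD) == trace(DABC) == trace(CDAB): any cyclic rotation
# of the factors leaves the trace unchanged.
t = A.dot(B).dot(C).dot(D).trace()
assert np.isclose(t, D.dot(A).dot(B).dot(C).trace())
assert np.isclose(t, C.dot(D).dot(A).dot(B).trace())
```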