
I have a very large multiply-and-sum operation that I need to implement as efficiently as possible. The best method I've found so far uses bsxfun in MATLAB, where I formulate the problem as:

L = 10000;
x = rand(4,1,L+1);   % singleton 2nd dimension for broadcasting
A_k = rand(4,4,L);   % stack of 4x4 coefficient matrices
tic
for k = 2:L
    i = 2:k;
    % elementwise product, then sum over columns (dim 2) and pages (dim 3)
    x(:,1,k+1) = x(:,1,k+1)+sum(sum(bsxfun(@times,A_k(:,:,2:k),x(:,1,k+1-i)),2),3);
end
toc

Note that L will be larger in practice. Is there a faster method? It seems odd that I first have to add the singleton dimension to x and then sum over it, but I can't get it to work otherwise.
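To illustrate why the singleton dimension is needed, here is a minimal shape check (a sketch with an arbitrary small L and a fixed k, not my actual sizes):

L = 10;                % hypothetical small size, for illustration only
x = rand(4,1,L+1);
A_k = rand(4,4,L);
k = 5; i = 2:k;
size(A_k(:,:,2:k))     % 4x4x(k-1): pages of coefficient matrices
size(x(:,1,k+1-i))     % 4x1x(k-1): the singleton 2nd dim broadcasts
                       % across A_k's 4 columns inside bsxfun(@times,...)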

It's still much faster than any other method I've tried, but not fast enough for our application. I've heard rumors that the Python function numpy.einsum may be more efficient, but I wanted to ask here first before considering porting my code.

I'm using MATLAB R2017b.

  • Many times a for loop is faster than `bsxfun`. But it seems you've already tried that...? – Adiel Oct 19 '17 at 18:28
  • Matrix multiplication is faster than `bsxfun`, but this doesn't seem to easily lend itself to that – Luis Mendo Oct 19 '17 at 18:30
  • @LuisMendo I've considered transforming the problem into a sparse matrix multiplication but it's a bit daunting... You think it would be faster? – Alex Oct 19 '17 at 18:40
  • What version of Matlab are you using? – horchler Oct 19 '17 at 18:45
  • @horchler MATLAB R2017b – Alex Oct 19 '17 at 18:46
  • If an operation with `bsxfun` / implicit broadcasting can be directly transformed into a matrix product, [yes, it's faster](https://stackoverflow.com/a/23911671/2586922). But here the transformation could take too much time – Luis Mendo Oct 19 '17 at 18:49

2 Answers


Since you're using a newer version of Matlab (implicit expansion was introduced in R2016b), you might try implicit expansion/broadcasting instead of bsxfun:

x(:,1,k+1) = x(:,1,k+1)+sum(sum(A_k(:,:,2:k).*x(:,1,k-1:-1:1),3),2);

I also changed the order of summation and removed the i variable for further improvement. On my machine, and with Matlab R2017b, this was about 25% faster for L = 10000.
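If you want to reproduce the comparison, a rough timing sketch along these lines works (a sketch only; timings are machine- and version-dependent):

% Time the bsxfun version and the implicit-expansion version on the same data
L = 10000;
rng(0);
x0 = rand(4,1,L+1);
A_k = rand(4,4,L);

x = x0;
tic   % bsxfun
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1)+sum(sum(bsxfun(@times,A_k(:,:,2:k),x(:,1,k+1-i)),2),3);
end
toc

x = x0;
tic   % implicit expansion
for k = 2:L
    x(:,1,k+1) = x(:,1,k+1)+sum(sum(A_k(:,:,2:k).*x(:,1,k-1:-1:1),3),2);
end
toc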

horchler

I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k array:

B_k = sum(A_k,2);  % the sum over dim 2 only touches A_k, so do it once up front
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times,B_k(:,1,2:k),x(:,1,k+1-i)),3);
end

With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.

The second summation could also be removed by transforming times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array C_k that is B_k with the second dimension reversed, you can generate the remaining sum as x*C_k, give or take a few calls to reshape.


So after a closer look I realized that my original assessment was overly optimistic: you have multiplications in both dimensions in your remaining term, so it's not a simple matrix product. Anyway, we can rewrite that term to be the diagonal of a matrix product. This implies that we're computing a bunch of unnecessary matrix elements, but this still seems to be slightly faster than the bsxfun approach, and we can get rid of your pesky singleton dimension too:

L = 10000;
x = rand(4,L+1);              % no singleton dimension needed anymore
A_k = rand(4,4,L);
B_k = squeeze(sum(A_k,2)).';  % pre-sum over columns, then reshape to L-by-4

tic
for k = 2:L
    ii = 1:k-1;
    % only the diagonal of the 4x4 product is needed; the off-diagonal
    % elements are the "unnecessary matrix elements" mentioned above
    x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
end
toc

This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.
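As a sanity check, here is a sketch that compares the original bsxfun loop against the diag-based loop on identical data (L is kept small and A_k is scaled down so that x stays finite; with plain rand data and large L the recursion overflows to Inf, as the comments below note):

% Compare the original and the diag-based formulations
rng(0);
L = 200;
A_k = rand(4,4,L)/8;          % scaled down to keep x finite
x1 = rand(4,1,L+1);           % original layout, singleton 2nd dim
x2 = squeeze(x1);             % 4x(L+1) layout for the diag version
B_k = squeeze(sum(A_k,2)).';

for k = 2:L                   % original bsxfun formulation
    i = 2:k;
    x1(:,1,k+1) = x1(:,1,k+1)+sum(sum(bsxfun(@times,A_k(:,:,2:k),x1(:,1,k+1-i)),2),3);
end

for k = 2:L                   % diag-based formulation
    ii = 1:k-1;
    x2(:,k+1) = x2(:,k+1) + diag(x2(:,ii)*B_k(k+1-ii,:));
end

max(abs(x1(:)-x2(:)))         % absolute difference
eps(max(x2(:)))               % roundoff scale; differences should be a
                              % small multiple of this (cf. comments below)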

  • Wow! This is also about 4 times faster on my laptop. Thank you! – Alex Oct 19 '17 at 19:08
  • @Alex just make sure it's correct, I only took a cursory glance and didn't check rigorously due to all the `NaN`s and `Inf`s. I think I could get rid of your second sum which would further improve performance, I just don't have the time right now :) – Andras Deak -- Слава Україні Oct 19 '17 at 19:09
  • @Alex so I removed the second sum too; it's only slightly faster than the previous version, but it might scale better for larger problems. – Andras Deak -- Слава Україні Oct 19 '17 at 19:17
  • Faster, but I get very different outputs for `x` as `L` grows (seeding `rand` the same each time). – horchler Oct 19 '17 at 19:20
  • @AndrasDeak Great! I'm getting the same outputs for both when I seed `rand` identically (at least up to the `Inf`s). Looks like I should brush up on my linear algebra... :) – Alex Oct 19 '17 at 19:34
  • @horchler Are you sure? I forgot to clear `x` between tests the first time I compared but it seems good now. – Alex Oct 19 '17 at 19:35
  • There are definitely differences, but they appear to be due to the numerics (e.g., different order of operations) as they are on the order of `eps(max(x(:)))`. – horchler Oct 19 '17 at 19:50
  • Thanks, @horchler, for the heads-up. Hopefully it's fine considering Alex's comments, but I'll check the numbers later to be sure. – Andras Deak -- Слава Україні Oct 19 '17 at 19:56
  • @Alex also, considering the python implementation: `numpy.dot` is what's _really_ fast (it uses low-level compiled routines and utilizes multiple cores by default). While I love `einsum` due to its expressive power, it gets slower for multiple (more than 2) input arguments (although this would not be such a case). – Andras Deak -- Слава Україні Oct 19 '17 at 20:00
  • @AndrasDeak Do you think `numpy.dot` will be any faster than MATLAB? I'd like to benchmark it but I'm not sure how to formulate the problem as a dot product. – Alex Oct 19 '17 at 21:48
  • @Alex sorry, you're right, I was distracted by my original dotty notions. Well, you _can_ use `.dot` the same way I used matrix multiplication in my second example, and then you can grab the `numpy.diag` of the resulting array in the same way. I can't say if that or `einsum` or broadcasted multiplication+sum would be faster, you should time them all to see. One important difference to MATLAB is that while MATLAB is column-major, numpy is row-major, so vectorization is fast along leading dimensions in MATLAB but it's fast along trailing dimensions in numpy. – Andras Deak -- Слава Україні Oct 19 '17 at 22:22
  • @AndrasDeak Ah I see. I was hoping we might get rid of the extra calculations that we discard with `diag`. And good to know about the leading dimensions! – Alex Oct 19 '17 at 22:47