
I'm trying to optimize some code that performs lots of sequential matrix operations.

I figured numpy.linalg.multi_dot (docs here) would perform all the operations in C or BLAS, and thus be way faster than doing something like arr1.dot(arr2).dot(arr3) and so on.

I was really surprised running this code in a notebook:

v1 = np.random.rand(2,2)
v2 = np.random.rand(2,2)

%%timeit
v1.dot(v2.dot(v1.dot(v2)))

The slowest run took 9.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.14 µs per loop



%%timeit

np.linalg.multi_dot([v1,v2,v1,v2])

The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.9 µs per loop

Only to find out that the same operation is about 10x slower using multi_dot.

My questions are:

  • Am I missing something? Does this make any sense?
  • Is there another way to optimize sequential matrix operations?
  • Should I expect the same behavior using Cython?
albert
Pedro Braz
  • That data is too small. multi_dot has a setup phase to optimize the ordering; of course this only pays off with bigger data. (And dot uses BLAS too, as seen by: *multi_dot chains np.dot*.) – sascha Aug 24 '17 at 03:03

1 Answer


It's because your test matrices are too small and too regular; the overhead of figuring out the fastest evaluation order can outweigh the potential performance gain.

Using the example from the documentation:

import numpy as np
from numpy.linalg import multi_dot

# Prepare some data
A = np.random.rand(10000, 100)
B = np.random.rand(100, 1000)
C = np.random.rand(1000, 5)
D = np.random.rand(5, 333)

%timeit -n 10 multi_dot([A, B, C, D])
%timeit -n 10 np.dot(np.dot(np.dot(A, B), C), D)
%timeit -n 10 A.dot(B).dot(C).dot(D)

Result:

10 loops, best of 3: 12 ms per loop
10 loops, best of 3: 62.7 ms per loop
10 loops, best of 3: 59 ms per loop

multi_dot improves performance by choosing the multiplication order that requires the fewest scalar multiplications.

In the above case, the default left-to-right order ((AB)C)D starts with the expensive 10000x100 @ 100x1000 product (10^9 scalar multiplications). Any reordering that computes BC first avoids that product entirely, since BC is only a 100x1000 @ 1000x5 multiplication, cutting the total cost dramatically.
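As a sanity check, the scalar multiplications for a couple of parenthesizations can be counted directly (a simple cost model that ignores constants and cache effects):

```python
# Multiplying an (m, k) matrix by a (k, n) matrix takes m*k*n scalar
# multiplications. Shapes as above: A (10000, 100), B (100, 1000),
# C (1000, 5), D (5, 333).

def cost(m, k, n):
    return m * k * n

# Default left-to-right order: ((AB)C)D
default = cost(10000, 100, 1000) + cost(10000, 1000, 5) + cost(10000, 5, 333)

# Compute BC first: (A(BC))D
reordered = cost(100, 1000, 5) + cost(10000, 100, 5) + cost(10000, 5, 333)

print(default)    # 1066650000
print(reordered)  # 22150000
```

So reordering cuts the work by roughly a factor of 50 for these shapes, which is consistent with the timings above.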

You can verify this by testing one such reordering:

%timeit -n 10 np.dot(A, np.dot(np.dot(B, C), D))
10 loops, best of 3: 19.2 ms per loop
Polor Beer
  • `(A, (B, C)), D)` gives the same time as `multi_dot` in this case. `np.einsum` now has a similar `optimize` option. – hpaulj Aug 24 '17 at 03:30
  • Awesome, thanks a lot. (i) How big is big enough? any clues ? (ii) Any comments on using both functions on cython ? does the logic differ ? – Pedro Braz Aug 24 '17 at 03:36
  • (i) I think the potential lies in how different the matrix dimensions are, and how much rearranging the order reduces the number of scalar multiplications. As long as you have dimension changes somewhere in the chain, it would be better to use this function than not. (ii) I'm not sure whether you can call `multi_dot` directly in cython, but in case you want to implement it from scratch, here's [a reference](http://www.geeksforgeeks.org/dynamic-programming-set-8-matrix-chain-multiplication/). – Polor Beer Aug 24 '17 at 03:56
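Following up on the last comment, the classic matrix-chain-multiplication DP (the same idea multi_dot's ordering is based on) can be sketched in a few lines. This is a minimal illustration, not multi_dot's actual implementation:

```python
# O(n^3) matrix-chain-order DP. dims encodes the shared dimensions:
# matrix i has shape (dims[i], dims[i+1]).

def matrix_chain_cost(dims):
    n = len(dims) - 1  # number of matrices in the chain
    # m[i][j] = minimum scalar multiplications to compute the product
    # of matrices i..j (0-indexed, inclusive)
    m = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):           # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            # try every split point k between i and j
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]

# Shapes from the answer: A (10000, 100), B (100, 1000), C (1000, 5), D (5, 333)
print(matrix_chain_cost([10000, 100, 1000, 5, 333]))  # 22150000
```

The same table can also be used to recover the optimal split points and emit the parenthesization itself, which is what multi_dot does internally before dispatching to np.dot.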