Recently I've discovered a case in which matrix multiplication with NumPy shows very strange performance (at least to me). To illustrate it, I've created an example of such matrices and a simple script to demonstrate the timings. Both can be downloaded from the repo; I don't include the script here because it's of little use without the data.
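For reference, a minimal sketch of the kind of timing comparison the script performs. The matrix names match the question, but the shapes and the random data here are placeholder assumptions; the real matrices come from the repo.

```python
import time
import numpy as np

# Placeholder matrices: shapes and contents are assumptions,
# standing in for the real data from the repo.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000)).astype(np.float32)
B = rng.standard_normal((1000, 1000)).astype(np.float32)

def timed(label, f):
    """Run f once and print the wall-clock time."""
    t0 = time.perf_counter()
    f()
    print(f"{label}: {time.perf_counter() - t0:.6f} s")

timed("A * B (dot)", lambda: np.dot(A, B))
timed("A * B (einsum)", lambda: np.einsum("ij,jk->ik", A, B))
timed("A * B (to float64, dot)",
      lambda: np.dot(A.astype(np.float64), B.astype(np.float64)))
```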
The script multiplies two pairs of matrices (each pair is the same in terms of shape and `dtype`; only the data differs) in different ways, using both the `dot` function and `einsum`. I've noticed several anomalies:
- The first pair (`A * B`) is multiplied much faster than the second one (`C * D`).
- When I convert all matrices to `float64`, the times become the same for both pairs: longer than it took to multiply `A * B`, but shorter than `C * D`.
- These effects remain for both `einsum` (NumPy's own implementation, as I understand) and `dot` (which uses BLAS on my machine).

For the sake of completeness, here is the output of this script on my laptop:
```
With np.dot:
A * B: 0.142910003662 s
C * D: 4.9057161808 s
A * D: 0.20524597168 s
C * B: 4.20220398903 s
A * B (to float32): 0.156805992126 s
C * D (to float32): 5.11792707443 s
A * B (to float64): 0.52608704567 s
C * D (to float64): 0.484733819962 s
A * B (to float64 to float32): 0.255760908127 s
C * D (to float64 to float32): 4.7677090168 s

With einsum:
A * B: 0.489732980728 s
C * D: 7.34477996826 s
A * D: 0.449800014496 s
C * B: 4.05954909325 s
A * B (to float32): 0.411967992783 s
C * D (to float32): 7.32073783875 s
A * B (to float64): 0.80580997467 s
C * D (to float64): 0.808521032333 s
A * B (to float64 to float32): 0.414498090744 s
C * D (to float64 to float32): 7.32472801208 s
```
How can these results be explained, and how can I make `C * D` multiply as fast as `A * B`?