What multiplication algorithm is used in practice?

Question

Naive matrix multiplication is O(n^3). One can get better asymptotic performance with more sophisticated algorithms, but most of them are not useful because of huge overhead.

My question is, what algorithm is favoured in current libraries such as BLAS? What about in accelerators like CUDA?

And what is their asymptotic complexity?

What did you find when you examined open-source implementations like OpenBLAS or Atlas? Please add results of preliminary research to the question. — njuffa, Sep 14 '22 at 20:16
naive matrix multiplications can be done with multiple dot products and easily parallelizable... I assume nowadays GPUs have SIMD instructions (up to 4D vectors) for that so they most likely exploit that instead of going for more sophisticated algorithms (which are usually not paralleizable fully or at all) if the matrix sizes are below threshold ... so having many GPU cores and SIMD for vector dot product will give huge speed boost ... — Spektre, Sep 15 '22 at 06:11
Related: [How does BLAS get such extreme performance?](https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance) and [How to write a matrix matrix product that can compete with Eigen?](https://stackoverflow.com/questions/35620853/how-to-write-a-matrix-matrix-product-that-can-compete-with-eigen/35637007#35637007) — Stef, Sep 15 '22 at 12:46

What multiplication algorithm is used in practice?

0 Answers0