Naive multiplication of two n x n matrices is O(n^3). More sophisticated algorithms achieve better asymptotic complexity, but most of them are not useful in practice because of their huge constant-factor overhead.
My question is: which algorithm is favoured in current libraries, such as implementations of BLAS? What about libraries for accelerators, such as those built on CUDA?
And what is the asymptotic complexity of the algorithms they use?
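
For concreteness, the naive algorithm I mean is the usual triple loop. Here is a minimal Python sketch of it; the function name `naive_matmul` and the multiplication counter are my own illustration, not taken from any library:

```python
def naive_matmul(a, b):
    """Multiply two square matrices given as lists of lists.

    Returns the product and the number of scalar multiplications
    performed, which is n**3 for n x n inputs.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mults = 0  # illustrative counter, only here to show the n^3 cost
    for i in range(n):          # row of a
        for j in range(n):      # column of b
            for k in range(n):  # inner (dot-product) index
                c[i][j] += a[i][k] * b[k][j]
                mults += 1
    return c, mults


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    product, mults = naive_matmul(a, b)
    print(product)  # [[19, 22], [43, 50]]
    print(mults)    # 8, i.e. 2**3
```

The three nested loops each run n times, which is where the n^3 scalar multiplications come from.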