Assuming that you are running code on Intel processors or compatible (AMD) you may actually want to switch to assembly language to do heavy matrix computations. Luckily, you have the Intel-IPP library that does the actual work for you using advanced processor technology and selecting what is expected to be the fastest algorithm depending on your processor.
The IPP includes all the necessary matrix computations that you'd possibly need. The only problem you may encounter is the order in which you created your matrices. You may have to re-organize the order to make it easier to use the IPP functions you'd like to use.
Note that in regard to your two code examples, the second one will be faster because you avoid the +=
operator which is a read / modify / write cycle and that's generally slow (not only that, it requires the result matrix to be all zeroes to start with whereas the second example does not require clearing the output first), although your matrices are likely to fit in the cache... but, the processors are optimized to read input data in sequence (a[0], a1, a[2], a[3], ...) and also to write that data back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplications cannot be done in sequence. But if you think of that to do your optimization, you'll achieve better results (i.e. change the order in which your matrices are saved in memory could be one of them).