
Purely for my own knowledge and understanding of code and computers, I am trying to create an array/matrix class with multiple matrix functions, which I will then use in any project that needs a matrix or array class. Most significantly, I would like to use this matrix/array class to build a neural network library, and it therefore needs to be as fast as possible.

The function I most need to be fast is the matrix product of two matrices; however, I have had little luck making this calculation fast for larger matrices.

My current method for calculating the dot product is:

Note: this code is in Python; however, if Python is not the optimal language, I can use any other.

a = [[1, 2, 3], [4, 5, 6]]
b = [[1], [2], [3]]

def dot(a, b):
    # c has as many rows as a and as many columns as b
    c = [[0 for j in range(len(b[0]))] for i in range(len(a))]

    for i in range(len(c)):
        for j in range(len(c[i])):
            # c[i][j] is the dot product of row i of a with column j of b
            t = 0
            for k in range(len(b)):
                t += a[i][k] * b[k][j]
            c[i][j] = t
    return c

print(dot(a, b))
# [[14], [32]]

I have looked into the Intel MKL (I have an Intel Core i7) and other BLAS implementations like OpenBLAS; however, I have not been able to get any of them working, and no amount of googling has helped. So my question is: what is the fastest way to calculate the dot product of two matrices? (CPU and memory usage do not matter much to me currently, although being more efficient would be nice.)

PS:

I am trying to do all of this with no external libraries (NumPy, for example, in Python).

***** UPDATE *****

I am using a Mac.

***** UPDATE 2 *****

Thank you, everyone, for all of your help; however, I am unsure how to implement these methods of calculating the dot product, as my math skills are not yet advanced enough to understand what any of it means (I am yet to start my GCSEs). I will keep these ideas in mind and experiment with them further.

Thank you again for everyone's help.

Pencilcaseman
  • I slightly retagged your question as I got the feeling those tags are a better fit; if you disagree, roll back the edit ... – Spektre Oct 30 '19 at 08:17
  • How large are the matrices, typically? Implementing an efficient matrix-matrix multiplication for larger matrices isn't that simple. If you want to benchmark a simple Cython implementation against BLAS implementations, you can benchmark np.dot() with floats -> BLAS, and with integer arrays (Cython implementation) – max9111 Oct 30 '19 at 09:22
  • @Spektre Thanks, those tags look fine to me! – Pencilcaseman Oct 30 '19 at 09:27
  • @max9111 I have already tried Cython, and although I did get better results, they were not quite satisfactory. The matrices would be anything from 5x5 to 1000x1000, though generally on the smaller end of that range – Pencilcaseman Oct 30 '19 at 09:29
  • Why not simply use np.dot, which is a wrapper for BLAS algorithms in MKL or OpenBLAS? A little modification of this code https://stackoverflow.com/a/58418715/4045774 wouldn't be too bad (a lot faster than your example), but not as fast as MKL. – max9111 Oct 30 '19 at 09:52
  • If your requirement is that your matrix multiplications be as fast as possible, then BLAS (for which Intel MKL provides one well-regarded implementation, but there are others) is the way to go. Rolling your own is / will be an interesting exercise, but if you end up with an implementation within 25% of the rate of execution of Intel MKL, you will have done very well indeed. I don't understand why you were willing to use MKL but not numpy (which probably calls a low-level BLAS implementation anyway). – High Performance Mark Oct 30 '19 at 10:03
  • I see what you mean, and to be honest I don’t know why I let myself use the MKL, but it didn’t work anyway, so maybe that is a good thing in terms of forcing me to learn about different methods? – Pencilcaseman Oct 30 '19 at 10:28

2 Answers


If possible, you can use CUDA to utilize the GPU for very fast calculations.

AbdelAziz AbdelLatef
  • I have downloaded CUDA and installed it; however, I cannot find it in my files, and the examples button on the installer does not do anything. Have I missed anything out? – Pencilcaseman Oct 30 '19 at 13:44
  • You should use it with C++. If you have Visual Studio 2019, you can start with a CUDA project. There is an example of matrix multiplication in their documents. – AbdelAziz AbdelLatef Oct 30 '19 at 15:53
  • I am on a Mac, and Visual Studio is not fully available on Mac, so what should I do to get around this? – Pencilcaseman Oct 30 '19 at 16:47
  • @Pencilcaseman CUDA is a lib, so you should link it to your project as a DLL or OBJ or whatever extension you have in that Mac programming environment of yours. If CUDA is not shipped for Mac, you can use OpenGL GLSL shaders to do more or less the same ... OpenGL should run on Mac, but you need a decent gfx driver ... of course, if you are not fluent in GL and GLSL it would be hard to do – Spektre Oct 30 '19 at 16:58
  • You can use it on a Mac as described here: https://docs.nvidia.com/cuda/cuda-installation-guide-mac-os-x/index.html . – AbdelAziz AbdelLatef Oct 30 '19 at 17:21

  1. You can use the GPU

    as AbdelAziz AbdelLatef suggested in his answer. However, this limits the usage of your lib to computers with a GPU.

  2. Parallelize the dot products for big matrices (see the sketch after this list)

  3. Use SIMD instructions

  4. Use state-of-the-art algorithms

    Some operations on big data sets can be done much faster using more advanced techniques which are too slow for small matrices ... usually involving FFT or NTT ... Matrix multiplication is a set of dot products, and a dot product is a form of convolution, so an FFT approach should be applicable, but I have never done that for matrices/vectors ...

    There are also special algorithms solely for matrices, like the Strassen algorithm (sketched after this list).

    For powers you can use exponentiation by squaring (sketched after this list); for sqr I think you can simplify even more, as some of the multiplications would be the same ...
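
To make point 2 concrete in pure Python (no external libraries), here is a minimal sketch using the standard library's multiprocessing pool; the names parallel_dot and _row_times_matrix are my own. Each worker computes independent output rows, so no locks are needed: the inputs are only read, and every result lands in a different row of the output.

from multiprocessing import Pool

def _row_times_matrix(args):
    # One unit of work: multiply a single row of a by the whole of b.
    row, b = args
    cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]

def parallel_dot(a, b, processes=4):
    # Fan the rows of a out across worker processes; pool.map preserves order.
    with Pool(processes) as pool:
        return pool.map(_row_times_matrix, [(row, b) for row in a])

if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1], [2], [3]]
    print(parallel_dot(a, b))  # [[14], [32]]

Note that b is pickled and copied to every task, so this only pays off for matrices large enough that the arithmetic dominates the process overhead.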
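
And a minimal recursive sketch of the Strassen scheme, assuming square matrices whose size is a power of two (the helper names add, sub and strassen are mine; real implementations fall back to the classical triple loop below some cutoff size, since the recursion overhead dominates for small blocks):

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each matrix into four h-by-h quadrants.
    a11 = [row[:h] for row in A[:h]]; a12 = [row[h:] for row in A[:h]]
    a21 = [row[:h] for row in A[h:]]; a22 = [row[h:] for row in A[h:]]
    b11 = [row[:h] for row in B[:h]]; b12 = [row[h:] for row in B[:h]]
    b21 = [row[:h] for row in B[h:]]; b22 = [row[h:] for row in B[h:]]
    # Seven recursive products instead of the classical eight.
    m1 = strassen(add(a11, a22), add(b11, b22))
    m2 = strassen(add(a21, a22), b11)
    m3 = strassen(a11, sub(b12, b22))
    m4 = strassen(a22, sub(b21, b11))
    m5 = strassen(add(a11, a12), b22)
    m6 = strassen(sub(a21, a11), add(b11, b12))
    m7 = strassen(sub(a12, a22), add(b21, b22))
    # Recombine the products into the quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(m6, sub(add(m1, m3), m2))
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom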
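
Exponentiation by squaring needs no maths beyond the multiply itself; this sketch reuses whatever dot routine you already have (the function name mat_pow and the identity-matrix start are my choices):

def mat_pow(A, n, dot):
    # Raise the square matrix A to the n-th power (n >= 0) using O(log n)
    # multiplications: repeatedly square A and multiply in the squares
    # that correspond to set bits of the exponent.
    size = len(A)
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    while n > 0:
        if n & 1:          # this bit of the exponent is set
            result = dot(result, A)
        A = dot(A, A)      # square for the next bit
        n >>= 1
    return result

For example, mat_pow(m, 10, dot) costs 6 multiplications instead of the 9 that repeated multiplication would take.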

Python is far from optimal, as it is slow; I would do such a thing in C++, or even combine it with asm if there is a need for extreme speed (like using SIMD instructions). IIRC you can still use C++-created libs in Python (link as a DLL, obj, ...).
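
In fact you can reach a native BLAS from Python with just the standard library's ctypes, no numpy needed. A minimal sketch, assuming macOS's bundled Accelerate framework at its usual path (the path, the name blas_dot and the hand-spelled CBLAS constants are my assumptions about the setup; cblas_dgemm itself is the standard CBLAS routine):

import ctypes
from ctypes import c_double

# Apple's Accelerate framework ships a BLAS; on Linux you would load
# e.g. "libopenblas.so" instead of this macOS-specific path.
_blas = ctypes.CDLL("/System/Library/Frameworks/Accelerate.framework/Accelerate")

CBLAS_ROW_MAJOR = 101  # CblasRowMajor
CBLAS_NO_TRANS = 111   # CblasNoTrans

def blas_dot(a, b):
    m, k, n = len(a), len(b), len(b[0])
    # Flatten the nested lists into contiguous C arrays of doubles.
    A = (c_double * (m * k))(*[x for row in a for x in row])
    B = (c_double * (k * n))(*[x for row in b for x in row])
    C = (c_double * (m * n))()
    # Computes C = 1.0 * A * B + 0.0 * C in row-major layout.
    _blas.cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
                      m, n, k, c_double(1.0), A, k, B, n,
                      c_double(0.0), C, n)
    return [[C[i * n + j] for j in range(n)] for i in range(m)]

print(blas_dot([[1, 2, 3], [4, 5, 6]], [[1], [2], [3]]))  # [[14.0], [32.0]]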

However, if you need a fast neural network, then use dedicated HW. There are neural network processing ICs out there too.

Spektre
  • How do you suggest I parallelize the dot products? Also, which language should I use for this, C/C++ or Java, as I am more familiar with Java – Pencilcaseman Oct 30 '19 at 13:48
  • @Pencilcaseman Java and Python are usually much slower than C/C++ by design. Parallelization is done with threads: you divide the dot products between the CPU cores you have. As you only read the input matrices, no sharing locks are needed, and each dot result goes into a different output matrix cell, so no lock is needed there either. On Windows, after acquiring the system affinity mask, you have info on how many and which CPUs to use, so it is usual to create a per-core worker thread with affinity set to a single unique core (it usually equalizes load) and schedule the work to them.... – Spektre Oct 30 '19 at 14:54
  • You say on Windows; however, I am on a Mac (adding this to the question now). Is there any difference? – Pencilcaseman Oct 30 '19 at 16:50
  • @Pencilcaseman Sure, the OS API function calls for affinity and threads are slightly different between OSes... – Spektre Oct 30 '19 at 16:54