I'm currently optimizing a piece of code I have: I'm trying to speed up a bunch of matrix multiplications by reducing the size of the matrix dimensions. However, in some cases, using both NumPy and MATLAB, I'm not able to obtain the speed-ups I expect. Using MATLAB, I first defined 2 randomized matrices: bigger_mat, which is 10000x10000, and smaller_mat, which is 10000x100. I then created 2 smaller matrices by slicing bigger_mat and smaller_mat, such that I get matrix dimensions of 200x10000 (bigger_mat_sliced) and 10000x2 (smaller_mat_sliced).

% Defining full (big) array dimension 
dim_big = 100;

% Defining sliced (small) array dimension. 
dim_small = 2;

% Creating a 10000x10000 randomized array
bigger_mat = rand(dim_big^2, dim_big^2);

% Creating a 10000x100 randomized array
smaller_mat = rand(dim_big^2, dim_big);

% Slicing bigger_mat to obtain a 200x10000 array
bigger_mat_sliced = bigger_mat(1:dim_small * dim_big, :); 

% Slicing smaller_mat to obtain a 10000x2 array
smaller_mat_sliced = smaller_mat(:, 1:dim_small); 

I then measured the runtimes for the following 3 matrix multiplications:

  1. bigger_mat x smaller_mat
  2. bigger_mat_sliced x smaller_mat
  3. bigger_mat x smaller_mat_sliced
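
I timed each product with tic/toc (the "Elapsed time is ... seconds" lines below are toc's default output); a minimal sketch, with the result variables res1..res3 used purely for illustration:

% Timing each of the 3 products with tic/toc
tic; res1 = bigger_mat * smaller_mat; toc;        % #1: (10000x10000) * (10000x100)
tic; res2 = bigger_mat_sliced * smaller_mat; toc; % #2: (200x10000)   * (10000x100)
tic; res3 = bigger_mat * smaller_mat_sliced; toc; % #3: (10000x10000) * (10000x2)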

My expectations were as follows:

  • Multiplication #1 should take the longest amount of time since the unsliced (full) matrices are being multiplied

  • Multiplications #2 and #3 should take less time than #1, as in both #2 and #3 I'm multiplying a full matrix with a sliced matrix. Specifically, #2 and #3 both should require the same amount of time, and both should be 50 times faster than #1 (the sliced dimensions are scaled down by a factor of dim_big/dim_small = 100/2 = 50).
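
To put numbers on that expectation: an m x k by k x n product takes roughly 2*m*k*n floating-point operations, so #1 costs about 2 * 10000 * 10000 * 100 = 2e10 flops, while #2 (2 * 200 * 10000 * 100) and #3 (2 * 10000 * 10000 * 2) each cost about 4e8 flops, i.e. 50 times fewer.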

The timings I got were:

  1. bigger_mat x smaller_mat: Elapsed time is 0.110538 seconds
  2. bigger_mat_sliced x smaller_mat: Elapsed time is 0.002564 seconds
  3. bigger_mat x smaller_mat_sliced: Elapsed time is 0.068878 seconds

While #2 behaves as expected, with a 43x speed-up compared to #1, #3 is only 1.6x faster than #1. I tried running this test using NumPy as well, and I got timings similar to those above.

It seems to me like when multiplying 2 matrices of unequal outer dimensions, for example A(i,k)*B(k,j) where i >> j, slicing the larger of the 2 outer dimensions (i) by some factor scales down the multiplication time as expected. However, for some reason, scaling down (or slicing) the smaller dimension (j) yields barely any speed-up. I'm really having a hard time understanding these results. I tried looking up the matrix multiplication algorithms implemented in BLAS libraries, hoping to find an explanation, but I quickly found myself out of my depth.
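
One detail that might matter (though I can't confirm it): in #3 the full 10000x10000 bigger_mat still has to be read, which at 8 bytes per double is about 800 MB of memory traffic, whereas #2 only reads 200 of its rows (about 16 MB). So perhaps #3 is limited by memory access to bigger_mat rather than by the arithmetic itself.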

Lastly, is there a way to make multiplication #3 as fast as #2? Thanks!

  • Not writing an answer, because I don't remember the full story, but there was something about matrices being read per row in C and per column in Fortran or something like that, so depending on whether the big dimension is the column or the row, and on whether the vectorized operations are written in C or Fortran, you might or might not get linear scalability. Look it up and see if you can find something along these lines. – Puff Dec 27 '21 at 14:40
  • As @Puff said, this is extremely dependent on the memory layout of the matrix, mostly due to how the cache is filled from memory. Roughly, the cache needs to be completely filled with data that the CPU will actually process. If much of the data isn't useful (because the matrix is laid out inefficiently relative to the cache), it's extremely wasteful of CPU cycles. [Here's](https://stackoverflow.com/a/11421344/1354854) a good answer, with lots of helpful links. – bnaecker Dec 27 '21 at 21:55
  • Very good answer you linked. I just want to extract a single line from it that basically states that I was right in principle, but that for practical applications with modern algorithms, row vs column iteration is no longer relevant: "Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such a performance. And why it does not matter whether you use row or column major storage!" – Puff Dec 29 '21 at 15:24
  • Thank you Puff and bnaecker a lot for the useful information! Do you know if there is a way I can manipulate the matrices (or store them efficiently in memory) such that the #2 and #3 matrix multiplications will be equally fast? I unfortunately don't have the CS/Math background to answer this question solely by reading the papers which @bnaecker linked, so any insight is very much appreciated! – Abdelrahman Ahmed Dec 30 '21 at 14:07
  • It's not likely that you can find an easy way to make #2 and #3 work equally quickly. These are just very different matrix products, and the way they interact with your cache is probably both hard to understand and not worth your time. It's also completely dependent on the exact machine you're currently using. You should basically trust that libraries like BLAS are doing this as quickly as possible, and spend your time optimizing other parts of your code. – bnaecker Dec 31 '21 at 04:18
  • @bnaecker I see, that makes sense. Thanks a lot! – Abdelrahman Ahmed Dec 31 '21 at 10:27
