
In a recent job interview I was asked to multiply two matrices, which was really easy. Then I was asked:

Imagine that when reading one value of any matrix the CPU will get you the 4 adjacent ones from the right, how can you use this fact to improve performance?

At first I thought about saving every 4 values in variables, so that instead of reading A[i][j] I could simply check the variables, but this doesn't help at all: we are still reading the values from memory, so there is no advantage...

  • It sounds like they're asking you to use SIMD, but it's not really complete: you'd need 4-wide arithmetic operations to make good use of the 4-wide loads. Was anything said about that? – harold Jul 04 '21 at 22:04
  • @harold I don't think that's what he meant, it's intern job interview –  Jul 04 '21 at 23:07
  • matrix multiply typically involves going down the rows of one and columns of the other. If you first transpose one of the matrixes you can multiply by going down the rows (or columns) of each such that memory fetches ahead help. While transposing creates a new matrix, it's an O(1) operation and gains will offset it for larger matrixes. – doug Jul 05 '21 at 01:30
  • Why did this get referred to the `What's the use of do while(0) when we define a macro? [duplicate]` question? They are unrelated. – Nathaniel Johnson Jul 05 '21 at 21:08
  • I don't have any idea how the duplicate has anything to do with the question... The actual question is, does "when reading one value of any matrix the CPU will get you the 4 adjacent ones from the right" mean it ONLY gets the four adjacent ones and not the one you ask for? Or does it mean that you get 5 values: the one you read and four more? Because that changes the answer. – Jerry Jeremiah Jul 05 '21 at 21:25
  • @doug you didn't answer my question... I'm asking how fetching ahead will help at all... –  Jul 06 '21 at 09:29
  • I think what they mean is you get 4 values for the price of one, but they all must be consecutive. – n. m. could be an AI Jul 06 '21 at 09:39
  • @doug transpose is not an O(1) operation. Fortunately it is not really needed to take advantage of this fetch speedup. You need to store matrices a bit differently though. – n. m. could be an AI Jul 06 '21 at 09:42
  • @n.1.8e9-where's-my-sharem. but when you get those 4 values you still need to save them in some variables to use them in next step, thus we are back to the old problem (We need to read them later from memory so why reading blocks helps at all) and I didn't gain anything from the read of blocks of 4 –  Jul 06 '21 at 09:46
  • This depends on context and implementation. These languages do not enforce a particular implementation of a matrix, if implemented as a 2D array you can chose yourself which dimension that is x and y. Which means that the potential to muck up the whole cache memory friendliness is considerable. Also the question seems to be fishing for aligned reads, but we can't assume anything about that without knowing the types involved. This question cannot be answered without implementation details. – Lundin Jul 06 '21 at 09:51
  • Fetching data from memory and placing it to registers takes time. That's a basic operation. Once the data is in the registers, you can access it much more quickly. I'm not quite sure what you mean by "save them in some variables". Hardware doesn't have variables. It has memory and registers. Or you can think about it as if you had a cache line of size 4. – n. m. could be an AI Jul 06 '21 at 09:53

1 Answer


There is no single way to optimise matrix multiplication; here is one. I have no idea whether your interviewers expected something similar or something entirely different.

The method does require that you store matrices in a special way (neither row-major nor column-major), but it's always the same way for all matrices, so there is no need to switch between row major and column major representations (no transposes of intermediate results).

The basic idea is to store a matrix as a 2D array of 2x2 element blocks. In other words, you arrange it such that

a[2*i+0][2*j+0]
a[2*i+0][2*j+1]
a[2*i+1][2*j+0]
a[2*i+1][2*j+1]

are contiguous in memory, so that they are fetched and stored together.

You then just multiply the matrices block-wise (a row of blocks by a column of blocks).

You will need to pad matrices of odd sizes, but for big matrices this is not terribly expensive.

n. m. could be an AI
  • Still you didn't use the fact that we are reading blocks of memory... you did left[i][k] so let's suppose k is 0 then you access left[i][0] then you access again left[i][1] which you already have. you didn't use the extra bonus in the question. –  Jul 06 '21 at 13:56
  • This answer comes not from the top of my head but from real hardware and software which I happen to be familiar with. I'm afraid I cannot explain it any better, so I will just leave this answer for the future readers. – n. m. could be an AI Jul 06 '21 at 16:27