In a recent job interview I was asked to multiply two matrices, which was really easy. Then I was asked:
Imagine that when reading one value of any matrix, the CPU also fetches the 4 adjacent values to its right. How can you use this fact to improve performance?
At first I thought about saving every 4 values in variables, so that instead of reading A[i][j] I could simply check the variables. But this doesn't help at all, since we are still reading the values from memory, so there is no advantage...
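
To make the question concrete, here is a rough sketch of one way the hint might be used. It is not necessarily what the interviewer had in mind. It assumes square n x n matrices stored row-major in flat arrays of double, so "the 4 values to the right" are the next elements of the same row. The function names matmul_ijk and matmul_ikj are made up for this illustration. The idea is to reorder the loops so that the innermost loop moves along a row instead of down a column, which means the neighbours fetched with each read are the values needed next.

```c
#include <stddef.h>

/* Naive order (i, j, k): the inner loop walks down a column of b,
   so the neighbours fetched together with b[k*n + j] are not used
   before the loop jumps to the next row. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Reordered (i, k, j): the inner loop walks along a row of b and a
   row of c, so consecutive iterations touch adjacent elements,
   i.e. the ones assumed to arrive with the previous read. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    /* c is accumulated into, so it has to start at zero. */
    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0.0;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];  /* reused across the whole inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Both functions compute the same product; only the order in which memory is visited differs. Whether the reordered version is actually faster depends on the matrix size and the hardware, so it would need to be measured.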