This is a hard one to answer. The best I can do is provide a possible explanation.
Switching the j and k loops will produce the same results, but it will do the actual multiplications and additions in a completely different order.
Access to a memory location can be dramatically faster or slower depending on whether that location is already in the cache. So it's possible that the "fast" loop order is "cache friendly", meaning most of the data it touches is already in the cache, whereas the slower order causes a lot of cache misses.
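To make the access-order point concrete, here is a minimal sketch. It assumes the loops in question are a triple-nested matrix multiplication over row-major data, which is a guess on my part, and the function names `matmul_ijk` and `matmul_ikj` are made up for illustration. It only shows that the two orders compute the same result while touching memory in a different pattern.

```python
def matmul_ijk(a, b, n):
    """k is the innermost loop: b[k][j] is read down a column,
    so consecutive reads of b land in different rows."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b, n):
    """j is the innermost loop: b[k][j] and c[i][j] are both read
    along a row, so consecutive accesses are neighbours in a row."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


# Small hypothetical inputs, chosen only for illustration.
n = 4
a = [[i * n + j for j in range(n)] for i in range(n)]
b = [[(i + j) % 3 for j in range(n)] for i in range(n)]

# Both loop orders produce the same matrix.
print(matmul_ijk(a, b, n) == matmul_ikj(a, b, n))
```

In pure Python the interpreter overhead will probably hide any cache effect, so treat this as an illustration of the access pattern rather than a benchmark. In a compiled language with contiguous arrays, the row-wise inner loop is the one that would tend to be cache friendly.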
This SO question deals with a similar speed difference. Although the cause in that question is very different, it does serve to illustrate how subtle these effects can be.