Which has better memory access? (C++)

Question

Which version is more efficient and why? It seems that both make the same computations. The only thing I can think of is if the compiler recognizes that in (a) j does not change value and doesn't have to compute it over and over again. Any input would be great!

#define M /* some mildly large number */
double a[M*M], x[M], c[M];
int i, j;

(a) First version
for (j = 0; j < M; j++)
    for (i = 0; i < M; i++)
        c[j] += a[i+j*M]*x[i];

(b) Second version
for (i = 0; i < M; i++)
    for (j = 0; j < M; j++)
        c[j] += a[i+j*M]*x[i];

@PaulR: Genuine question - can modern compilers not spot this and swap the loop preambles? Seeing as the semantics are the same. — Lightness Races in Orbit, Jan 06 '17 at 11:49
@LightnessRacesinOrbit: yes, some compilers can do loop reordering, at least for certain simple cases such as this. — Paul R, Jan 06 '17 at 11:49
I know that (a) is faster but I don't know why. It is a question in a book. — Samu, Jan 06 '17 at 11:49
@PaulR: Ok - my suggestion to measure it makes sense then :) — Lightness Races in Orbit, Jan 06 '17 at 11:50
@Samu: If you don't know why, then you don't know that it's true. — Lightness Races in Orbit, Jan 06 '17 at 11:50
@LightnessRacesinOrbit: yes, that's the trouble with the compiler optimisation arms race - everything you thought you knew 5 years ago is now wrong. ;-) — Paul R, Jan 06 '17 at 11:50
also : [Why does the order of the loops affect performance when iterating over a 2D array?](http://stackoverflow.com/q/9936132/327083) — J..., Jan 06 '17 at 11:54
related : [How does one write code that best utilizes the CPU cache to improve performance?](http://stackoverflow.com/q/763262/327083) — J..., Jan 06 '17 at 11:54
and maybe : [What is “cache-friendly” code?](http://stackoverflow.com/q/16699247/327083) — J..., Jan 06 '17 at 11:55
If *"some mildly large number"* means that `M` elements fit in a single cache line while `M*M` elements do not, then it all comes down to the order of element access `a[i+j*M]`, which becomes a bit jumpy when `j` is incremented in the inner loop. — grek40, Jan 06 '17 at 12:05

score 5 · Accepted Answer · answered Jan 06 '17 at 11:52

This is about memory-access patterns rather than computational efficiency. In general (a) is faster because its inner loop accesses `a` with unit stride, which is much more cache-efficient than (b), whose inner loop has a stride of M elements. In the case of (a) each cache line is fully utilised, whereas with (b) it is possible that only one array element will be used from each cache line before it is evicted.
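
To make the stride difference concrete, here is a minimal sketch that computes how far apart, in bytes, two consecutive inner-loop accesses to `a` are in each version. The helper names `index_of`, `stride_a` and `stride_b` are made up for this illustration and are not part of the question's code.

```cpp
#include <cstddef>

// Flat index used by both versions of the loop: a[i + j*M].
constexpr std::size_t index_of(std::size_t i, std::size_t j, std::size_t M) {
    return i + j * M;
}

// Version (a): the inner loop increments i, so consecutive accesses
// are one double apart (unit stride).
constexpr std::size_t stride_a(std::size_t M) {
    return (index_of(1, 0, M) - index_of(0, 0, M)) * sizeof(double);
}

// Version (b): the inner loop increments j, so consecutive accesses
// are M doubles apart.
constexpr std::size_t stride_b(std::size_t M) {
    return (index_of(0, 1, M) - index_of(0, 0, M)) * sizeof(double);
}
```

Assuming 8-byte doubles and 64-byte cache lines (common, but not guaranteed), (a) uses eight consecutive elements from every line it loads, while for any M of 8 or more each inner-loop access in (b) lands on a different line.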

Having said that, some compilers can perform loop reordering optimisations, so in practice you may not see any difference if that happens. As always, you should benchmark/profile your code, rather than just guessing.
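
A rough way to measure this yourself is sketched below. The function names `version_a`, `version_b` and `time_ms` are hypothetical helpers invented for this example, and `std::vector` is used instead of the question's global arrays so that M can be chosen at run time.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// (a) Inner loop over i: a[] is read with unit stride.
void version_a(const std::vector<double>& a, const std::vector<double>& x,
               std::vector<double>& c, std::size_t M) {
    for (std::size_t j = 0; j < M; j++)
        for (std::size_t i = 0; i < M; i++)
            c[j] += a[i + j * M] * x[i];
}

// (b) Inner loop over j: a[] is read with a stride of M elements.
void version_b(const std::vector<double>& a, const std::vector<double>& x,
               std::vector<double>& c, std::size_t M) {
    for (std::size_t i = 0; i < M; i++)
        for (std::size_t j = 0; j < M; j++)
            c[j] += a[i + j * M] * x[i];
}

// Wall-clock time of a single call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, with `a`, `x` and `c` sized for M of a few thousand, compare `time_ms([&] { version_a(a, x, c, M); })` against the same call for `version_b`. Build with optimisations enabled, repeat each measurement several times, and use the contents of `c` afterwards (print a sum, say) so the compiler cannot discard the loops. If the two timings come out the same, the compiler may well have interchanged the loops for you.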

Paul R

I had never heard of unit stride. I am reading about it now on Wikipedia. Thanks for your answer :) – Samu Jan 06 '17 at 11:53
"Unit stride" effectively just means "sequentially" or "contiguously" in this context. – Paul R Jan 06 '17 at 11:54
@Samu: Literally "one step at a time". It's like picking up items in order as you walk down a supermarket aisle, rather than getting something from shelf 1, then walking down to get something from shelf 10, then walking back to shelf 2, then walking to shelf 11... In this analogy, your computer actually picked up everything from shelves 1-10 to begin with on the assumption that you could then cherry-pick whatever you wanted without doing any walking at all! And now it has to pick up everything from shelves 1-10, then everything from shelves 11-20, then everything from shelves 1-10 again... – Lightness Races in Orbit Jan 06 '17 at 11:55
