72

Which of the following orderings of nested loops for iterating over a 2D array is more efficient in terms of time (cache performance)? Why?

int a[100][100];

for (int i = 0; i < 100; i++)
{
   for (int j = 0; j < 100; j++)
   {
       a[i][j] = 10;
   }
}

or

for (int i = 0; i < 100; i++)
{
   for (int j = 0; j < 100; j++)
   {
      a[j][i] = 10;
   }
}
Raedwald
Sachin Mhetre
  • there must be some difference in how the array is stored.. – Sachin Mhetre Mar 27 '12 at 10:55
  • 1
    You wrote them. Now you can make the numbers bigger and test it yourself (see the benchmark sketch after these comments). I bet that there will be no difference due to compiler optimizations, or in the worst case, the first one will be faster, because it iterates over a contiguous area of memory. – Rafał Rawicki Mar 27 '12 at 10:58
  • 6
    Small note: use "++i" instead of "i++". It is faster (although for plain integers the difference from "i++" is very small, unlike for STL iterators). – Raxillan Mar 27 '12 at 11:02
  • 12
    @Raxillan - this is no longer true in all cases with modern processors and compilers, depending on the actual language. –  Mar 27 '12 at 11:10
  • 33
    @Raxillan that's just wrong. They will be equally efficient, unless you're using a compiler from the 70's. – Luchian Grigore Mar 27 '12 at 11:10
  • @LuchianGrigore So, "i++" doesn't make a copy of "i"? – Raxillan Mar 27 '12 at 11:17
  • 10
    @Raxillan in this case, no. The optimizer is smart enough to know it doesn't need a copy. – Luchian Grigore Mar 27 '12 at 11:19
  • 5
    @Raxillan Why should it? Neither the new nor the old value is used in the same expression, and the compiler knows that. So why should it make a difference? – glglgl Mar 27 '12 at 12:13
  • Closing this as dupe, as per https://meta.stackoverflow.com/questions/380911/canonical-dupes-regarding-c-loop-ordering-and-cache-performance – Lundin Mar 08 '19 at 08:41
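
For anyone who wants to run the experiment Rafał suggests above, here is a minimal benchmark sketch (the matrix size N, the clock()-based timing, and the volatile sink are my own choices, and absolute numbers will vary by machine; compile with optimizations, e.g. gcc -O2):

    #include <stdio.h>
    #include <time.h>

    #define N 2000

    static int a[N][N];

    int main(void)
    {
        clock_t t0;
        volatile int sink;  /* discourages the optimizer from deleting the loops */

        t0 = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 10;                /* row-major traversal */
        printf("a[i][j]: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[j][i] = 10;                /* column-major traversal */
        printf("a[j][i]: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        sink = a[N - 1][N - 1];
        (void)sink;
        return 0;
    }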

10 Answers

65

The first method is slightly better, as the cells being assigned to lie next to each other.

First method:

[ ][ ][ ][ ][ ] ....
^1st assignment
   ^2nd assignment
[ ][ ][ ][ ][ ] ....
^101st assignment

Second method:

[ ][ ][ ][ ][ ] ....
^1st assignment
   ^101st assignment
[ ][ ][ ][ ][ ] ....
^2nd assignment
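
To see this layout for yourself, here is a minimal sketch (the array shape is arbitrary) that prints element addresses - consecutive j indices are sizeof(int) apart, while stepping i jumps a whole row:

    #include <stdio.h>

    int main(void)
    {
        int a[2][3];

        printf("&a[0][0] = %p\n", (void *)&a[0][0]);
        printf("&a[0][1] = %p\n", (void *)&a[0][1]);  /* + sizeof(int) */
        printf("&a[1][0] = %p\n", (void *)&a[1][0]);  /* + 3 * sizeof(int) */
        return 0;
    }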
MByD
  • so you mean accessing them this way is faster than the other way? – Sachin Mhetre Mar 27 '12 at 10:57
  • 1
    It means that you will get fewer cache misses and the processor will be better able to predict which memory is going to be accessed next. – tchap Mar 27 '12 at 11:18
  • 2
    Raymond Chen has a somewhat similar post on his blog, with pictures and a good explanation as well: http://blogs.msdn.com/b/oldnewthing/archive/2005/08/05/448073.aspx. Think of a bitmap as a bigger array. – chris Mar 27 '12 at 11:48
  • 28
    Fun benchmark: On my particular computer and compiler (i5 processor, Linux, gcc -O3), and with a much larger matrix, the first method took 2 seconds, and the second took 19 seconds. – Thomas Padron-McCarthy Mar 27 '12 at 11:54
  • 3
    Benchmarks on my computer also concluded the first method is more efficient. – Jesse Good Mar 27 '12 at 11:57
  • 1
    @ThomasPadron-McCarthy that is a very specific benchmark that has almost no meaning; in most real environments (with a more significant loop body) the difference will be much smaller or even non-existent. – KillianDS Mar 27 '12 at 12:50
  • @KillianDS: I don't agree. Check most matrix or image library implementations (or other libraries manipulating large arrays of data). They usually go to great lengths to avoid accessing the array in a non-sequential order, since cache misses can be very expensive. – Leo Mar 27 '12 at 13:47
  • 2
    @Leo: if you have a too-fast inner loop, yes; otherwise, no. The thing is that the accesses in the second case are still very predictable (strided), and except on column jumps, any modern CPU will prefetch those cache lines before you need them. – KillianDS Mar 27 '12 at 14:45
  • 1
    @KillianDS: Oh, I agree with that. The only case where the order of access will make a big difference is when the arrays are too large to fit in the cache. On the other hand, those are usually the cases where the loop body is quite small and cache misses are very noticeable. If working with large arrays, it is usually better to be safe than sorry and make the accesses as sequential as possible (unless performance is irrelevant). – Leo Mar 27 '12 at 15:58
44
  1. For array[100][100] - they are both the same, if the L1 cache is larger than 100*100*sizeof(int) == 10000*sizeof(int) == [usually] 40000 bytes. Note that on Sandy Bridge, 100*100 integers should be enough elements to see a difference, since the L1 cache is only 32k.

  2. Compilers will probably optimize this code all the same

  3. Assuming no compiler optimizations, and that the matrix does not fit in the L1 cache - the first code is better due to cache performance [usually]. Every time an element is not found in the cache - you get a cache miss - and need to go to RAM or the L2 cache [which are much slower]. Taking elements from RAM into the cache [a cache fill] is done in blocks [usually 8/16 bytes] - so in the first code, you get a miss rate of at most 1/4 [assuming 16-byte cache blocks and 4-byte ints], while in the second code it is unbounded, and can even be 1. In the second code snippet - elements that were already in the cache [inserted by the cache fill for the adjacent elements] are evicted, and you get a redundant cache miss.

    • This is closely related to the principle of locality, which is the general assumption underlying cache implementations. The first code follows this principle while the second doesn't - so the cache performance of the first will be better than that of the second.

Conclusion: For all cache implementations I am aware of - the first will not be worse than the second. They might be the same - if there is no cache at all, or if the whole array fits in the cache completely - or due to compiler optimization.
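
To make those miss-rate figures concrete (a sketch using the same assumptions - 16-byte cache lines, 4-byte ints, matrix much larger than the cache):

    16-byte line / 4-byte int = 4 ints per cache line

    First code:  a miss loads a line whose other 3 ints are used by the
                 next 3 iterations  ->  1 miss per 4 accesses  ->  rate 1/4

    Second code: consecutive accesses are 400 bytes apart (one per row),
                 so each lands in a different cache line; if that line is
                 evicted before the column wraps around, every access
                 misses  ->  rate approaches 1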

amit
  • OK... so it means the first one is always at least as efficient... we can access elements much faster... – Sachin Mhetre Mar 27 '12 at 11:07
  • 5
    @SachinMhetre: For all cache implementations I am aware of - the first will not be worse than the second. They might be the same - if there is no cache at all or the whole array fits in the cache. – amit Mar 27 '12 at 11:10
  • It's probably worth noting that there are two issues: how long it takes for memory to get from L2 cache to registers, and the *bandwidth* between L2 cache and registers. If this were just a matter of latency, then prefetching (either in software or hardware) could eliminate most of the difference between the two ways of accessing the data. The hard limit here, though, is *bandwidth*; since each memory access reads an entire cache line rather than a single int, then with your assumptions, one access pattern has to read four times as much memory overall. –  Feb 28 '15 at 13:09
  • 1
    @amit Could you please explain how you estimated the miss rates `1/4` and `1`? – sgnsajgon Jan 17 '17 at 15:45
13

This sort of micro-optimization is platform-dependent so you'll need to profile the code in order to be able to draw a reasonable conclusion.

Luchian Grigore
  • 23
    I'd vote this up if someone actually exhibited a real-world platform where the first version was slower than the second. Yes, it's micro-optimization. Yes, it probably doesn't make a noticeable difference. No, you should not waste your time rewriting your loops unless the profiler indicates that they're performance critical. _But_ if you need to choose between two equally simple, clear and valid ways to write a piece of code, _and_ you just happen to know a rule of thumb that says that one of them is at least no slower than the other, then why not choose the not-slower one? – Ilmari Karonen Mar 28 '12 at 03:41
  • 2
    @IlmariKaronen I voted up your comment. But note that it is at least language dependent. Fortran, for example, lays out the array in memory the other way around, so for Fortran, the first version will likely be slower than the second one. – fishinear Jan 19 '17 at 15:29
10

In your second snippet, the change of j in each iteration produces a pattern with low spatial locality. Remember that behind the scenes, an array reference a[y][x] computes the flat offset:

    (y * row_width) + x

Consider a simplified L1 cache that has enough space for only 50 rows of our array. For the first 50 iterations, you will pay the unavoidable cost of 50 cache misses, but then what happens? For each iteration from 50 to 99, you will still miss the cache and have to fetch from L2 (and/or RAM, etc). Then, x changes to 1 and y starts over, leading to another cache miss because the first row of your array has been evicted from the cache, and so forth.

The first snippet does not have this problem. It accesses the array in row-major order, which achieves better locality - you pay for a cache miss at most once per row (if a row of your array is not present in the cache at the time the loop starts).

That being said, this is a very architecture-dependent question, so you would have to take into consideration the specifics (L1 cache size, cache line size, etc.) to draw a conclusion. You should also measure both ways and keep track of hardware events to have concrete data to draw conclusions from.
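
(On Linux, for instance, you can count such hardware events with perf, e.g. `perf stat -e cache-references,cache-misses ./your-program` - though the exact set of available event names depends on your CPU.)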

Michael Foukarakis
6

Since C++ stores arrays in row-major order, I believe the first method is going to be a bit faster. In memory, a 2D array is laid out as a single one-dimensional array, and performance depends on whether you access it in row-major or column-major order.
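
Concretely, for int a[R][C], element a[i][j] sits at flat offset i * C + j from the start of the array, so the first loop advances that offset in steps of 1 while the second advances it in steps of C.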

Habib
4

This is a classic problem of cache locality.

Most of the time the first one is better, but I think the exact answer is: IT DEPENDS - a different architecture may give a different result.

llj098
4

The second method causes more cache misses, because the cache stores contiguous data; hence the first method is more efficient than the second.

Parag
3

In your case (filling the whole array with one value), this will be faster:

   int *p = &a[0][0];  /* flat view of the contiguous 100x100 array */
   for (int j = 0; j < 100 * 100; j++)
   {
      p[j] = 10;
   }

and you can still treat a as a two-dimensional array everywhere else.

EDIT: As Binyamin Sharet mentioned, this only works when a is one contiguous block, as it is for int a[100][100]; it would not work if a were instead declared as a pointer-to-pointer, with each row allocated separately:

int **a = new int*[100];
for(int i = 0; i < 100; i++){
    a[i] = new int[100];
}
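
With that pointer-to-pointer layout the rows live at unrelated addresses, so there is no single contiguous block to sweep - a sketch of filling (and eventually freeing) it row by row:

    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            a[i][j] = 10;  /* each row is a separate allocation */
        }
    }

    /* when done with the array: */
    for (int i = 0; i < 100; i++)
    {
        delete[] a[i];
    }
    delete[] a;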
IProblemFactory
2

In general, the better locality (noted by most responders) is only the first performance advantage of loop #1.

The second (but related) advantage is that for loops like #1, the compiler is normally capable of efficiently auto-vectorizing the code, thanks to its stride-1 memory access pattern (stride-1 means that array elements are accessed contiguously, one by one, on every iteration). On the contrary, for loops like #2, auto-vectorization will not normally work well, because there is no consecutive stride-1 iterative access to contiguous blocks of memory.

Well, my answer is general. For very simple loops exactly like #1 or #2, even more aggressive compiler optimizations could kick in (erasing any difference), and the compiler will normally also be able to auto-vectorize #2 with stride-1 over the outer loop (especially with #pragma simd or similar).
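
As a side note, if you want to see what your compiler actually did, GCC reports its vectorization decisions when compiling with -fopt-info-vec, and Clang does so with -Rpass=loop-vectorize (flag availability depends on the compiler version).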

zam
1

The first option is better, as we can store a[i] in a temporary variable inside the outer loop and then index it with j in the inner loop. In this sense it acts as a cached row pointer.
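
A sketch of what that looks like in code (an optimizing compiler will typically perform this hoisting on its own):

    for (int i = 0; i < 100; i++)
    {
        int *row = a[i];  /* row base address computed once per row */
        for (int j = 0; j < 100; j++)
        {
            row[j] = 10;
        }
    }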

Himanshu Goel