
I got this question on a C test, and I'm very curious about the performance claim it makes. I didn't know what to answer. My question is exactly the same as the test's: why did performance improve?

Suppose you have the following code, which iterates over a large (2000 by 2000) square 2D array and counts the number of non-zero elements in it. You swap the order of the two loops so that y becomes the outer loop and x the inner loop. This drastically improves the performance of your code. Why?

int total = 0;
for (int x = 0; x < side_length; x++) {
    for (int y = 0; y < side_length; y++) {
        if (array[y][x] != 0) {
            total += 1;
        }
    }
}
learner
    I don't see any array! Just because something is called "array" does not mean it is one. Provide a [mcve]. – too honest for this site May 29 '17 at 19:22
  • Because you access memory sequentially, which is more efficient for a CPU, since it can optimize memory accesses. If you invert the loops, memory accesses will be scattered, and thus less efficient. – Jazzwave06 May 29 '17 at 19:23
    It's called the "locality" property. It is cached more efficiently. – Eugene Sh. May 29 '17 at 19:27
    Possible duplicate of [Why does cache locality matter for array performance?](https://stackoverflow.com/questions/12065774/why-does-cache-locality-matter-for-array-performance) – rustyx May 29 '17 at 19:31
  • @EugeneSh. I didn't know this concept. Useful. – learner May 30 '17 at 12:59
  • Possible duplicate of [Which of these two for loops is more efficient in terms of time and cache performance](https://stackoverflow.com/questions/9888154/which-of-these-two-for-loops-is-more-efficient-in-terms-of-time-and-cache-perfor) – user694733 May 31 '17 at 07:48

2 Answers


If you have a 2000 by 2000 matrix, you have 2000 arrays of 2000 elements each. Accessing one array element by element walks memory sequentially, because each array's memory is contiguous. This is the best case, as the CPU can optimize your memory accesses.

There are two ways to iterate through a 2D array: row-first and column-first. In row-first iteration, you access memory sequentially, because you traverse each inner array completely before moving on to the next one. In column-first iteration, you access the first element of every array, then the second element of every array, and so on. Each of those accesses lands 2000 elements away from the previous one, so nearly every access misses the cache and the CPU cannot optimize them.

You can read the Wikipedia article on locality of reference for more information.

Jazzwave06

The reason is in the CPU hardware.

An array in C is a contiguous list of its elements (a two-dimensional array is a contiguous list of inner arrays).

A CPU spends the same time loading a small piece of data as it does loading data as wide as its data bus (64 bits on many modern CPUs). So many CPUs load data as wide as their data bus; some CPUs even perform a short, fast burst of sequential wide transfers. The data are loaded into a cache (a small, very fast memory close to the core), and the requested part is placed directly in a CPU register.

If a following request asks for the next data, they are immediately available from the cache, without spending time accessing main memory.

If the CPU accesses data that are randomly spread out, the cache keeps being refilled with new data and the benefit is lost.

Luca_65