58

I was experimenting with nested loops and found a significant difference in execution time depending on how the array is accessed. I can't understand what causes such a difference between the two cases.

First Example:

Execution Time: 8 seconds

for (int kk = 0; kk < 1000; kk++)
{
    sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
        {
            sum += matrix[i][j];
        }
}

Second Example:

Execution Time: 23 seconds

for (int kk = 0; kk < 1000; kk++)
{
    sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
        {
            sum += matrix[j][i];
        }
}

What causes such a big difference in execution time just from changing

matrix[i][j] 

to

matrix[j][i]

?

Kate Gregory
Massab
    Locality of reference (in C++, primitive multidimensional arrays are stored in row major format.) Also, did you optimize the code? 8 seconds (let alone 23) seems a ridiculously long execution time for a billion elements. – The Paramagnetic Croissant Oct 27 '14 at 08:29
  • 3
    @Etixpp oops, nope, it's of course a billion. I'm still not sure whether the code in question has been optimized. – The Paramagnetic Croissant Oct 27 '14 at 08:33
Depends what the matrix stores. For example, a class object with an overloaded + operator could cause such a time, but that's not the question here actually. If the author wants optimization he may post the whole code – Etixpp Oct 27 '14 at 08:38
  • 1
    How is `matrix` declared? – Kerrek SB Oct 27 '14 at 09:02
  • 2
    Its declared in static manner, double matrix[1024][1024]; – Massab Oct 27 '14 at 09:05
  • @TheParamagneticCroissant code has not been optimized, its just a simple run to check the execution time difference accessing array in different ways. – Massab Oct 27 '14 at 09:38
  • 6
    @Massab You should generally optimize the code before measuring its performance, otherwise you may get meaningless results. – The Paramagnetic Croissant Oct 27 '14 at 10:17
  • 4
    Also, to use the word "significant" in benchmarks please at least average the results from more than one run, and include the information about standard deviation. – BartoszKP Oct 27 '14 at 10:24
  • @TheParamagneticCroissant Thanx alot.... I will be careful next time. – Massab Oct 27 '14 at 10:53
  • @BartoszKP I ran it a few times and its almost the average time, it was a bit up or down on every run. – Massab Oct 27 '14 at 10:54
  • 3
    @Massab Lol, "almost" and "bit up or down" really make no difference here. "significant" is a strong word, especially in benchmarking context. I'm not saying I doubt in your results, just be precise next time :) – BartoszKP Oct 27 '14 at 11:32
  • 1
    @BartoszKP I used word "significant" for the difference in execution times of the two examples as you can see the diff of 15 sec. Whereas , I what was saying about "bit up or down" , was about different execution times of one example to get the average such as 7 or 8 or 9 sec for example one. Hope you understand. Dont mind. – Massab Oct 27 '14 at 11:42
  • 1
    @Massab Yes, I understand in what meaning you've used this word. I'm just saying that commonly in this context this word has a different, stronger and more precise meaning. It's not a big deal here, but in general you should just report the statistics. They would've taken less space than your vague (yes, still vague!) explanation in your comment of how were the measurements behaving ;) – BartoszKP Oct 27 '14 at 12:19
  • @BartoszKP Well, thats fine. I'll be careful next time. Thanx :) – Massab Oct 27 '14 at 12:50
  • 1
    Have a look at http://stackoverflow.com/questions/3928995/how-do-cache-lines-work. – Marius Bancila Oct 27 '14 at 21:44
  • 1
    @Massab Download a pdf: [operating system book Galvin](http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC4QFjAA&url=http://it325blog.files.wordpress.com/2012/09/operating-system-concepts-7-th-edition.pdf&ei=8GlnUZSFNc3rrQe4qYDgDg&usg=AFQjCNHvytXcx-0_2TSlqbsj94nEd0JyKw&bvm=bv.45107431,bs.1,d.bmk) Read chapter 9: Virtual Memory Section: 9.9.5 Program Structure. – Grijesh Chauhan Oct 28 '14 at 04:39
  • note that declaring large arrays with power-of-2 sizes will also be a source of performance degradation http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements?lq=1 http://stackoverflow.com/questions/6060985/why-is-there-huge-performance-hit-in-2048x2048-versus-2047x2047-array-multiplica http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of?lq=1 http://stackoverflow.com/questions/7905760/matrix-multiplication-small-difference-in-matrix-size-large-difference-in-timi – phuclv Oct 28 '14 at 08:32
  • try changing the array size to 1025x1025 and measure the time again, if it's faster then it's also problem related to cache usage – phuclv Oct 28 '14 at 08:36
  • How about full example? And how did you compile? – BЈовић Oct 28 '14 at 08:41
  • @Massab: Just out of curiosity (if you don't mind me asking), are you self-teaching yourself programming or learning/learned it formally in school? – user541686 Oct 28 '14 at 09:34
  • 1
    @Mehrdad Lolzzz........actually I have graduated and now am teaching a student of CS, didn't study too much in my own time but now learning alot while teaching. :D – Massab Oct 28 '14 at 10:35
  • @TheParamagneticCroissant: What the heck are you talking about? "Profile first" has been standard optimization practice for decades, because you never want to spend hours optimizing what you thought was the bottleneck only to discover it was only taking 0.2% of the runtime. Or are you talking about turning on compiler optimization? – user2357112 Oct 28 '14 at 20:11
  • @user2357112 Yeah, compiler optimizations, obviously. – The Paramagnetic Croissant Oct 28 '14 at 20:15
  • @TheParamagneticCroissant can you please tell me something about compiler optimizations?? – Massab Nov 04 '14 at 15:22
  • @Massab what should I tell about them? – The Paramagnetic Croissant Nov 04 '14 at 20:11
  • @TheParamagneticCroissant What is compiler optimizations ? Whats its purpose? How can we achieve or say, turn it on in compiler ? – Massab Nov 05 '14 at 16:10
  • @Massab If you don't know what a compiler optimization is, I'm not going to be able to explain it to you in a comment. You need to do your research. – The Paramagnetic Croissant Nov 05 '14 at 16:15

8 Answers

111

It's an issue of the memory cache.

matrix[i][j] gets far more cache hits than matrix[j][i], because matrix[i][j] accesses memory contiguously.

For example, when we access matrix[i][0], the cache loads a contiguous segment of memory containing matrix[i][0]; thus, accessing matrix[i][1], matrix[i][2], ... will benefit from the cache, since matrix[i][1], matrix[i][2], ... sit right next to matrix[i][0].

However, when we access matrix[j][0], it is far from matrix[j - 1][0] and may not have been cached, so it cannot benefit from the cache. Moreover, a matrix is normally stored as one big contiguous segment of memory, so the cache hardware can predict the sequential access pattern and keep loading the memory it will need next.

That's why matrix[i][j] is faster. This is typical of CPU-cache-based performance optimization.
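
To make the effect concrete, here is a minimal, self-contained sketch (my own illustration, not code from the question; the function names are made up, and the measured times depend heavily on the machine and on the optimization level used):

#include <chrono>
#include <cstdio>

static double matrix[1024][1024];

// Row order: the inner index walks memory contiguously.
double sum_rows() {
    double sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            sum += matrix[i][j];
    return sum;
}

// Column order: each access jumps 1024 doubles (8 KB) ahead.
double sum_cols() {
    double sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            sum += matrix[j][i];
    return sum;
}

int main() {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    volatile double a = sum_rows();
    auto t1 = clock::now();
    volatile double b = sum_cols();
    auto t2 = clock::now();
    std::printf("row order:    %lld ms\n", (long long)
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    std::printf("column order: %lld ms\n", (long long)
        std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
    (void)a; (void)b;
}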

OrangeDog
Peixu Zhu
  • 1
    @raison Can you please explain a little bit that cache issue? Because thats the topic am actually working on, so I need to understand it completely. Thanx alot. – Massab Oct 27 '14 at 08:51
  • @Massab wish it helps. – Peixu Zhu Oct 27 '14 at 10:08
  • 3
    An interesting fact I ran into about a year ago is that this is not the case when programming for a GPU in CUDA as discussed in my previous [question](http://stackoverflow.com/questions/16863196/cuda-matrix-addition-timings-by-row-vs-by-column) albeit that is beyond the scope of the current question. – Godric Seer Oct 27 '14 at 20:32
  • 1
    @GodricSeer GPU has many cores to process parallel work, and the data must be transferred from memory to GPU to perform the computing, thus, we'll focus on utilize more cores with less data transfers. – Peixu Zhu Oct 27 '14 at 23:22
  • @raison Yes, I wasn't suggesting a change to your process/answer, I just wanted to point out what I thought was a non-intuitive fact on a different architecture. – Godric Seer Oct 27 '14 at 23:25
  • 4
    The memory access pattern of this loop is extremely predictable; with prefetching there is no reason it should ever get a cache miss, should the optimizer (or a human) optimize for that. The problem is that the memory hardware is optimized to deliver contiguous memory -- a *cache line* -- rather than scattered memory, and so we are using memory inefficiently if we can't use an entire cache line at a time. –  Oct 28 '14 at 00:58
  • If you go back before cache was common it was still known that memory access was an issue with matrix row/column order - ultimately drum memory has the same behaviour of reading the next word being quicker than reading a larger offset. – Pete Kirkham Oct 29 '14 at 14:54
54

The difference in performance is caused by the caching strategy of the computer.

The two-dimensional array matrix[i][j] is represented as one long list of values in memory.

E.g. the array A[3][4] looks like:

1 1 1 1   2 2 2 2   3 3 3 3

In this example every entry of A[0][x] is set to 1, every entry of A[1][x] is set to 2, and so on.

If your first loop is applied to this matrix the order of access is this:

1 2 3 4   5 6 7 8   9 10 11 12

While the second loop's access order looks like this:

1 4 7 10  2 5 8 11  3 6 9 12

When the program accesses an element of the array it also loads subsequent elements.

E.g. if you access A[0][1], A[0][2] and A[0][3] are loaded too.

Thus the first loop has to do fewer load operations, as some elements are already in the cache when they are needed. The second loop loads entries into the cache that are not needed at the time, resulting in more load operations.
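
To make the layout concrete, here is a small sketch of my own (using the 3x4 example array from above) showing that A[i][j] and a flat view of the same memory indexed with i * 4 + j name the same element:

#include <cassert>

int main() {
    int A[3][4] = { {1, 1, 1, 1}, {2, 2, 2, 2}, {3, 3, 3, 3} };
    const int* flat = &A[0][0];   // the three rows lie back to back in memory

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            assert(A[i][j] == flat[i * 4 + j]);   // row-major: the row index scales by the row length
}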

H4kor
34

Other people have done a good job explaining why one form of your code makes more efficient use of the memory cache than the other. I'd like to add some background information you may not be aware of: you probably don't realize just how expensive main memory accesses are nowadays.

The numbers posted in this question look to be in the right ballpark to me, and I'm going to reproduce them here because they're so important:

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local Dram ~60 ns
Remote Dram ~100 ns

Note the change in units for the last two entries. Depending on exactly which model you have, this processor runs at 2.9–3.2 GHz; to make the math simpler, let's just call it 3 GHz, so one cycle is about 0.33 nanoseconds. At that rate 60 ns is roughly 180 cycles and 100 ns is roughly 300 cycles, so DRAM accesses also cost on the order of 100-300 cycles.

The point is that the CPU could have executed hundreds of instructions in the time it takes to read one cache line from main memory. This is called the memory wall. Because of it, efficient use of the memory cache is more important than any other factor in overall performance on modern CPUs.
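
As a rough back-of-the-envelope sketch (my own numbers and assumptions, not from the question: a 3 GHz clock, 64-byte cache lines, 8-byte doubles, ~60 ns per main-memory access, and no prefetching or overlap, so treat it as an order-of-magnitude estimate only):

#include <cstdio>

int main() {
    const double dram_ns     = 60.0;                  // local DRAM latency from the table above
    const double ns_per_cyc  = 1.0 / 3.0;             // ~3 GHz clock
    const long long accesses = 1024LL * 1024 * 1000;  // the question's triple loop

    // Row order: a new 64-byte line every 8 doubles; column order: a new line on (almost) every access.
    const long long row_line_loads = accesses / 8;
    const long long col_line_loads = accesses;

    std::printf("DRAM access:  ~%.0f cycles\n", dram_ns / ns_per_cyc);
    std::printf("row order:    ~%.1f s waiting on DRAM\n", row_line_loads * dram_ns * 1e-9);
    std::printf("column order: ~%.1f s waiting on DRAM\n", col_line_loads * dram_ns * 1e-9);
}

The absolute numbers from such an estimate should not be taken literally (prefetching, the DRAM row buffer, and out-of-order execution hide a lot of the latency), but it shows why the gap between the two orderings is measured in seconds rather than microseconds.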

zwol
  • 3
    And then there's a page fault... – RBerteig Oct 27 '14 at 23:32
  • Thats great info. Thanks man. +1 – Massab Oct 28 '14 at 06:27
  • While this is all true, it's (IMO) clear that memory latency isn't the only factor here: prefetching can eliminate cache misses, and the loop could be optimized to do so appropriately. In fact, the hardware prefetcher might even be able to figure much of it out without any help. Memory *bandwidth* is another relevant bottleneck at play here, because you load entire cache lines at a time, but one loop ordering only uses a fraction of the cache line before discarding it, and thus uses bandwidth inefficiently. The actual timings of the unoptimized loop are likely a combination of both. –  Oct 28 '14 at 16:15
  • @Hurkyl All of that is covered, at least by implication, in raison and H4kor's answers. – zwol Oct 28 '14 at 21:36
18

The answer depends a little bit on exactly how the matrix is defined. In a fully dynamically allocated array, you'd have:

T **matrix;
matrix = new T*[n];
for (int i = 0; i < n; i++)
{
   matrix[i] = new T[m];   // each row is a separate allocation
}

So, every matrix[j] will require a new memory lookup for the row pointer. If you do the j loop on the outside, the inner loop can re-use the pointer for matrix[j] for the whole inner loop.

If the matrix is a simple 2D array:

T matrix[n][m];

then matrix[j] simply amounts to a multiplication by 1024 * sizeof(T) - which can be done by adding 1024 * sizeof(T) to the loop index in the optimised code, so it should be relatively fast either way.

On top of that, we have cache locality factors. Caches hold "lines" of data, typically 32 to 128 bytes per line. So if your code reads address X, the cache loads the 32 to 128 bytes around X. If the NEXT thing you need is only sizeof(T) beyond the current location, it's highly likely to be in the cache already [and modern processors also detect that you are reading through memory in a loop, and pre-load the data].

In the case of j as the inner loop, you are reading a new location sizeof(T) * 1024 away on every iteration [or possibly an even greater distance if the matrix is dynamically allocated]. This means that the data being loaded will not be useful for the next iteration, because it is not within the next 32 to 128 bytes.

And finally, it's entirely possible that the first loop is more optimised, thanks to SSE instructions or similar, which allow the calculation to be run even quicker. But this is probably marginal for such a large matrix, as the performance is highly memory bound at this size.
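
A small sketch of the two layouts contrasted above (the identifiers are mine): the plain 2D array is one contiguous block indexed with pure arithmetic, while the pointer-to-pointer version needs an extra, dependent load to find each row:

#include <cstddef>

enum { N = 1024, M = 1024 };

double contiguous[N][M];   // one block of N * M doubles, rows back to back

int main() {
    // "Ragged" version: an array of row pointers, each row allocated separately.
    double** ragged = new double*[N];
    for (std::size_t i = 0; i < N; i++)
        ragged[i] = new double[M]();

    // contiguous[i][j]: address = base + (i * M + j) * sizeof(double) -- just arithmetic.
    double a = contiguous[3][5];

    // ragged[i][j]: first load the row pointer ragged[i], then load ragged[i][j]
    // -- two dependent memory accesses, and the rows need not be near each other.
    double b = ragged[3][5];

    (void)a; (void)b;
    for (std::size_t i = 0; i < N; i++) delete[] ragged[i];
    delete[] ragged;
}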

Mats Petersson
10

Memory hardware is not optimized to deliver individual addresses: instead it tends to operate on larger chunks of contiguous memory called cache lines. Every time you read one entry of your matrix, the entire cache line it lies in also gets loaded into cache along with it.

The faster loop ordering is set up to read memory in order; each time you load a cache line, you use all of the entries in that cache line. Each pass through the outer loop, you read each matrix entry only a single time.

The slower loop ordering, however, only uses a single entry from each cache line before moving on. Thus, each cache line has to get loaded multiple times, once for each matrix entry in the line. E.g. if a double is 8 bytes and a cache line is 64 bytes long, then each matrix entry ends up being read from memory eight times rather than once.


All that said, if you had turned optimizations on, you would probably see no difference: optimizers understand this phenomenon, and good ones are capable of recognizing that they can swap which loop is the inner loop and which loop is the outer loop for this particular code snippet.

(also, a good optimizer would have only done one pass through the outermost loop, because it recognizes the first 999 times through are irrelevant to the final value of sum)
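
For illustration, this is what the interchange described above looks like when written by hand (assuming the question's static double matrix[1024][1024]); whether a compiler performs it automatically depends on the compiler and on having optimizations turned on, e.g. g++ -O2 or -O3:

double matrix[1024][1024];

// Same matrix[j][i] indexing as the slow example, but with j as the outer
// loop, so the inner i loop now walks row j of the matrix contiguously.
double sum_interchanged() {
    double sum = 0;
    for (int j = 0; j < 1024; j++)
        for (int i = 0; i < 1024; i++)
            sum += matrix[j][i];
    return sum;
}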

8

The matrix is stored in memory as a vector. Accessing it the first way accesses the memory sequentially. Accessing it the second way requires jumping around memory locations. See http://en.wikipedia.org/wiki/Row-major_order

James
  • 1
    Though your explanation is right, I'm not happy about your use of the terms matrix and vector. Vectors are matrices with only one dimension (i.e. the vector (4, 5, 6) can be said to be a 1x3 matrix), the matrix in question clearly has more than one dimension, so it's not really correct to say it is being stored as a vector. You're right about the sequential memory access though, and using references is always a good thing. – Pharap Oct 28 '14 at 02:01
  • @Pharap In memory it has only one dimension. – Wlerin Oct 28 '14 at 19:37
  • @Wlerin Memory is one dimension. – Pharap Oct 28 '14 at 22:49
  • @Pharap Yes, exactly. – Wlerin Oct 29 '14 at 04:54
  • @Wlerin The point is that the mathematical terminology is wrong. Memory is all just one block of contiguous memory, but the structure being represented (a matrix) is not. – Pharap Oct 29 '14 at 19:24
  • @Pharap No, the mathematical terminology is spot on. You may wish to reread what he wrote, as he is referring to how it is stored in memory, not in concept. – Wlerin Oct 29 '14 at 20:40
5

If you access the array in [i][j] order, consecutive accesses fall into memory that is already cached, so the machine does not have to fetch it again every time. With the [j][i] order the locations you access are not in the cache, so you effectively throw the cached data away on every access, which is what causes the difference.

Etixpp
3

Based on the concept of locality of reference, a piece of code is very likely to access memory locations adjacent to ones it has just accessed. So more values are loaded into the cache than were actually asked for, which means more cache hits. Your first example exploits this well, while the code in your second example does not.

chandra_cst