
Here is a simple C++ program that compares iterating over a 2D array in row-major versus column-major order.

#include <iostream>
#include <ctime>

using namespace std;

const int d = 10000;

// d row pointers; each row is allocated separately, so this is a
// ragged array rather than one contiguous block.
int** A = new int* [d];

int main(int argc, const char * argv[]) {
    for(int i = 0; i < d; ++i)
        A[i] = new int [d];

    // Column-major traversal: the inner loop walks down a column,
    // touching a different row (a different heap block) on every step.
    clock_t ColMajor = clock();
    for(int b = 0; b < d; ++b)
        for(int a = 0; a < d; ++a)
            A[a][b]++;
    double col = static_cast<double>(clock() - ColMajor) / CLOCKS_PER_SEC;

    // Row-major traversal: the inner loop walks along a single row,
    // accessing consecutive addresses in memory.
    clock_t RowMajor = clock();
    for(int a = 0; a < d; ++a)
        for(int b = 0; b < d; ++b)
            A[a][b]++;
    double row = static_cast<double>(clock() - RowMajor) / CLOCKS_PER_SEC;

    cout << "Row Major : " << row;
    cout << "\nColumn Major : " << col << endl;

    return 0;
}

Results for different values of d:

d = 10^3:
Row Major    : 0.002431
Column Major : 0.017186

d = 10^4:
Row Major    : 0.237995
Column Major : 2.04471

d = 10^5:
Row Major    : 53.9561
Column Major : 444.339

Now the question is: why is row-major iteration faster than column-major?

Amanita
  • because in C arrays are **row major** and because of **spatial locality** of the **cache**. – bolov Nov 15 '15 at 17:21
    Possible duplicate of [Why does cache locality matter for array performance?](http://stackoverflow.com/questions/12065774/why-does-cache-locality-matter-for-array-performance) – bolov Nov 15 '15 at 17:23
  • This time it's not about branch prediction :). In both versions you have the same number of comparisons, and both times the `true`/`false` pattern is the same (i.e. a lot of `true` conditions and then a `false` one - when the index reaches the end) – bolov Nov 15 '15 at 17:26
  • Technically the above is a ragged array. – Jason Nov 15 '15 at 17:32

3 Answers


It obviously depends on the machine you're on, but very generally speaking:

  1. Your computer stores parts of your program's memory in a cache that has a much smaller latency than main memory (even when compensating for cache hit time).

  2. C arrays are stored contiguously in row-major order. This means that if you ask for element x, then element x+1 is stored in main memory at the location directly following where x is stored (see the sketch just after this list).

  3. It's typical for your computer to "pre-emptively" fill the cache with memory addresses that haven't been used yet, but that are close to memory your program has already used. Think of your computer as saying: "well, you wanted memory at address X, so I am going to assume that you will shortly want memory at X+1; therefore I will pre-emptively grab that for you and place it in your cache".
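
As a rough illustration of point 2, here is a minimal sketch (the array name and sizes are just for illustration, not from the original post) showing that in row-major storage the element at (i, j) sits at flat offset i * cols + j from the start of the array:

#include <iostream>
#include <cstddef>

int main() {
    const std::size_t rows = 4, cols = 3;
    int a[rows][cols] = {};

    // Row-major layout: &a[i][j] == &a[0][0] + (i * cols + j),
    // so a[i][j] and a[i][j+1] are adjacent in memory.
    int* base = &a[0][0];
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            std::cout << "a[" << i << "][" << j << "] is at flat offset "
                      << (&a[i][j] - base) << '\n';   // prints i * cols + j
    return 0;
}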

When you enumerate your array in row-major order, you're enumerating it in the same order in which it's stored in memory, and your machine has already taken the liberty of pre-loading those addresses into the cache because it guessed you would want them. You therefore achieve a higher rate of cache hits. When you enumerate the array in a non-contiguous manner, your machine likely won't predict your memory access pattern, so it can't pre-emptively pull those addresses into the cache; you get fewer cache hits, and main memory has to be accessed more frequently, which is slower than your cache.
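
To see that this is purely an access-pattern effect, here is a minimal sketch (the sizes, names, and stride value are illustrative assumptions) that reads the same flat array once sequentially and once with a large stride; the work is identical, but the strided pass gets far fewer cache hits:

#include <iostream>
#include <vector>
#include <ctime>
#include <cstddef>

int main() {
    const std::size_t n = 1 << 24;      // 16M ints (64 MB), larger than cache
    const std::size_t stride = 4096;    // big jumps defeat spatial locality
    std::vector<int> v(n, 1);
    long long sum = 0;

    clock_t t = clock();
    for (std::size_t i = 0; i < n; ++i)             // sequential, stride-1
        sum += v[i];
    double sequential = static_cast<double>(clock() - t) / CLOCKS_PER_SEC;

    t = clock();
    for (std::size_t s = 0; s < stride; ++s)        // same n accesses, but
        for (std::size_t i = s; i < n; i += stride) // scattered across memory
            sum += v[i];
    double strided = static_cast<double>(clock() - t) / CLOCKS_PER_SEC;

    std::cout << "sequential: " << sequential << "s, strided: " << strided
              << "s (sum " << sum << ")\n";
    return 0;
}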

Also, this might be better suited for https://cs.stackexchange.com/ because the way your system cache behaves is implemented in hardware, and spatial locality questions seem better suited there.

David Zorychta
  • Your point (3) is a bit misleading. Modern CPUs do indeed perform pre-fetching, but in this case that isn't needed. The important factor is that the cache doesn't contain individual bytes or words, it holds chunks of adjacent memory, known as a cache-line, typically 64-bytes in size. So when address X is in the cache the CPU probably doesn't need to pre-emptively fetch X+1 because it has probably already got it (except in the case where X is the last byte in a cache-line, in which case it probably will have pre-fetched the next cache-line). – Jonathan Wakely Nov 15 '15 at 17:59
  • Slight nitpicking, but regarding point (2), column-major and row-major are identical in one dimension. The last index increases fastest in row-major, while the first index increases fastest in column-major, which are the same with one dimension. In two dimensions, `x[0][0..10]` would be laid out contiguously in memory with row-major, whereas `x[0..10][0]` would be laid out contiguously with column-major. – Jason Nov 15 '15 at 18:37

Your array is actually a ragged array (an array of separately allocated rows), so row-major storage isn't the whole story.

You're seeing better performance when the inner loop runs over the columns of a row, because each row's memory is laid out linearly: reading it sequentially is easy for the prefetcher to predict, and the pointer dereference into the second dimension is amortized, since it only has to happen once per row.

When the inner loop runs over rows instead, you incur a pointer dereference into the second dimension on every iteration. So by iterating that way you add a dereference per element, and aside from its intrinsic cost, the resulting access pattern is bad for prefetching.
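
As a sketch of that amortization (the function names are illustrative; A and d are as in the question's code):

// Row-major order: the row pointer is loaded once per row and the
// inner loop then makes purely sequential, stride-1 accesses.
void incr_row_major(int** A, int d) {
    for (int a = 0; a < d; ++a) {
        int* row = A[a];           // one dereference, amortized over the row
        for (int b = 0; b < d; ++b)
            row[b]++;
    }
}

// Column-major order: a different row pointer A[a] must be loaded on
// every single iteration, and the writes jump between heap blocks.
void incr_col_major(int** A, int d) {
    for (int b = 0; b < d; ++b)
        for (int a = 0; a < d; ++a)
            A[a][b]++;
}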

If you want a true two-dimensional array, laid out in memory using row-major ordering, you would want...

int A[1000][1000];

This lays out the memory contiguously in row-major order, instead of one array of pointers to arrays (which are not laid out contiguously). Iterating over this array in row-major order would still be faster than iterating in column-major order, because of spatial locality and prefetching.
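
For a d as large as in the question, a fixed-size array won't fit on the stack; a common contiguous alternative, sketched here with illustrative names, is a single heap allocation indexed as a * d + b:

#include <vector>
#include <cstddef>

int main() {
    const std::size_t d = 10000;
    // One contiguous block of d*d ints: element (a, b) lives at flat
    // index a * d + b, which is exactly row-major order.
    std::vector<int> A(d * d, 0);

    for (std::size_t a = 0; a < d; ++a)      // row-major traversal:
        for (std::size_t b = 0; b < d; ++b)  // stride-1 through memory
            A[a * d + b]++;
    return 0;
}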

Jason

The short answer is CPU caches. Scott Meyers explains it very clearly here

metal4people