
I'm writing a C program to compare the effect of spatial locality when traversing a matrix in row-major order versus in column-major order.

However, the row-major traversal turned out to be slower, which contradicts what I've learned.

I'm running Ubuntu 16.04 (in VMware) with 2 GB of memory and 2 cores, and I compile the code with gcc 5.4.0.

I create a 2000*2000 integer matrix and use `rdtsc` to measure the total clock cycles taken by the traversal.

I also tried measuring the time with clock_gettime(), and it showed similar results.

Here is the code I'm using to test these cases:

#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

typedef enum {
    ROW_MAJOR=1,
    COLUMN_MAJOR
} traverse_type;

static inline unsigned long long read_tsc(void)
{
    unsigned low, high;
    asm volatile("rdtsc" : "=a"(low), "=d"(high));
    return ((unsigned long long)high << 32) | low;
}

void calc_matrix_traverse_clocks(traverse_type type) {
    int mat[ROW][COLUMN];
    int i, j, sum;
    unsigned long long before, after;

    if (type == ROW_MAJOR) {
        before = read_tsc();
        for (i = 0; i < ROW; i++) {
            for (j = 0; j < COLUMN; j++) {
               sum += mat[i][j];
            }
        }
        after = read_tsc();
        printf("ROW Major: %llu\n", after-before);
    } else {
        before  = read_tsc();
        for (i = 0; i < COLUMN; i++) {
            for (j = 0; j < ROW; j++) {
                sum += mat[j][i];
            }
        }
        after = read_tsc();
        printf("COLUMN Major: %llu\n", after-before);
    }
}

int main (int argc, char *argv[]) {

    calc_matrix_traverse_clocks(ROW_MAJOR);
    //calc_matrix_traverse_clocks(COLUMN_MAJOR);    

    return 0;
}

I've run the program several times. Even when I split the two cases into separate files and ran them independently, the results were consistently like this:

ROW Major: 9274244
COLUMN Major: 5856220

To make sure the row-major traversal really has spatial locality, I checked that the addresses of successive memory accesses are contiguous; in the column-major traversal they are not.

Besides, I also checked the cache-miss counts of the two versions, and the row-major method indeed shows a lower miss rate. But this time the row-major traversal is also the one with fewer measured clocks.

Here is the output of valgrind (cachegrind):

Row-major version:

ROW Major: 124149507
==10786== 
==10786== I   refs:      11,160,797
==10786== I1  misses:           955
==10786== LLi misses:           944
==10786== I1  miss rate:       0.01%
==10786== LLi miss rate:       0.01%
==10786== 
==10786== D   refs:       6,054,374  (6,041,857 rd   + 12,517 wr)
==10786== D1  misses:        65,590  (   64,992 rd   +    598 wr)
==10786== LLd misses:        65,113  (   64,554 rd   +    559 wr)
==10786== D1  miss rate:        1.1% (      1.1%     +    4.8%  )
==10786== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10786== 
==10786== LL refs:           66,545  (   65,947 rd   +    598 wr)
==10786== LL misses:         66,057  (   65,498 rd   +    559 wr)
==10786== LL miss rate:         0.4% (      0.4%     +    4.5%  )

Column-major version:

COLUMN Major: 131878026
==10793== 
==10793== I   refs:      11,160,943
==10793== I1  misses:           952
==10793== LLi misses:           943
==10793== I1  miss rate:       0.01%
==10793== LLi miss rate:       0.01%
==10793== 
==10793== D   refs:       6,054,434  (6,041,899 rd   + 12,535 wr)
==10793== D1  misses:     1,003,086  (1,002,488 rd   +    598 wr)
==10793== LLd misses:        65,104  (   64,545 rd   +    559 wr)
==10793== D1  miss rate:       16.6% (     16.6%     +    4.8%  )
==10793== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10793== 
==10793== LL refs:        1,004,038  (1,003,440 rd   +    598 wr)
==10793== LL misses:         66,047  (   65,488 rd   +    559 wr)
==10793== LL miss rate:         0.4% (      0.4%     +    4.5%  )

As shown by valgrind, the row-major version does have the lower miss rate. This confuses me, and now I'm wondering whether I've done something wrong in the code or in the measurement method.

tgudtk
  • You should provide the compile line to make your test reproducible. For performance analysis, it's recommended to use at least `-O2`. However, I have real doubts whether your test proves anything. My main concern: you operate on uninitialized data, `int mat[ROW][COLUMN];`. AFAIK, that's [U.B.](https://stackoverflow.com/a/4105123/1505939). Checking your code in [**Compiler Explorer**](https://gcc.godbolt.org/z/Tj-SIG), it becomes obvious: the code inside the loops is simply not compiled. Whatever you measure, you cannot trust it (IMHO). – Scheff's Cat Nov 03 '19 at 09:54
  • Btw. I first did the test in Compiler Explorer with gcc 9.2. Then, reading your Q again, I switched to gcc 5.4. This didn't change much - the code of loops is just missing. (I was not that sure that I could reproduce it with the older compiler but I could.) ;-) – Scheff's Cat Nov 03 '19 at 09:57
  • @Scheff, thanks for the reminder. The results above were compiled without any optimization flags: `$ gcc -o matrix matrix.c`. It seems the problem is indeed the uninitialized array in my code. After I added array initialization, I got a reasonable result: the row-major method is about twice as fast as the column-major one, which is also reflected in the cache miss rates when I observe them with valgrind. – tgudtk Nov 03 '19 at 11:08
  • Concerning cache-locality, I once had a similar experience: [SO: Multi-threading benchmarking issues](https://stackoverflow.com/a/52835213/7478597) (the last part of my A). – Scheff's Cat Nov 03 '19 at 12:59

0 Answers