I'm writing a C program to compare the effect of spatial locality when traversing a matrix in row-major order versus column-major order.
However, the row-major traversal turned out to be slower, which is the opposite of what I've learned.
I'm running Ubuntu 16.04 (in VMware) with 2 GB of memory and 2 cores, and I compile the code with GCC 5.4.0.
I create a 2000x2000 integer matrix and use rdtsc to measure the total clock cycles spent during the traversal.
I also tried using clock_gettime() to measure the time, and it showed a similar result.
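For reference, the clock_gettime() version just wrapped the same loops with a small helper like the one below (now_ns() is only an illustrative name here; it replaced the read_tsc() calls in the code further down):

#include <time.h>

/* Wall-clock time in nanoseconds, using the monotonic clock. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}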
Here is the code I'm using to test these cases:
#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

typedef enum {
    ROW_MAJOR = 1,
    COLUMN_MAJOR
} traverse_type;

/* Read the CPU's time-stamp counter. */
static inline unsigned long long read_tsc(void)
{
    unsigned low, high;
    asm volatile("rdtsc" : "=a"(low), "=d"(high));
    return ((unsigned long long)high << 32) | low;
}

void calc_matrix_traverse_clocks(traverse_type type) {
    int mat[ROW][COLUMN];
    int i, j, sum = 0;
    unsigned long long before, after;

    if (type == ROW_MAJOR) {
        before = read_tsc();
        /* Row-major: the inner loop walks consecutive addresses within a row. */
        for (i = 0; i < ROW; i++) {
            for (j = 0; j < COLUMN; j++) {
                sum += mat[i][j];
            }
        }
        after = read_tsc();
        printf("ROW Major: %llu\n", after - before);
    } else {
        before = read_tsc();
        /* Column-major: each access of the inner loop jumps a whole row (COLUMN ints). */
        for (i = 0; i < COLUMN; i++) {
            for (j = 0; j < ROW; j++) {
                sum += mat[j][i];
            }
        }
        after = read_tsc();
        printf("COLUMN Major: %llu\n", after - before);
    }
}

int main(int argc, char *argv[]) {
    calc_matrix_traverse_clocks(ROW_MAJOR);
    //calc_matrix_traverse_clocks(COLUMN_MAJOR);
    return 0;
}
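For completeness, I build and run it roughly like this (I don't pass any optimization flags; the file and binary names are just the ones I happen to use):

gcc -o row_major matrix.c
./row_major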
I've run the program several times. Even when I split the two cases into two separate files and run them separately, the results are always similar:
ROW Major: 9274244
COLUMN Major: 5856220
To make sure the row-major traversal really has spatial locality, I verified that the addresses of two successive memory accesses are contiguous, while in the column-major traversal they are not.
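The check was nothing more than a debug variant of the loops that prints the address touched on each access instead of timing, roughly:

/* row-major: each access is sizeof(int) bytes after the previous one */
printf("%p\n", (void *)&mat[i][j]);

/* column-major: each access jumps COLUMN * sizeof(int) bytes */
printf("%p\n", (void *)&mat[j][i]);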
In addition, I checked the number of cache misses for the two versions. The row-major version does show a lower cache miss rate, and this time the cycle count measured for the row-major traversal is also the smaller one.
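I collected the cache statistics with cachegrind; the invocation was roughly (binary names are just the ones I used for the two split files):

valgrind --tool=cachegrind ./row_major
valgrind --tool=cachegrind ./column_major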
Here are the valgrind results:
Row-major version:
ROW Major: 124149507
==10786==
==10786== I refs: 11,160,797
==10786== I1 misses: 955
==10786== LLi misses: 944
==10786== I1 miss rate: 0.01%
==10786== LLi miss rate: 0.01%
==10786==
==10786== D refs: 6,054,374 (6,041,857 rd + 12,517 wr)
==10786== D1 misses: 65,590 ( 64,992 rd + 598 wr)
==10786== LLd misses: 65,113 ( 64,554 rd + 559 wr)
==10786== D1 miss rate: 1.1% ( 1.1% + 4.8% )
==10786== LLd miss rate: 1.1% ( 1.1% + 4.5% )
==10786==
==10786== LL refs: 66,545 ( 65,947 rd + 598 wr)
==10786== LL misses: 66,057 ( 65,498 rd + 559 wr)
==10786== LL miss rate: 0.4% ( 0.4% + 4.5% )
Column-major version:
COLUMN Major: 131878026
==10793==
==10793== I refs: 11,160,943
==10793== I1 misses: 952
==10793== LLi misses: 943
==10793== I1 miss rate: 0.01%
==10793== LLi miss rate: 0.01%
==10793==
==10793== D refs: 6,054,434 (6,041,899 rd + 12,535 wr)
==10793== D1 misses: 1,003,086 (1,002,488 rd + 598 wr)
==10793== LLd misses: 65,104 ( 64,545 rd + 559 wr)
==10793== D1 miss rate: 16.6% ( 16.6% + 4.8% )
==10793== LLd miss rate: 1.1% ( 1.1% + 4.5% )
==10793==
==10793== LL refs: 1,004,038 (1,003,440 rd + 598 wr)
==10793== LL misses: 66,047 ( 65,488 rd + 559 wr)
==10793== LL miss rate: 0.4% ( 0.4% + 4.5% )
As the valgrind output shows, the row-major version does have lower miss rates. This confuses me, and I'm now wondering whether I've done something wrong in the code or in the measurement method.