I wrote a simple matrix multiplication programme to demonstrate the effects of caching and performance measurement via perf. The two functions whose relative efficiency I am comparing are sqmat_mult and sqmat_mult_efficient:
void sqmat_mult(int x, const int a[x][x], const int b[x][x], int m[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[k][j]; // access of b array is non-sequential
            }
            m[i][j] = sum;
        }
    }
}
static void sqmat_transpose(int x, int a[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = i + 1; j < x; j++) {
            int temp = a[i][j];
            a[i][j] = a[j][i];
            a[j][i] = temp;
        }
    }
}
void sqmat_mult_efficient(int x, const int a[x][x], int b[x][x], int m[x][x])
{
    sqmat_transpose(x, b);
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[j][k]; // access of b array is sequential
            }
            m[i][j] = sum;
        }
    }
    sqmat_transpose(x, b);
}
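For context, the driver is roughly along the following lines; the allocation and initialisation details here are a simplified sketch rather than my exact harness:

#include <stdlib.h>

#define DIM 2048

void sqmat_mult(int x, const int a[x][x], const int b[x][x], int m[x][x]);
void sqmat_mult_efficient(int x, const int a[x][x], int b[x][x], int m[x][x]);

int main(void)
{
    // Three heap-allocated DIM x DIM matrices, 16 MiB each for 4-byte ints.
    int (*a)[DIM] = malloc(sizeof(int[DIM][DIM]));
    int (*b)[DIM] = malloc(sizeof(int[DIM][DIM]));
    int (*m)[DIM] = malloc(sizeof(int[DIM][DIM]));
    if (!a || !b || !m)
        return 1;

    // Fill the inputs; this is the first write to every page of a and b.
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    }

    // Swap in sqmat_mult_efficient here for the second measurement.
    sqmat_mult(DIM, (const int (*)[DIM])a, (const int (*)[DIM])b, m);

    free(a);
    free(b);
    free(m);
    return 0;
}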
However, when I run perf stat on these two functions, I am confused by the "page-faults" statistic. The output for sqmat_mult is:
428374.070363 task-clock (msec) # 1.000 CPUs utilized
128 context-switches # 0.000 K/sec
127 cpu-migrations # 0.000 K/sec
12,334 page-faults # 0.029 K/sec
8,63,12,33,75,858 cycles # 2.015 GHz (83.33%)
2,89,73,31,370 stalled-cycles-frontend # 0.34% frontend cycles idle (83.33%)
7,90,36,10,13,864 stalled-cycles-backend # 91.57% backend cycles idle (33.34%)
2,24,41,64,76,049 instructions # 0.26 insn per cycle
# 3.52 stalled cycles per insn (50.01%)
8,84,30,79,219 branches # 20.643 M/sec (66.67%)
1,04,85,342 branch-misses # 0.12% of all branches (83.34%)
428.396804101 seconds time elapsed
While the output for sqmat_mult_efficient
is:
39876.726670 task-clock (msec) # 1.000 CPUs utilized
11 context-switches # 0.000 K/sec
11 cpu-migrations # 0.000 K/sec
12,334 page-faults # 0.309 K/sec
1,19,87,36,75,397 cycles # 3.006 GHz (83.33%)
49,19,07,231 stalled-cycles-frontend # 0.41% frontend cycles idle (83.33%)
50,40,53,90,525 stalled-cycles-backend # 42.05% backend cycles idle (33.35%)
2,24,10,77,52,356 instructions # 1.87 insn per cycle
# 0.22 stalled cycles per insn (50.01%)
8,75,27,87,494 branches # 219.496 M/sec (66.68%)
50,48,390 branch-misses # 0.06% of all branches (83.33%)
39.879816492 seconds time elapsed
With the transpose-based multiplication, the "backend cycles idle" is considerably reduced, as expected; this is due to the lower wait time when memory is fetched sequentially. What confuses me is that the number of "page-faults" is exactly the same. At first I thought this might be a context-switching issue, so I made the programme run with SCHED_FIFO scheduling and priority 1. The number of context switches decreased from roughly 800 to 11, but the "page-faults" remained exactly the same as before.
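The scheduling change itself was done with sched_setscheduler; a minimal sketch of that part (not my exact code) looks like this:

#include <sched.h>
#include <stdio.h>

// Switch the calling process to the SCHED_FIFO real-time policy at priority 1.
// Needs root or CAP_SYS_NICE; returns 0 on success, -1 on failure.
static int set_fifo_priority_1(void)
{
    struct sched_param param = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}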
The dimension of the matrices I am using is 2048 x 2048, so a whole matrix will not fit in a 4K page. Nor does it fit in my cache (4 MiB in my case), since the size of the three matrices (3 * 2048 * 2048 * sizeof(int) == 48 MiB) far exceeds the total available cache. Assuming that "page-faults" here refers to TLB misses or cache misses, I don't see how the two algorithms can have exactly the same number. I understand that a lot of the efficiency improvement comes from L1 cache hits, but that does not mean that RAM-to-cache transfers play no role.
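As a back-of-the-envelope check of the sizes involved (assuming 4 KiB pages), the arithmetic above works out as follows:

#include <stdio.h>

int main(void)
{
    const long dim = 2048;
    const long matrix_bytes = dim * dim * (long)sizeof(int); // 16 MiB per matrix
    const long total_bytes  = 3 * matrix_bytes;              // 48 MiB for a, b and m
    const long page_size    = 4096;                          // assumed 4 KiB pages

    printf("one matrix : %ld MiB\n", matrix_bytes >> 20);    // prints 16
    printf("all three  : %ld MiB\n", total_bytes >> 20);     // prints 48
    printf("4K pages   : %ld\n", total_bytes / page_size);   // prints 12288
    return 0;
}

Those 12,288 data pages are in the same ballpark as the 12,334 page-faults reported for both runs.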
I also did another experiment to see whether the "page-faults" scale with the array size - they do, as expected.
So my questions are -
- Am I correct in assuming that "page-faults" refers to a TLB miss or a cache miss?
- How come the "page-faults" are exactly the same for both algorithms?