I have the following function:
void ikj(float (*a)[N], float (*b)[N], float (*c)[N], int n) {
int i, j, k;
float r;
papi_start();
for (i = 0; i < n; i++) {
for (k = 0; k < n; k++) {
r = a[i][k];
for (j = 0; j < n; j++)
c[i][j] += r * b[k][j];
}
}
papi_stop();
}
And i'm using PAPI
to count how many loads and stores i have between papi_start()
and papi_stop()
and the results i have are the following:
Loads (using PAPI_LD_INS
):
32 26781
64 205053
128 1606077
256 12714815
512 101189551
1024 807406950
2048 6450848188
Stores (using PAPI_SR_INS
):
32 8290
64 65698
128 524578
256 4194850
512 33555490
1024 268437701
2048 2147487778
where the first value is the size of N and the second value is the number of instructions.
i'm compiling with O3 and the sizes of my cache are L1 = 32KB x 2(Instruction and Data, 8-way) and L2 = 1024KB (4-way) (shared for 2 cores).. my cpu is Intel T3200 and have SSE3..
I know that O3 optimizes the code so it'll use prefetching among other features and since i'm loading contiguous addresses and my cache has a 64 bytes line size i'm loading 16 floats at once, but my calculations don't reach this values, so can anybody explain this to me?
EDIT: This are my assembly files, sorry to just throw them here, but i never worked with assembly and i can't really understand any of it:
http://dl.dropboxusercontent.com/u/878621/mmc.s http://dl.dropboxusercontent.com/u/878621/mmc_asm.s
Thanks!