
I have the following function:

/* C += A * B for n-by-n float matrices, using the i-k-j loop order. */
void ikj(float (*a)[N], float (*b)[N], float (*c)[N], int n) {

    int i, j, k;
    float r;

    papi_start();

    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {

            r = a[i][k];   /* a[i][k] is reused across the whole inner loop */

            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];

        }
    }

    papi_stop();

}

And I'm using PAPI to count how many loads and stores occur between papi_start() and papi_stop(). The results I get are the following:

Loads (using PAPI_LD_INS):

32 26781
64 205053
128 1606077
256 12714815
512 101189551
1024 807406950
2048 6450848188

Stores (using PAPI_SR_INS):

32 8290
64 65698
128 524578
256 4194850
512 33555490
1024 268437701
2048 2147487778

where the first value is the size N and the second value is the number of instructions counted.
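For context, papi_start() and papi_stop() are just thin wrappers around PAPI's low-level API. My exact wrappers aren't shown here, and the names in the sketch below (like event_set) are only illustrative, but this is roughly what they boil down to (counting PAPI_LD_INS; PAPI_SR_INS is measured the same way in a separate run):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static int event_set = PAPI_NULL;

/* Sketch only: start counting one preset event. */
void papi_start(void) {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) abort();
    if (PAPI_create_eventset(&event_set) != PAPI_OK) abort();
    if (PAPI_add_event(event_set, PAPI_LD_INS) != PAPI_OK) abort();
    if (PAPI_start(event_set) != PAPI_OK) abort();
}

/* Sketch only: stop counting and print the result. */
void papi_stop(void) {
    long long count = 0;
    if (PAPI_stop(event_set, &count) != PAPI_OK) abort();
    printf("%lld\n", count);
}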

I'm compiling with -O3. My cache sizes are L1 = 32 KB x 2 (instruction and data, 8-way) and L2 = 1024 KB (4-way, shared between 2 cores). My CPU is an Intel T3200 and has SSE3.

I know that -O3 optimizes the code (it will use prefetching, among other things), and since I'm loading contiguous addresses and my cache has a 64-byte line size, each line brings in 16 floats at once. But my calculations don't come close to these values, so can anybody explain this to me?
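To give an idea of what I mean by "my calculations" (the exact estimate isn't important): a naive count would be 2 loads (b[k][j] and c[i][j]) and 1 store (c[i][j]) per innermost iteration, plus one load of a[i][k] per middle iteration - and, if whole 64-byte lines were what got counted, roughly 1/16 of that. For example:

#include <stdio.h>

/* Naive, purely scalar estimates for the ikj loop above (illustrative only):
 *   loads  ~ 2*N^3 + N^2   (b[k][j] and c[i][j] every inner iteration, a[i][k] every middle iteration)
 *   stores ~ N^3           (c[i][j] every inner iteration)
 * The "per-line" columns divide by 16, as if each 64-byte line cost one access. */
int main(void) {
    for (long long n = 32; n <= 2048; n *= 2) {
        long long loads  = 2 * n * n * n + n * n;
        long long stores = n * n * n;
        printf("N=%4lld  scalar loads=%12lld stores=%12lld   per-line loads=%12lld stores=%12lld\n",
               n, loads, stores, loads / 16, stores / 16);
    }
    return 0;
}

Neither of these estimates matches the measured numbers above, which is exactly what I'd like to understand.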

EDIT: These are my assembly files; sorry to just dump them here, but I've never worked with assembly and I can't really understand any of it:

http://dl.dropboxusercontent.com/u/878621/mmc.s http://dl.dropboxusercontent.com/u/878621/mmc_asm.s

Thanks!

  • Pretty hard for us to guess when you haven't told us what your `n` is (among other things). – Jerry Coffin Dec 08 '13 at 16:31
  • Have you looked at the compiler output to see what the compiler has allocated to registers? Register accesses do not count as memory accesses. Also, the compiler may have vectorized your code, so it's performing multiple operations with a single load or store. – Joe Z Dec 08 '13 at 16:32
  • Yes I did; it's the first column of the output values, and it goes from 32 to 2048. @JerryCoffin – José Ricardo Ribeiro Dec 08 '13 at 16:32
  • Also, does PAPI count actual load and store instructions, or the number of L1D fetches and writebacks? If it's load and store instructions, then cache size and line size do not matter. – Joe Z Dec 08 '13 at 16:33
  • @JoséRicardoRibeiro: At least according to the text in the question, that's `N`, not `n`. – Jerry Coffin Dec 08 '13 at 16:33
  • @JerryCoffin It's the same. – José Ricardo Ribeiro Dec 08 '13 at 16:34
  • You know C is case sensitive, right? (C++ also) – Max Truxa Dec 08 '13 at 16:35
  • Yes :) n is equal to N, because where I call this function I pass N. – José Ricardo Ribeiro Dec 08 '13 at 16:36
  • I'm not sure exactly what the question is here. Are you expecting these values to be different? If so, in what way? – Oliver Charlesworth Dec 08 '13 at 16:41
  • No, my question is how I can approximately estimate these values without using any counters like PAPI... – José Ricardo Ribeiro Dec 08 '13 at 16:48
  • Basic prefetching happens as part of the architecture these days, not only when you choose -O3. Just a comment on something you wrote. – gnometorule Dec 08 '13 at 17:51
  • @gnometorule - Hardware prefetching is internal; the compiler may add additional software prefetching, which may be useful because the compiler, having just analyzed the entire code, knows more about the program context than the CPU would know at runtime. However, both are unrelated to this performance counter result. – Leeor Dec 08 '13 at 19:16

1 Answer


Looking at the stores, the number you are getting is pretty close to N^3 / 4. We'd expect it to be O(N^3), obviously.

That suggests that 4 float writes are being coalesced into one unit of whatever PAPI_SR_INS is measuring. Put another way, you're counting the number of 16-byte writes.

Similarly, the number of loads is roughly (3/4) N^3. The dominant term should be the loads of b and c inside the innermost loop, which would be 2 reads per iteration. To be honest, I can't make much sense of that.
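For illustration only (this is a hand-written approximation with made-up names, not the actual compiler output), a 4-wide inner loop of the kind the compiler appears to have generated would look roughly like this, with two 4-float loads and one 4-float store per step - which is where an expected N^3 / 2 loads and N^3 / 4 stores would come from:

#include <xmmintrin.h>  /* SSE intrinsics; the T3200 has SSE3, so these are available */

/* Sketch of a vectorized j loop: c[i][j..j+3] += r * b[k][j..j+3].
   Assumes n is a multiple of 4; uses unaligned loads/stores for simplicity. */
static void inner_loop_sse(float *c_row, const float *b_row, float r, int n) {
    __m128 vr = _mm_set1_ps(r);                 /* broadcast a[i][k] into all 4 lanes */
    for (int j = 0; j < n; j += 4) {
        __m128 vb = _mm_loadu_ps(&b_row[j]);    /* 1 load instruction, 4 floats of b */
        __m128 vc = _mm_loadu_ps(&c_row[j]);    /* 1 load instruction, 4 floats of c */
        vc = _mm_add_ps(vc, _mm_mul_ps(vr, vb));
        _mm_storeu_ps(&c_row[j], vc);           /* 1 store instruction, 4 floats of c */
    }
}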

If you don't know exactly what you're measuring, and you don't correlate it with the generated code, it's pretty hard to predict the measurement.

EDIT: the numbers appear to correlate with the load and store instructions executed, but not with the number of L1, L2, etc. transactions or misses - so they are unlikely to correlate with actual performance. Isn't the time taken a better number to worry about? Given the complexity of modern CPU architecture, I'd trust measurement over prediction any day.
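As a quick sanity check (the measured numbers below are copied verbatim from the question), the ratios of the measured counts to N^3 / 4 and (3/4) N^3 are close to 1 for the larger sizes:

#include <stdio.h>

/* Measured PAPI counts from the question, for N = 32 .. 2048. */
static const long long Ns[]     = {32, 64, 128, 256, 512, 1024, 2048};
static const long long loads[]  = {26781, 205053, 1606077, 12714815,
                                   101189551, 807406950, 6450848188LL};
static const long long stores[] = {8290, 65698, 524578, 4194850,
                                   33555490, 268437701, 2147487778LL};

int main(void) {
    for (int i = 0; i < 7; i++) {
        double n3 = (double)Ns[i] * Ns[i] * Ns[i];
        printf("N=%4lld  stores / (N^3/4) = %.4f   loads / (3*N^3/4) = %.4f\n",
               Ns[i], stores[i] / (n3 / 4.0), loads[i] / (3.0 * n3 / 4.0));
    }
    return 0;
}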

Alan Stokes
  • A quick glance at the assembly files suggests that the loop is getting vectorized to operate on four elements at a time. I guess Jose needs to learn a little x86 assembly. – Joe Z Dec 08 '13 at 17:08
  • @JoeZ: Yep, that explains the stores. The loads count doesn't really make sense (I would also expect it to be 1/2 * N^3). – Oliver Charlesworth Dec 08 '13 at 17:11
  • Well, there's the outer-loop load for `a[i][k]`, which would happen N^2 times, and depending on which loop got vectorized, it might need 4 loads there. 4 * N^2 + N^3 / 2 starts to get close. I admit I haven't read the assembly all that closely. – Joe Z Dec 08 '13 at 17:15
  • For N=2048 at least the outer loop ought to be pretty insignificant. – Alan Stokes Dec 08 '13 at 17:23
  • The inner loop (.L10) in the assembly has two 4-float reads and one 4-float write, as you'd expect. (And is impressively concise.) Still baffled what the load counter is actually counting. – Alan Stokes Dec 08 '13 at 19:02