How do I calculate the CPU cycle cost of a C++ code snippet?

Question

Say, I'm given a hypothetical machine :Hypothetical Von-Neumann.

The image says that :

The cache is 1 kB, and the cost of fetching 1 floating point (8 bytes) is 1 CPU cycle.
If the data required by the code is not in the cache, 1 kB of data is taken from the RAM(10MB) at the cost of 150 CPU cycles.

Now, given this machine. I want to know the specifics/rules to calcuate the CPU cycle cost of a code snippet. As an example please take this code where A is a 1024x1024 int matrix and has been initialized with integers :

#define N 1024
sum=0;

for(i=0;i<N;i++)
{
    for(j=0;j<N;j++) 
        sum+=A[i][j];
}

How do I go about calculating the no. of CPU cycles required to fetch the data of matrix A? I'm confused as to how the matrix A will be fetched from the main memory. I'm not looking for an exact answer, just want to know the procedure to go about figuring it out. I'm not fully sure about how the various memory are utilized by the code.

For example, when i = 0 and j = 0, (the first iteration), A will be called from main memory right? So would that mean 1 kB of data being transferred from main memory into cache or only 4 Byte since that element is just an integer? And what about the memory for the instructions or operations? Just confused about this.

What if I replace A[i][j] with A[j][i] above?
Also, if the exact same code is written in FORTRAN, what difference would it make?

EDIT : I just wanna know how to calculate CPU cycles for fetching only the data for matrix A.

So what do you think the answer is to your three questions? And to give a precise answer is not going to be possible with the information given, since we need to know the number of cache-ways, the number of bytes per cache-line, and the instruction time of integer operations which are not covered by the statement above. — Mats Petersson, Jan 23 '16 at 18:18
Not to mention the different optimizations that could be employed by different compilers. It doesn't make sense to ask this question of human-readable code. — Joel Cornett, Jan 23 '16 at 18:23
Could you give some context, why are you asking this question ? What real world problem will that solve ? — kebs, Jan 23 '16 at 18:29

score 0 · Answer 1 · edited May 23 '17 at 10:27

Assuming 32-bit int and no optimizations other then actually using the cache

How do I go about calculating the no. of CPU cycles required? I'm confused as to how the array A will be fetched from the main memory.

A is an 2-dim array, which looks like a 1-dim array as mentioned here:

int array1[2][2] = {{0, 1}, {2, 3}};

In memory looks like this:

0 1 2 3

So ,when the CPU tries to do sum += A[i][j], it takes 1 KB from memory, at a cost of 150 CPU cycles and writes it into the cache. The first time this action happens is the when i=j=0, so 1 KB of ints from that place (assuming a 32 bit integer) is 2^8 ints (basically it's the A[0][0-255] first elements).

Now, each addition the sum += A[i][j] is done by going to the cache and using the memory written there (there is also a time for checking whether the data is in the cache or not, but since you didn't mention it in the question, I'll do the same from now on)

So, to make a long story short, you've just taken a 1 KB into the cache in 150 cycles cost, and you can read from it another 255 (again, assuming 32-bit int) ints in 1 cycle per int (the first has already been read from memory)

What if I replace A[i][j] with A[j][i] above?

Bad, bad things will happen. This action will nullify the cache productiveness, because each time you will go fetch a new int, you will find it's not in the cache and you will have to go get it from memory (150 cycles again. and again :( )

Thanks. I just read that FORTRAN stores arrays in column format. So that would mean it should essentially be the same as the answer to 2. — Abhinav Rajagopalan, Jan 23 '16 at 19:12
Don't know how FORTRAN handles arrays. If it's like C, then my answer is still valid. If it goes in columns (if I understand you correctly), then when taking a KB from memory you'll get A[i1 - i2][j] instead of A[i][j1 - j2] as C would. That means that you should iterate the other way around (first j instead of i) — CIsForCookies, Jan 23 '16 at 20:10

score 0 · Answer 2 · answered Jan 23 '16 at 18:45

To calculate the cycle times, you will need to know the number of bytes per cache-line, the size of integers, and to accurately calculate the time it takes to execute this loop will also depend on what the compiler generates for the loop iterations, at least the inner of the two loops will execute 1 million times, so that's at the very least a couple of million cycles for the typical case of one increment and one compare operation.

Given a size of cache-line, the time taken is:

totalsize = 1024 * 1024 * sizeof(int);
150 * totalsize / cachelinesize
1 * (totalsize - (totalsize / cachelinesize))

Obviously, the order of the index will matter: The address of an element in an array in C, say T arr[A][B];, accessing arr[a][b] is arr + (sizeof(T) * (A * a + b), so if the "next" access is far apart, then the cache-line will not caontain the next element being accessed.

[I believe the Fortran part of the question is referring to the fact that in Fortran, the address calculation for a given element is the reverse order to that of C - in other words, the first index is the one that moves one sizeof(T) at a time, and the right-most index is moving sizeof(T) * B if we have the same array definition as above.]

What you call row and column is of course just convention and semantics. But yes, the first index is the smallest step, and A(1,1) is followed by A(2,1) and so on. — Mats Petersson, Jan 23 '16 at 19:06

How do I calculate the CPU cycle cost of a C++ code snippet?

2 Answers2