CPU cache understanding using a C program

Question

I am trying to understand CPU caches and cache lines using a C program, as I do for most C concepts. The program that I use is given below. I got the idea from the blog.

http://igoro.com/archive/gallery-of-processor-cache-effects/

Now the output of the below program on my machine is as given below. This is the output with CFLAGS="-g -O0 -Wall".

./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

The output with optimization (CFLAGS=-g -O3 -Wall)

CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

It is pointed out in the blog that

The first loop multiplies every value in the array by 3, and the second loop multiplies > only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.

This doesn't seem to be the case on my machine. As you can see, the time for the execution of

loop 1 is 0.46 seconds.

and that for

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

for different values of j.

Why is this happening?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
     It lives in a fixed-size buffer, set by the program executable's configuration according to the
     operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;

        /* CPU clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    return 0;
}

duplicate ? http://stackoverflow.com/questions/17138250/segmentation-fault-when-calling-clock — hit, Jun 16 '13 at 23:36
Not really duplicate.I posted the other question regarding segfault.I fixed that and this is an attempt to interpret the result. — liv2hak, Jun 16 '13 at 23:37
@hit: Whilst the CODE is the same for both questions, the actual question asked is quite different... — Mats Petersson, Jun 16 '13 at 23:37
Could you report what compiler you're using, and what flags you're passing it? — Phil Miller, Jun 16 '13 at 23:46
excellent question.I am using gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52) CFLAGS = -g -O0 -Wall — liv2hak, Jun 16 '13 at 23:47
Wow, that's a positively ancient compiler. Also, when you're benchmarking, you should be compiling with optimization enabled. Dare I ask what hardware you're running this on? — Phil Miller, Jun 16 '13 at 23:53

score 3 · Accepted Answer · edited Sep 02 '17 at 16:27

I modified the code a little bit. First a summary of modifications:

Make MAX_SIZE significantly larger, to ensure that there is a REAL difference when things change. (It does now use a full 2 GB of memory, so don't do this on a 32-bit OS)
Run loop 1 a couple of times (on my machine, this makes some difference, because my machine will run slower the first time around - this is probably because malloc doesn't actually map in the memory into the process' address space, so in the first loop, we get some extra overhead for mapping in the memory). It also ensures that the CPU is running at "full speed" when performing the other loops, rather than being that "power-save" speed.
Vary the j value more rapidly in the second loop by multiplying by 2 ( <<= 1 is the same as *= 2 in this case - old habit to use shift)
Use += 3 instead of *= 3. (Multiply is a little slower than +=, but makes little difference in this case.
Add a loop3 which does exactly the same number of operations as loop2, but over a small memory range [using & with a 2ⁿ-1 value to restrict the range].

I compiled the code with gcc -Wall -O3 -sdc=c99, using version 4.6.3 and running on a quad-core Athlon 965, Fedora Core 16 x86-64 and 16 GB of RAM.

Here's the code:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */

    for(int k = 0; k < 3; k++)
    {
        start = clock();

        /* Loop 1 */
        for (i = 0; i < MAX_SIZE; i++)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
    }

    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }


    // Third loop, performing the same operations as loop 2,
    // but only touching 16KB of memory
    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 3 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i & 0xfff] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}

And the results:

CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.

As you can see, the first few of the loop2 takes the same amount of time - once we get to 32 the time starts to go down, because the processor doesn't need EVERY cache-line, but in the loop3 case, the number of operations in each loop affects the total time rather directly.

Edit:

The multiply (*=3) vs add (+=3) doesn't really make that much difference EXCEPT in the loop3 case, where it adds approximately 30% on the loop time.

You can fix the problem with the loop by memsetting the array after mallocing it. — SoapBox, Jun 16 '13 at 23:57
@SoapBox: Yes, I presume so, but that doesn't stop other things, such as the CPU not running at full speed (due to power-save modes) for the first couple of tens of a second [assuming memset is fast enough to not raise the CPU speed to max - not sure if it is or not] — Mats Petersson, Jun 16 '13 at 23:58

CPU cache understanding using a C program

1 Answers1