
I have the following program, which I wrote (with help from others on Stack Overflow) to understand cache lines and CPU caches. The results of the calculation are posted below.

    k  cache_access_1(ms)  cache_access_2(ms)
    1     450.0  440.0
    2     420.0  230.0
    4     400.0  110.0
    8     390.0   60.0
   16     380.0   30.0
   32     320.0   10.0
   64     180.0   10.0
  128      60.0    0.0
  256      40.0   10.0
  512      10.0    0.0
 1024      10.0    0.0

I have plotted a graph using gnuplot which is posted below.

[graph of the table above: cache_access_1 (red line) and cache_access_2, access time plotted against k]

I have the following questions.

  1. Is my timing calculation in milliseconds correct? 440 ms seems like a lot of time.

  2. From the graph of cache_access_1 (red line), can we conclude that the size of a cache line is 32 bits (and not 64 bits)?

  3. Is it a good idea to clear the cache between the for loops in the code? If yes, how do I do that programmatically?

  4. As you can see, there are some 0.0 values in the results above. What do they indicate? Is the granularity of the measurement too coarse?

Kindly reply.

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int k = 0;
    int count = 0;

    /*
     * A MAX_SIZE array is too big for the stack. This is an unfortunate rough edge of the way the stack works. 
     * It lives in a fixed-size buffer, set by the program executable's configuration according to the 
     * operating system, but its actual size is seldom checked against the available space.
     */

    /*int arr[MAX_SIZE];*/

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /*cpu clock ticks count start*/

    for(k = 0; k < 3; k++)
    {
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i++)
        { 
            arr[i] += 3;
            /*count++;*/
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 1 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }

    printf("\n");
    for (k = 1 ; k <= 1024 ; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i += k) 
        {
            /*count++;*/
            arr[i] += 3;
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 2 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }
    printf("\n");

    /* Third loop, performing the same operations as loop 2, 
       but only touching 16KB of memory
     */
    for (k = 1 ; k <= 1024 ; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i += k) 
        {
            count++;
            arr[i & 0xfff] += 3;
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 3 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }

    return 0;
}
liv2hak
    I assume you are using clock() because you are using linux. I only know that both osx and windows have better high accuracy performance counters. But if you run long enough any timer should do eventually :) Now one thing that I noticed is that you do power of two strides. To measure you need to do random patterns. Many caches are implemented with very simple hashes that just take the low bits. Triggering cache aliasing by accessing data in exactly power of two strides is actually a performance pitfall on some cpus. – starmole Jun 18 '13 at 05:22
    I do the similar things like you in that question: http://stackoverflow.com/questions/16249694/the-accessing-time-of-traversal-a-array-via-a-constant-length. – Sayakiss Jun 18 '13 at 05:23
  • Not that it matters, but no need to cast the malloc result. That line should read `int *arr = malloc(MAX_SIZE * sizeof *arr);`. – Jens Jun 18 '13 at 05:52
  • @starmole: "windows ha[s] better high accuracy performance counters" - nothing usable at <~15ms granularity unless you bind the process to a core (and hope there's nothing else similarly bound to compete with it). Windows is *horrible* for high-resolution timing. – Tony Delroy Jun 18 '13 at 06:23
  • Tony D: http://msdn.microsoft.com/en-us/library/windows/desktop/ms644904(v=vs.85).aspx QueryPerformance counter is usually at clock rate. Same for osx: https://developer.apple.com/library/mac/#qa/qa1398/_index.html. I am sure linux has an equivalent, but I just don't know the name there. And yes, every timing will be off if you have much load on other cores etc. I just read up on clock a bit and it seems about like the right call. Just had an odd ring to it, sorry about that. – starmole Jun 18 '13 at 06:34
  • @starmole: from the msdn link you provide "However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL).". Doesn't matter if Microsoft blame BIOS/HAL writers - fact is it's often broken and the cross-core deltas can easily be of multiple-second magnitude. – Tony Delroy Jun 18 '13 at 10:17
  • @Tony D: There is no need to bash individual implementations here. When I saw clock() I was very quick to assume it's not a good way to do performance timing. I am still not sure if it only ever is accurate for time slices. But most likely I was wrong. It's intended for that, and a good choice. But let's see what we can learn here: (1) If you are trying to measure a thing as low level as cache line sizes, you need to really understand the timer you use for doing that. (2) Make sure that for testing you mask to one cpu. (3) Make sure you test the right thing, not pow2 strides. – starmole Jun 18 '13 at 10:54

1 Answer


Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.

  1. 440 ms is probably accurate. A better way to look at the results is time per element or per access; note that increasing k reduces the number of elements accessed. cache_access_2 shows a fairly steady result of about 0.9 ns per access, which is roughly comparable to 1 - 3 cycles per access (depending on the CPU's clock rate). So the measurements for sizes 1 - 16 (maybe 32) are accurate.
  2. No (although I will first assume you mean 32 versus 64 bytes). You should ask yourself: what does "cache line size" look like in these measurements? If the stride is smaller than the cache line, you miss once and then hit one or more times before the next miss. If the stride is greater than or equal to the cache line size, every access misses. At k=32 and above, the access time for cache_access_1 is relatively constant at 20 ns per access. At k=1-16, the overall access time is roughly constant, suggesting approximately the same number of cache misses. So I would conclude that the cache line size is 64 bytes.
  3. Yes, at least for the last loop, which only touches ~16KB. How? Either touch a lot of other data, such as another 1GB array, or call an instruction like x86's WBINVD, which writes dirty cache lines back to memory and then invalidates all cache contents; however, it requires you to be in kernel mode.
  4. As you noted, beyond size 32 the times hover around 10 ms, which reflects your timing granularity. You need to either increase the time required (so that a 10 ms granularity is sufficient) or switch to a different timing mechanism, which is what the comments are debating. I'm a fan of the rdtsc instruction (read timestamp counter, i.e., cycle count), but it can be even more problematic than the suggestions above. Switching your code to rdtsc basically means replacing clock, clock_t, and CLOCKS_PER_SEC. However, you could still face clock drift if your thread migrates between cores; this is a fun test, so I wouldn't worry about that here.

More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").

Page faults may also be impacting your results. You are allocating 512M ints, or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have that much memory available (not just total, since the OS etc. takes up some space), your results will be skewed. Furthermore, the OS may start reclaiming some of the space, so you will always be page-faulting on some of your accesses.

Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical address in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold less than 1000 entries, so the measurements are also skewed by this miss. Fortunately, it is only once every 4KB or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you, for more details - Huge Pages in Linux.

Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.

Brian