34

Full re-write/update for clarity (and your sanity, it's a bit too long) ... (Old Post)

For an assignment, I need to find the levels (L1, L2, ...) and the size of each cache. Given the hints and what I've found so far, I think the idea is to create arrays of different sizes and read them, timing these operations:

sizes = [1K, 4K, 256K, ...]
foreach size in sizes
    create array of `size`

    start timer
    for i = 0 to n                      // just keep accessing the array
        arr[(i * 16) % arr.length]++    // i * 16 is supposed to touch a new cache line each access ... see link
    stop timer
    record/print time
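
A rough C translation of that pseudocode might look like this (the sizes, the access count `n`, and the use of `clock()` are placeholders, not the assignment's actual requirements):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* candidate working-set sizes in bytes (placeholders) */
    size_t sizes[] = { 1024, 4096, 262144, 1048576, 4194304 };
    long long n = 64LL * 1024 * 1024;          /* accesses per size */

    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t len = sizes[s] / sizeof(int);
        int *arr = calloc(len, sizeof(int));   /* create array of `size` bytes */

        clock_t start = clock();               /* start timer */
        for (long long i = 0; i < n; i++)
            arr[(i * 16) % len]++;             /* i * 16 skips to a new cache line each access */
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("%7zu bytes: %.6f s\n", sizes[s], t);   /* record/print time */
        free(arr);
    }
    return 0;
}

The `%` is kept only to mirror the pseudocode; the comments and answers below suggest using a single large buffer and masking with a power of two minus one instead.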

UPDATED (28 Sept 6:57PM UTC+8)

See also full source

OK, following @mah's advice, I might have fixed the SNR problem ... and I also found a method of timing my code (wall_clock_time from a lab example).
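
wall_clock_time itself isn't shown here, but a stand-in built on the POSIX monotonic clock (as also suggested in the comments below) would look roughly like this; on older glibc you may need to link with -lrt:

#include <time.h>

/* Wall-clock seconds from a monotonic clock (POSIX); not necessarily
   identical to the lab's wall_clock_time, just a stand-in. */
static double wall_clock_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}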

However, I seem to be getting incorrect results. I am on an Intel Core i3-2100: [SPECS]

  • L1: 2 x 32K
  • L2: 2 x 256K
  • L3: 3MB

The results I got, in a graph:

lengthMod: 1KB to 512K

[graph]

The base of the 1st peak is at 32K ... reasonable ... but the base of the 2nd is at 384K ... why? I was expecting 256K.

lengthMod: 512k to 4MB

[graph]

Why might this range be in such a mess?


I also read about prefetching and interference from other applications, so I closed as many programs as possible while the test was running. Yet, consistently across multiple runs, the data for 1MB and above is always this messy. Why?

Jiew Meng
  • You've added multiple questions, which makes it difficult to answer in SO format since this isn't really a discussion board. 1) the size of arr is not 262144, it's 1M * sizeof(int) -- the array size (1024*1024) is the number of ints it holds, not the number of bytes. 2) you're correct; the code you're copying assumes 16 bytes per entry. 3) there is a mod operator, but AND'ing is _much_ faster, and reliable for powers of 2. 4) you can declare a single buffer (largest size) but use varying amounts of it. – mah Sep 26 '12 at 11:12
  • 5) correct, other processes on your system will taint your results, so either run many trials or remove other processes. finally) little difference because of the poor signal-to-noise ratio I described. If you run on machine with poor memory access speeds you will get more useful results. I'll post another answer below with something else you can try to improve this... – mah Sep 26 '12 at 11:14
  • @mah, do you mean my latest update? I've updated recently to simplify the question, by quite a lot? – Jiew Meng Sep 26 '12 at 11:14
  • Our updates were done at roughly the same time; I was commenting while you were updating, so I had not seen your reduction :) See my answer update below to address your size/time correlation problem, but for more clarification: do not change the array size; leave it at 1M (or at least leave it larger than the largest cache size). Your changes for the sizes you're testing should be to the lengthMod variable only; just make sure it's always a power of two minus one. So, for 1k set it to (1 * 1024) - 1; for 4k set it to (4 * 1024) - 1; etc. – mah Sep 26 '12 at 11:27
  • Regarding the wall-clock time, you should read this thread about the different clocks available in C and use a better clock: http://stackoverflow.com/questions/12392278/getrusage-vs-clock-gettime-vs-clock-vs-gettimeofday/12480485#12480485 – Douglas B. Staple Sep 28 '12 at 18:18

6 Answers

41

After 10 minutes of searching the Intel instruction manual and another 10 minutes of coding I came up with this (for Intel based processors):

#include <stdio.h>
#include <stdint.h>

void i386_cpuid_caches () {
    int i;
    for (i = 0; i < 32; i++) {

        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx; 

        eax = 4; // get cache info
        ecx = i; // cache id

        __asm__ (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx 

        // See the page 3-191 of the manual.
        int cache_type = eax & 0x1F; 

        if (cache_type == 0) // end of valid cache identifiers
            break;

        char * cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;

        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;

        // See page 3-192 of the manual.
        // ebx packs 3 fields: line size (12 bits), partitions (10 bits), ways (10 bits)
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product
        size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %d\n"
            "- System Coherency Line Size: %d bytes\n"
            "- Physical Line partitions: %d\n"
            "- Ways of associativity: %d\n"
            "- Total Size: %zu bytes (%zu kb)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }
}

Reference: Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 2A, page 3-190, CPUID—CPU Identification.

This is much more reliable than measuring cache latencies, as it is pretty much impossible to turn off cache prefetching on a modern processor. If you need similar info for a different processor architecture, you will have to consult the respective manual.
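
A minimal driver for the function above, in case it helps (the file name and compile line are only examples):

// cpuid_caches.c -- paste i386_cpuid_caches() above this
// build with e.g.: gcc -o cpuid_caches cpuid_caches.c
int main(void) {
    i386_cpuid_caches();
    return 0;
}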

Sergey L.
  • Erm ... I am not so familiar with C, though I have a programming background, so it would be nice to have some comments. Are you reading BIOS code or system config? That's not allowed in this assignment :( – Jiew Meng Oct 02 '12 at 23:06
  • 4
    This is the engineer's answer, love it :). If the information is there, why not just look it up? – user1443778 Oct 03 '12 at 07:17
  • No, this does not interact with the BIOS in any way; it only uses the CPU. This code queries the CPU directly for its specifications, similar to reading its serial number. I will add a couple of comments to the code later today. – Sergey L. Oct 03 '12 at 09:44
  • As promised I have commented my code and added some more informative output. – Sergey L. Oct 03 '12 at 13:21
15

The time it takes to measure your time (that is, the time just to call the clock() function) is many many (many many many....) times greater than the time it takes to perform arr[(i*16)&lengthMod]++. This extremely low signal-to-noise ratio (among other likely pitfalls) makes your plan unworkable. A large part of the problem is that you're trying to measure a single iteration of the loop; the sample code you linked is attempting to measure a full set of iterations (read the clock before starting the loop; read it again after emerging from the loop; do not use printf() inside the loop).

If your loop is large enough you might be able to overcome the signal-to-noise ratio problem.

As to "what element is being incremented"; arr is an address of a 1MB buffer; arr[(i * 16) & lengthMod]++; causes (i * 16) * lengthMod to generate an offset from that address; that offset is the address of the int that gets incremented. You're performing a shift (i * 16 will turn into i << 4), a logical and, an addition, then either a read/add/write or a single increment, depending on your CPU).

Edit: As described, your code suffers from a poor SNR (signal to noise ratio) due to the relative speeds of memory access (cache or no cache) and calling functions just to measure the time. To get the timings you're currently getting, I assume you modified the code to look something like:

#include <stdio.h>
#include <time.h>

/* static: the 4 MB buffer lives in the data segment instead of overflowing the stack */
static int arr[1024 * 1024];

int main() {
    int steps = 64 * 1024 * 1024;
    int lengthMod = (1024 * 1024) - 1;
    int i;
    double timeTaken;
    clock_t start;

    start = clock();
    for (i = 0; i < steps; i++) {
        arr[(i * 16) & lengthMod]++;
    }
    timeTaken = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("Time for %d accesses: %.12f \n", steps, timeTaken);
    return 0;
}

This moves the measurement outside the loop so you're not measuring a single access (which would really be impossible) but rather you're measuring steps accesses.

You're free to increase steps as needed and this will have a direct impact on your timings. Since the times you're receiving are too close together, and in some cases even inverted (your time oscillates between sizes, which is not likely caused by cache), you might try changing the value of steps to 256 * 1024 * 1024 or even larger.

NOTE: You can make steps as large as you can fit into a signed int (which should be large enough), since the logical and ensures that you wrap around in your buffer.
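
Building on that, here is a rough sketch of automating the sweep over sizes with a single large buffer, varying only lengthMod (the buffer size, step count, and use of long long to keep i * 16 from overflowing are my own choices, not part of the original code):

#include <stdio.h>
#include <time.h>

#define ARR_INTS (16 * 1024 * 1024)          /* 64 MB: larger than the biggest cache */
static int arr[ARR_INTS];

int main(void) {
    long long steps = 256LL * 1024 * 1024;   /* accesses per size; tune for your machine */

    /* Window sizes to test, in bytes; each must be a power of two. */
    for (long long size = 1024; size <= (long long)ARR_INTS * sizeof(int); size *= 2) {
        long long lengthMod = size / sizeof(int) - 1;   /* power of two minus one */
        clock_t start = clock();
        for (long long i = 0; i < steps; i++)
            arr[(i * 16) & lengthMod]++;                /* stride of 16 ints = 64 bytes */
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%8lld KB: %.6f s\n", size / 1024, t);
    }
    return 0;
}

If the SNR is good enough, the time per access should step up near the 32K, 256K and 3MB boundaries listed in the question.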

mah
  • FYI, updated my post with full source (Github Gist), see *UPDATE (An Attempt)* – Jiew Meng Sep 26 '12 at 10:26
  • Ok, thanks for your update. Perhaps I am not really understanding the SNR thing well. When I run your code, varying lengthMod seems to work as expected. The next step is plotting a graph. Then I will need some way of automating getting values for different `lengthMod`; that's the reason I have a for loop. It's actually not to record the time per operation but the time for one "size" (1K, 4K, ...) of `steps` accesses. See **[https://gist.github.com/3787223#file_try.c](https://gist.github.com/3787223#file_try.c)**. I get all access times of ~0.64s; why is that? The inner loop is the one doing the work. – Jiew Meng Sep 26 '12 at 14:00
  • If all else fails, I could make my program accept an execution option via `cin` to specify the lengthMod? – Jiew Meng Sep 26 '12 at 14:01
  • 1
    The SNR aspect is... let's say it takes 25ns to access memory in the cache and 200ns to access memory not in the cache. These are very small amounts of time, and your clock resolution is quite coarse - on the order of 1ms - which means your measurements can be off by millions of memory access cycles. For that reason, you must measure multiple millions of test cycles at once instead. In your code, increase `steps` to do this for a single cache size test, and use lengthMod as the variable to define which size test you run. You can use cin for that, but it would be better to trial several automatically. – mah Sep 26 '12 at 14:07
  • Hmm... I was wondering if there are more accurate ways of measuring performance. In a recent lab, we were very briefly introduced to `perf` and `papi`. `perf` is easier to use ... but is there a way to get timing/stats from within the C program? Or must I run it through the command line only? `perf stat ./test` – Jiew Meng Sep 27 '12 at 02:21
  • Ok, got it working. However, how do I determine the levels of cache, or for that matter the size of each cache, by code? For now, I plot a graph, and if I see a spike it's likely a transition to another cache ... but the spike is sometimes more, sometimes less pronounced. – Jiew Meng Oct 04 '12 at 13:51
  • There's no clear answer, just a best guess. You want to run many trials to develop good statistics of the timing, ideally when there is nothing else using the system. Then when your statistics are as reliable as you want them to be, you use them to measure where the spike occurs and that infers your answer. – mah Oct 04 '12 at 14:01
  • While I was working on another sub-task, to determine the write policy, I was wondering if it's better to do just a read rather than a read/write (increment)? `tmp += (data[(k * 16) & lengthMod] & SIZE);` What I actually need here is only the read speed, right? If I write and it's write-through, all the times should be similar, as it will write to all levels of cache? – Jiew Meng Oct 06 '12 at 13:53
  • I think that you need to first determine cache size using reads exclusively. Once you know the size of the cache, you could then time write operations to possibly develop some meaningful intuitions about the write policy based on its timing. I'm uncertain though how write-through actually works -- is it possible the cache hardware takes care of the write-through in parallel to the processor running the next instruction, continuing its write to RAM in parallel to the next instruction fetch (and stalling the CPU if that cache line needs to be flushed on the next instruction)? – mah Oct 07 '12 at 01:09
  • I tried to get the read-only timing, but there appears to be very little difference between the times from L1 to L2, and within L2 (32 - 256K) the access times seem to be increasing, only stabilizing in L3. Any idea why that might be the case? **[source](https://github.com/jiewmeng/cs3210-assign1/blob/2065e8e22b6631aa2569fadd8af35faa643f9f25/cache.cpp)** – Jiew Meng Oct 07 '12 at 11:16
  • I've no idea why this would happen; sorry. – mah Oct 07 '12 at 11:26
3

I know this! (In reality it is very complicated because of pre-fetching)

 for (times = 0; times < Max; times++)       /* repeat many times */
     for (i = 0; i < ArraySize; i += Stride)
           dummy = A[i];                     /* touch an item in the array */

Changing stride allows you to test the properties of caches. By looking at a graph you will get your answers.

Look at slides 35-42 http://www.it.uu.se/edu/course/homepage/avdark/ht11/slides/11_Memory_and_optimization-1.pdf

Erik Hagersten is a really good teacher (and also really competent; he was lead architect at Sun at one point), so take a look at the rest of his slides for more great explanations!
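
A minimal, runnable version of that loop might look like the following sketch (the array size, stride range and repeat count are arbitrary; volatile is there so the compiler cannot optimize the reads away):

#include <stdio.h>
#include <time.h>

#define ARRAY_BYTES (8 * 1024 * 1024)        /* 8 MB: larger than L3 */
static volatile char A[ARRAY_BYTES];         /* volatile so the reads are not optimized away */

int main(void) {
    /* Try strides from 4 bytes up to 4 KB; watch where the time per access jumps. */
    for (int stride = 4; stride <= 4096; stride *= 2) {
        char dummy = 0;
        clock_t start = clock();
        for (int times = 0; times < 100; times++)        /* repeat to average out noise */
            for (int i = 0; i < ARRAY_BYTES; i += stride)
                dummy = A[i];                            /* touch one item per stride */
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;
        long accesses = 100L * (ARRAY_BYTES / stride);
        printf("stride %5d: %.2f ns/access (dummy=%d)\n", stride, 1e9 * t / accesses, dummy);
    }
    return 0;
}

This only varies the stride over one large array, which mainly exposes the cache line size; varying ArraySize as well, as the slides do, is what exposes the sizes of the individual cache levels.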

user1443778
  • I tried this, but I notice that as ArraySize increases, so does the time, probably because if Stride is constant, the number of inner-loop iterations increases as ArraySize increases. Also, I updated my question ... see *UPDATE (An Attempt)* – Jiew Meng Sep 26 '12 at 10:27
  • The first property is cache-line size (e.g. 64B): if the stride is less than 64B you will get a cache hit at least every second access. Then we want to see the L1 cache size (e.g. 64KB), so here we need to look at the number of strides for each inner loop. So let's say we have a 1024K ArraySize and Stride=128B; then the number of inner loops is 16384. With the known cache line size we know this will take 512K to store and will not fit in the L1 cache. Remember LRU and line size. – user1443778 Sep 27 '12 at 02:13
  • It is important that you actually run the outer loop many times (millions of times!!!!) and get the average time for each loop. We are interested in the properties in the limit, not the transient process of starting up the program etc. – user1443778 Sep 27 '12 at 02:16
2

To answer your question about the weird numbers above 1MB: it's pretty simple. It comes down to cache eviction policies tied up with branch prediction, and the fact that the L3 cache is shared between the cores.

A Core i3 has a very interesting cache structure; actually, any modern processor does. You should read about them on Wikipedia: there are all sorts of ways for the processor to decide "well, I probably won't need this...", after which it might say "I'll put it in the victim cache", or any number of other things. L1/L2/L3 cache timings can be very complex, based on a large number of factors and individual design decisions.

On top of that, all these decisions and more (see the Wikipedia articles on the subject) have to be synchronized between the two cores' caches. The methods for synchronizing the shared L3 cache with the separate L1 and L2 caches can be ugly; they can involve back-tracking and redoing calculations, among other approaches. It's highly unlikely you'll ever have a completely free second core with nothing competing for L3 cache space, so you get synchronization weirdness.

In general, if you are working on data, say convolving a kernel, you want to make sure it fits within the L1 cache and shape your algorithm around that. The L3 cache isn't really meant for working on a data set the way you're doing it (though it is better than main memory!).

I swear if I was the one having to implement cache algorithms I'd go insane.

std''OrgnlDave
1

For benchmarking with varying strides, you could try lat_mem_rd from the lmbench package; it's open-source: http://www.bitmover.com/lmbench/lat_mem_rd.8.html

I posted my port for Windows at http://habrahabr.ru/post/111876/ -- it's too lengthy to copy-paste here. That's from two years ago; I haven't tested it with modern CPUs.

Artyom Skrobov
-1

For Windows, you can use the GetLogicalProcessorInformation function.

For Linux, you can use sysconf(). You can find the valid arguments for sysconf in /usr/include/unistd.h or /usr/include/bits/confname.h.
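
If querying the system is allowed, a quick probe with the glibc-specific _SC_LEVEL* constants (the same ones mentioned in the comments below) might look like this:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* These _SC_LEVEL* names are glibc extensions; sysconf returns -1 or 0
       if the value is unavailable. */
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc:     %ld\n",       sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}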

Mark
  • They are not allowed to use system config information for this assignment and `sysconf` does not give you cache information the last time I checked. – Sergey L. Oct 05 '12 at 10:29
  • 1
    It does. Call it with `_SC_LEVEL(1_I|1_D|2_|3_|4_)CACHE_(SIZE | ASSOC | LINESIZE)`. Edit: This is why I included the include-files where these values are defined. – Mark Oct 08 '12 at 09:07