How to write a program in C to measure the speed of cache?

Question

Write a program and try to compare(measure, if you can) the time of accessing data from main memory and cache.

If you can do that, then how to measure the speed of each level of cache?

Is tnis even feasible? The caches aren't under your control, you have no way of knowing when data is loaded from where. (Okay, maybe you could trace cache misses but I'm guessing the tracing overhead would confound the results.) — millimoose, Apr 26 '13 at 17:37
maybe use registers and mmap? But this seems highly subjective (there are other processes running on a computer besides this). Sounds like something that should be done on a hardware level; otherwise other threads/processes/OS-stuff will get in the way — cegfault, Apr 26 '13 at 17:38
As far as I can see, if I define a large array in C, when I access a element in that array, the data around this element seems to be stored in cache. So if I access the array from begin to end it will be faster than random access(access every element once)---this is true, but I don't know it's the result by caching or something else. — Sayakiss, Apr 26 '13 at 17:40
taking a WILD guess, maybe this exercise/project is from a very old book where this kind of test was possible due to poorly optimized hardware/cacheing? — im so confused, Apr 26 '13 at 17:40
@millimoose: They are not under a direct control, but it is still feasible using certain heuristics. For example, one can write a program to cause a cache miss, and then compare the memory access speed to one without a cache miss. Again, there are multiple levels of cache, etc. So this is not a trivial task. — , Apr 26 '13 at 17:40
perhaps you should just trust that you can retrieve an element from cache real, real fast... — Grady Player, Apr 26 '13 at 17:41
My best guess would be to allocate a huge array of multiple blocks of memory on the heap (> cache size), then read in each "page" for your tests. That would pull the page (< cache size) to your cache, then do your ops. Then do it once on a randomly chosen page, which will likely not be on the cache. See how long that takes — im so confused, Apr 26 '13 at 17:41
@Sayakiss That's still assuming that nothing optimizes this under the hood by prefetching or whatever. (Remember that you probably want more samples of hits and misses, and that might be enough information for a sufficiently smart whatever.) I honestly don't know *what* could happen, modern CPUs and compilers are just insanely complex and it seems like it would take very extensive knowledge to even begin to do this correctly. — millimoose, Apr 26 '13 at 17:43
Also, remember that taking the system time will also probably change the contents of the cache. So that also leaves you with the assumption that nothing causes the page you thought is cached to be nuked. — millimoose, Apr 26 '13 at 17:47
A Google for *cache benchmarking* will produce a gazillion hits including some well-respected cache benchmarking programs. — High Performance Mark, Apr 26 '13 at 17:54

Michael Dorgan · Answer 1 · 2013-04-26T17:57:59.687

You need to come up with a heuristic that forces a 100% (or very close) cache miss (hopefully you have a cache invalidation op code?) and 100% cache hit. Hooray, that works for 1 level of cache. Now, how to do the same for level 2 and 3?

In all seriousness, there probably isn't a way to do this 100% reliably without special hardware and traces connected to the CPU and memory, but here's what I would do:

Write a "bunch" of stuff to 1 location in memory - enough that you can be sure that it is hitting the L1 cache consistantly and record the time (which affects your cache so beware). You should do this set of writes without branches to try and get rid of branch prediction inconsistancies. That is best time. Now, every so often, write a cache-line's worth of data to a random far away location in RAM at the end of your known location right and record the new time. Hopefully, this takes longer. Keep doing this recording the various times and hopefully you will see a couple of timings that tend to group up. Each of these groups "could" show timings for L2, L3, and memory access timings. The problem is there is so much other stuff getting in the way. The OS could context switch you and screw up your cache. An interrupt could come along and through your timing off. There will be a lot of stuff that could throw the values off. But, hopefully, you get enough signal in your data to see if it works.

This would probably be easier to do on a simpler, embedded type system where the OS (if any) won't get in your way.

But how to measure the timing of read a data once? It's so short and may be just a few ns! — Sayakiss, Apr 27 '13 at 06:11
By doing it enough times. You should be able to get a count into the millisecond realm. Still, I didn't say this would be easy :) — Michael Dorgan, Apr 29 '13 at 15:47
But after doing it one time, it may be loaded in cache(assume which is not before). — Sayakiss, Apr 29 '13 at 19:42

score 3 · Accepted Answer · answered Apr 26 '13 at 18:01

This generally requires some knowledge of the “geometry” of cache and other aspects of it. It is also helpful to have some control of the system beyond simple user access to it and implementation-dependent things such as finer timing than might be supplied through the standard C clock mechanism.

Here is an initial approach:

Write a routine that takes a pointer to memory, a length, and a number of repetitions and reads all of that memory in consecutive order, repeatedly.
Write a routine that takes a pointer to memory, a length, and a number of repetitions and writes to all of that memory in consecutive order, repeatedly.
The above routines may have to convert their pointers to volatile to prevent the compiler from optimizing away accesses that otherwise have no effect.
Allocate a large amount of memory.
Call each of the above routines, getting the current time before and after each call, and calling with a variety of lengths to see the times for different lengths.

When you do this, you will typically see fast speeds (number of bytes read/written per second) for small lengths and slower speeds for longer lengths. The speed decreases will occur where the sizes of the different levels of cache are exceeded. So you are quite likely to see the sizes of L1 and L2 cache reflected in data collected using the above technique.

Here are some reasons that approach is inadequate:

It does not control the instructions used to read or write cache. The C compiler may well generate load-word and store-word instructions, but many modern processors have instructions that can load and store 16 bytes at a time, and reading and writing may be faster with those instructions than with four-byte word instructions.
Cache will behave differently when you access in sequentially than if you access it randomly. Most caches make some sort of attempt to track when data is used, so that recently-used data is kept in cache while other data is cast out. The access parts of real programs generally differ from the consecutive operations described above.
In particular, consecutive writes to memory may be able to fill an entire cache line, so that nothing needs to be read from memory, whereas a real-world usage pattern that writes only one word to a particular location may have to be implemented by reading the cache line from memory and merging in the changed bytes.
Competition from other processes on your system will interfere with what is in cache and with measurement.

score 2 · Answer 3 · edited May 23 '17 at 11:58

2

Take a look at cachegrind-valgrind:

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

See tese nice questions they are somehow related:

edited May 23 '17 at 11:58

Community

1
1

answered Apr 26 '13 at 17:53

0x90

39,472
36
165
245

1

Ah, emulation might be a better option if its quality of emulation is good enough. Good idea. – Michael Dorgan Apr 26 '13 at 17:55

How to write a program in C to measure the speed of cache?

3 Answers3

Linked