6

I have an assignment where I am required to measure the latency of accessing data in the L1, L2, and L3 caches, as well as main memory. This is to be done in C.

I've spent several hours researching ways to measure cache latency and have turned up very little. I have downloaded some benchmarking tools that report cache access times, but I have not gotten anywhere implementing this in my own code. I understand that cache behavior is not something I can control directly from C.

My next thought was that if I could force-populate the cache with some data from x86 assembly (my first thought), and then do a clock(), access(), clock() sequence on the data I just loaded, the measured time should be a reasonably accurate access time, since I know the data should be found in the cache because I just put it there with my inline asm or a similar method...
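Roughly, this is the kind of thing I had in mind, although I'm guessing at the details (the tiny array, the repeat count, and using POSIX clock_gettime() because clock() probably can't resolve a single access are all my own assumptions):

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    static volatile int data[16];      /* tiny array: should easily sit in L1 */
    struct timespec t0, t1;
    long long sum = 0;

    /* First touch: pull the line into the cache. */
    sum += data[0];

    /* Time a lot of repeat accesses so the clock can actually see them. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000000; i++)
        sum += data[0];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.2f ns per (hopefully L1) access, sum=%lld\n", ns / 1000000, sum);
    return 0;
}
```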

If anyone could offer insight into my assignment here, that would be fantastic, whether that means telling me I am crazy for wanting to use asm to load something into the cache, or pointing me to something else that might help.

Thanks so much!

mrkanaly
  • maybe this similar question might be of help: http://stackoverflow.com/a/12697439/358328 – Uku Loskit Sep 02 '13 at 17:48
  • While that question is relevant, it isn't quite aimed at measuring the latency, which is the only thing I am concerned about. Thanks for the post though! I am trying to work from there. – mrkanaly Sep 02 '13 at 19:51

3 Answers

8

There is no reason to use assembly at all for this assignment; plain C will work just as well. I assume you are running on top of an operating system, which is going to get in the way of your measurements, both in executing what you think you have placed where, and in measuring what you think you are measuring.

Cache basics, as far as taking these measurements is concerned: let's say there are four layers of memory. L1 is the fastest, but also the most expensive and the smallest. Then L2: slower, not as expensive, and likely larger than L1. L3 is less expensive still, slower, and larger; and then main memory, the slowest, cheapest, and largest.

Let's just say we have four chunks of memory to work with: A, B, C, and D. L1 can hold only one chunk at a time, L2 two at a time, L3 three of the four, and main memory all four.

If we do a read, it first goes through L1; if there is a miss, then L2; if a miss, then L3; and if a miss there, the data will always be found in main memory. Understand, though, that the data is cached on the way back, so L3, L2, and L1 will all contain the data just read, evicting as necessary (not always true, but assume this simple cache model in order to understand how to complete your task). So if we read chunk A, then L1, L2, and L3 will all contain chunk A. Now, in this hypothetical model, if we read chunk B, then L1 will contain B, evicting A; L2 will contain A and B; and L3 will contain A and B. Read C, and L1 will contain C, evicting B; let's say that L2 chooses to evict A and contains B and C, and L3 contains A, B, and C. Read D, and L1 will contain D, evicting C; let's say L2 evicts B and contains C and D, and say L3 evicts A and contains B, C, and D.

Assume that we don't know exactly how each cache chooses what to evict and what to keep, but assume that we do know, or can figure out from CPU or motherboard specs or other sources, how large each cache is. If the above example happened in that order, then L1 has D, L2 has C and D, L3 has B, C, and D, and main memory has all four: A, B, C, and D. If, in that state, we read all of chunk A and time it, we are in theory reading it from main memory. It is not purely the time to read that memory: if any of the data being evicted has been modified, it has to be written back upstream, possibly all the way down. But ideally, if you are mostly doing reads, you will mostly be timing the reads.

Let's say we found ourselves in the situation where chunk D is in L1, C and D are in L2, and B, C, and D are in L3, and we read all of chunk B and time it. Wouldn't that be measuring the time to access L3? With those same starting conditions, reading C would give us the L2 timing, and reading D would give us the L1 timing, right?

The trick is to get yourself into those conditions. The cache sizes are likely not such that L2 is exactly twice the size of L1 and so on, so to completely control what is in L1 you need to read enough data to fill L1. More usefully, if you read an L3-sized amount of data, then in theory L3 holds all of it, L2 holds the last L2-sized portion of it, and L1 holds the last L1-sized portion of it.
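Here is a minimal sketch of that idea in C. The 32 KB and 8 MB sizes and the 64-byte line are made-up numbers you would replace with your CPU's actual values, and POSIX clock_gettime() is just one convenient timer:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed sizes -- replace with the values for your CPU. */
#define L1_SIZE (32 * 1024)
#define L3_SIZE (8 * 1024 * 1024)
#define LINE    64

/* Walk a buffer, touching one byte per cache line; return a checksum so the
   compiler cannot throw the reads away. */
static unsigned long walk(volatile unsigned char *buf, size_t size)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < size; i += LINE)
        sum += buf[i];
    return sum;
}

/* Time one walk over `size` bytes and return nanoseconds per line touched. */
static double time_walk(volatile unsigned char *buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    unsigned long sum = walk(buf, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)(size / LINE);
}

int main(void)
{
    unsigned char *buf = malloc(4 * L3_SIZE);
    if (!buf) return 1;

    /* Read a region much larger than L3 so the first L1-sized block of buf
       has been evicted everywhere (assuming roughly-LRU caches).  A linear
       walk like this invites the prefetcher, so the "main memory" number
       will look optimistic. */
    walk(buf, 4 * L3_SIZE);
    printf("main memory: ~%.2f ns/line\n", time_walk(buf, L1_SIZE));

    /* That same block was just read, so it should now be resident in L1. */
    printf("L1:          ~%.2f ns/line\n", time_walk(buf, L1_SIZE));

    free(buf);
    return 0;
}
```

To target L2 or L3 instead, read a block the size of that level first and then time a re-read of its beginning; by then it should have been pushed out of the smaller levels but still be present in the one you care about.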

Using the data cache is easier than the instruction cache, but you can do this either way: you need at least an L3-sized amount of instructions in main memory, i.e. a large quantity of NOPs. Executing a linear chunk of instructions is no different from reading a linear chunk of memory as far as read cycles go. The instruction side is easier as far as enabling and using the I-cache is concerned; enabling the data cache may or may not be simple, depending on your operating system and how you are managing memory.
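If you do want to try the instruction-cache side, a rough sketch for Linux on x86 might look like the following. The mmap() flags, the 0x90/0xC3 NOP and RET encodings, and the 16 MB size are all assumptions tied to that platform, and a hardened kernel may refuse a writable-and-executable mapping:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/* Assume we want more code than an 8 MB L3 can hold. */
#define CODE_BYTES (16 * 1024 * 1024)

int main(void)
{
    /* Ask the OS for memory we are allowed to execute from. */
    unsigned char *code = mmap(NULL, CODE_BYTES,
                               PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) { perror("mmap"); return 1; }

    memset(code, 0x90, CODE_BYTES - 1); /* x86 NOP */
    code[CODE_BYTES - 1] = 0xC3;        /* x86 RET */

    /* Cast to a function pointer (POSIX-style; not strict ISO C). */
    void (*run)(void) = (void (*)(void))code;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run();                              /* execute a linear stream of instructions */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.3f ns per instruction byte\n", ns / CODE_BYTES);

    munmap(code, CODE_BYTES);
    return 0;
}
```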

old_timer
  • Thank you very much for this. It is definitely a great answer. Using this kind of thinking I was successful at making it happen. – mrkanaly Sep 07 '13 at 02:32
1

You should be able to avoid assembler by looking at the assembler output of the compiler to understand the actual operations.

Even if you get a high-resolution clock, there is little you can do about pre-emption by the OS when running the benchmark. You will need to perform many runs to get meaningful results.
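One way to do that, sketched below, is to keep the best (minimum) time over many runs, so that runs disturbed by the OS are discarded rather than averaged in; the loop counts and the use of CLOCK_MONOTONIC are arbitrary choices:

```c
#include <stdio.h>
#include <time.h>

#define RUNS     1000
#define ACCESSES 100000

static volatile int cell;              /* the location we keep re-reading */

int main(void)
{
    double best_ns = 1e30;

    for (int r = 0; r < RUNS; r++) {
        struct timespec t0, t1;
        long long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ACCESSES; i++)
            sum += cell;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        ns /= ACCESSES;
        if (ns < best_ns)              /* a preempted run only gets slower, never faster */
            best_ns = ns;
        (void)sum;
    }

    printf("best case: ~%.2f ns per access (includes loop overhead)\n", best_ns);
    return 0;
}
```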

Rather than trying to place instructions into the cache, it may be easier to allow the processor to load them as they are run. If you place varying amounts of filler into the procedures, you may be able to get the cache line alignment to what you want.

Pekka
  • Thank you for your response. However, I am very confused by what you are saying here. I am ultimately just trying to measure the latency of each level of cache. I am unfamiliar with cache alignment and am unsure what relevance viewing assembler output of the compiler has further than seeing all assembly at once. – mrkanaly Sep 02 '13 at 20:11
  • You need to review the documentation for the processor and its cache options. Then, look at the locations of the code in memory with a debugger to see how they are aligned with the cache lines. Then, you can count the number of times each line is accessed by the processor to see how many cache hits you have. Depending on the processor and tools you may be able to get actual counters, but this suggestion will work without additional processor support. – Pekka Sep 03 '13 at 20:40
  • You can determine cache sizes from the processor architecture documentation. Alignment will then depend on how your OS maps the virtual address space into physical memory. Alternately, there may be an API to access this information. In Windows you can use the [WMI Win32_Processor class](http://msdn.microsoft.com/en-us/library/aa394373(v=vs.85).aspx) to determine these values. The purpose of seeing the assembler is to allow you to work out the addresses of the individual code elements after they are loaded when you run the benchmark. – Pekka Sep 11 '13 at 09:43
1

You don't really need to look at this at cache-line granularity; it's fairly complicated to maintain (as dwelch points out in his very good answer), and almost impossible to measure unless you repeat it enough times (which in turn can complicate maintaining the right conditions for forcing a hit in a certain cache level).

Instead, you could start by allocating a simple array that resides in contiguous physical space (you may need some tweaking if your OS has an elaborate page-assignment mechanism). Fill this array with data (one access per cache line is enough), and then start reading it repeatedly. If the array is small enough to fit in L1 (for a 32k L1, for example, you could allocate 32k chars or slightly less and access every 64th element), after enough repetitions you should get a majority of the accesses hitting there. You could have some corner cases with other cached lines interfering, such as pagemap entries, the stack, or other heap variables, but in most cases you get an L1 hit, so it evens out nicely. Even events like a context switch (in case you can't configure your system to prevent them) would fade away if you repeat the loop enough times to get stable results.
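For example, a bare-bones version of that first step might look like this, assuming a 32 KB L1 and 64-byte lines (both numbers are guesses to be replaced with your CPU's actual values):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE (32 * 1024)   /* assumed L1 size: small enough to stay resident */
#define LINE       64            /* assumed cache line size */
#define REPEAT     10000         /* enough repetitions to drown out noise */

int main(void)
{
    unsigned char *buf = malloc(ARRAY_SIZE);
    if (!buf) return 1;
    volatile unsigned char *a = buf;   /* volatile so the reads are not optimized out */

    /* Fill it: one access per cache line is enough to bring every line in. */
    for (size_t i = 0; i < ARRAY_SIZE; i += LINE)
        a[i] = (unsigned char)i;

    struct timespec t0, t1;
    unsigned long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPEAT; r++)
        for (size_t i = 0; i < ARRAY_SIZE; i += LINE)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    long accesses = (long)REPEAT * (ARRAY_SIZE / LINE);
    printf("~%.3f ns per access (sum=%lu)\n", ns / accesses, sum);

    free(buf);
    return 0;
}
```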

Then start increasing your data-set size gradually. Once you pass the L1 size you should see a clear degradation in the time per access (overall time divided by the number of accesses). Note that since caches are roughly LRU, the fact that you access your array in linear order means that the moment it is big enough not to fit in L1 you shouldn't get any partial benefit: the last elements will probably evict the first ones just in time to prevent the next iteration from finding them there (although you can still benefit from the fact that loads can go out of order on modern CPUs). Going further, you should see another step in latency once you reach roughly the L2 size (if the L2 in your system is not strictly inclusive, you may get a small benefit from having both L1 and L2, but the access pattern described should prevent that almost completely), and then again when you hit the L3 size.
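A sketch of that sweep, reusing the same kind of loop; the size range, the repetition counts, and the stride macro are arbitrary choices:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define STRIDE   64                     /* one access per line; raise to 128/256 to dodge the streamer (see below) */
#define MAX_SIZE (64 * 1024 * 1024)     /* well past any L3 assumed here */

int main(void)
{
    unsigned char *buf = malloc(MAX_SIZE);
    if (!buf) return 1;
    volatile unsigned char *a = buf;

    /* Touch every page once up front so page faults don't pollute the timing. */
    memset(buf, 1, MAX_SIZE);

    for (size_t size = 4 * 1024; size <= MAX_SIZE; size *= 2) {
        /* Keep the total number of accesses roughly constant per row. */
        long repeat = (64L * 1024 * 1024) / size;

        unsigned long sum = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long r = 0; r < repeat; r++)
            for (size_t i = 0; i < size; i += STRIDE)
                sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        double per = ns / ((double)repeat * (size / STRIDE));
        printf("%8zu KB: %6.3f ns/access (sum=%lu)\n", size / 1024, per, sum);
    }

    free(buf);
    return 0;
}
```

Plotting the ns/access column against the working-set size should show flat plateaus with a step near each cache size.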

Note that some HW features may complicate matters. First and foremost is the HW prefetcher, which is almost guaranteed to kick in and fetch the lines ahead of you. If possible you should disable it through the BIOS; otherwise you could jump in larger strides (128 or even 256 bytes instead of 64). The caches are most likely indexed by address (with some associativity), so this would have the effect of stressing only one in every 2-4 sets and leaving the rest empty, which is fine (as long as you remember to divide the time by the new number of accesses). It would also break the stream enough for you to get the actual timing and not the prefetch-assisted one (you may also have a stride-based prefetcher, but it's usually not as strong as a streamer).

Leeor