
I'm analyzing some code and using cachegrind to get the number of cache misses (L2 and L3) during execution.

My question is: how do I determine the time spent waiting for the cache to become ready, based on the cache misses?

I would like to be able to say something like, "my code gets 90% CPU utilization".

Is it possible to do this based on the cachegrind output?

Martin Kristiansen
    According to the [documentation of cachegrind](http://cs.swan.ac.uk/~csoliver/ok-sat-library/internet_html/doc/doc/Valgrind/3.7.0/html/cg-manual.html), "On a modern machine, an L1 miss will typically cost around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs in the region of 10 to 30 cycles." – Matteo Italia Aug 30 '12 at 09:37
  • You can dig numbers from the spec of your system -- if you find them -- or use tools like [lmbench](http://www.bitmover.com/lmbench/) to measure it. You can then plug them in `cachegrind` (or is it a feature of `kcachegrind` which we are using to visualize the data?). – AProgrammer Aug 30 '12 at 09:45
  • @MatteoItalia: so let's say I have 1,000 L2 misses; that converts into 200,000 unused cycles. On a 2.0 GHz chip that means 0.0001 seconds of waiting? – Martin Kristiansen Aug 30 '12 at 09:45
  • @AProgrammer: I was just reading the output from `valgrind --tool=cachegrind` – Martin Kristiansen Aug 30 '12 at 09:46
  • @Martin: CPU behaviour is too complex to quantify so simply. A cache miss may have zero impact on performance, if the data which was unavailable turns out not to have been required. –  Aug 30 '12 at 09:49
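To make the arithmetic in the comment thread concrete, here is a back-of-envelope sketch. The penalty figures are rough assumptions taken from the cachegrind manual (~10 cycles per L1 miss, ~200 per last-level miss), not measurements of any particular chip; real CPUs overlap misses with other work, so treat the result as a ceiling, not a measurement.

```shell
# Naive upper-bound estimate of time spent stalled on cache misses,
# using the cachegrind manual's ballpark penalties and the counts
# from the comment above (1,000 LL misses on a 2.0 GHz chip).
L1_MISSES=0
LL_MISSES=1000
CLOCK_HZ=2000000000   # 2.0 GHz

awk -v l1="$L1_MISSES" -v ll="$LL_MISSES" -v hz="$CLOCK_HZ" \
    'BEGIN { printf "%.6f seconds\n", (l1 * 10 + ll * 200) / hz }'
# prints: 0.000100 seconds
```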

3 Answers


Cachegrind simply simulates execution on a CPU, emulating how the cache and branch predictor might behave. Knowing how long you would spend blocked on the cache would require a lot more information. Specifically, you would need to know when execution can be speculated and how many instructions can be dispatched in parallel (as well as how many memory accesses can proceed simultaneously). Cachegrind can't do this, and any tool that could would depend heavily on the processor (whereas cache miss counts are much less processor-dependent).

If you have access to a modern Intel CPU, I'd recommend getting a free copy of VTune (for non-commercial purposes) and seeing what it says. It can tell the processor to collect data on cache misses and report it back to you, so you can see what actually happened rather than just a simulation. It will give you a clocks-per-instruction figure for each line of code; using this you can see which lines are blocking on the cache (and for how long). It can also give you all the other information cachegrind can.

You can get it here:

http://software.intel.com/en-us/articles/non-commercial-software-download/

jleahy
  • It appears that using vtune on an Ubuntu machine is out of the question :-( Error: Failed to start profiling because the scope of ptrace system call application is limited. To enable profiling, please set /proc/sys/kernel/yama/ptrace_scope to 0. See the documentation for instructions on enabling it permanently. – Martin Kristiansen Aug 30 '12 at 10:23
  • `sudo echo 1 > /proc/sys/kernel/yama/ptrace_scope` should fix that. – jleahy Aug 30 '12 at 10:30
  • Sorry, I meant zero. Ah, without root access it's going to be difficult. – jleahy Aug 30 '12 at 10:57
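As an aside, a plain `sudo echo 0 > …` will not actually work, because the redirection is performed by the unprivileged shell before `sudo` runs. One way around that (assuming a Yama-enabled kernel, as the error message above indicates) is:

```shell
# The redirect in `sudo echo 0 > file` is done by the *unprivileged*
# shell, so it fails with "Permission denied". Let tee, running under
# sudo, perform the write instead:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
```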

The only way to be sure is to use your CPU's performance-monitoring counters to measure your particular CPU - and even then, the results are very specific: any optimisations you make based on them may perform very badly on CPUs with different cache sizes, bus architectures or memory configurations.
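On Linux, one concrete way to read those counters is the kernel's `perf` tool. A sketch, where `./your_program` is a placeholder for the binary under test, and the generic event names are assumptions - they are aliases that `perf` maps onto model-specific counters, and not every CPU supports all of them (check `perf list`):

```shell
# Count cache events on real hardware rather than simulating them.
# The event aliases below may be unsupported on some machines.
perf stat -e cycles,instructions,cache-references,cache-misses ./your_program
```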

marko

A variable can be fetched from the cache in a few clock cycles.

It can take more than one hundred clock cycles to fetch it from RAM if it isn't in the cache.

gavlaaaaaaaa
  • I know, and there are different speed issues at the different cache levels; the interesting issue is, what is the expected speedup from eliminating cache misses? You're sort of suggesting a speedup in the 100x range, which sounds optimistic. – Martin Kristiansen Jun 12 '13 at 19:49
  • Yeah, it does sound a little optimistic, but if a 1 GHz clock does a billion cycles a second, you can see how multiple cache misses can quickly eat into those cycles. As for an actual algorithm to work out a speedup, I've never seen one mentioned, as it all depends on the hardware etc. But there are applications out there that can help, such as Cachegrind mentioned above. – gavlaaaaaaaa Jun 17 '13 at 11:15