
I need to improve the throughput of the system.

We have already been through the usual cycle of optimization and have achieved a 1.5x improvement in throughput.

I am now beginning to wonder if I can utilize the cachegrind output to improve the system's throughput.

Can somebody point me to how to get started on this?

What I understand is that the most frequently used data should be kept small enough to stay in the L1 cache, and the next tier of data should fit in L2.

Is this the right direction to take?

rajeshnair

4 Answers


It's true that cachegrind output by itself does not give much information on how to go about optimizing code. You need to know how to interpret it, and what you say about data fitting into L1 and L2 is indeed the right direction.
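
To make that concrete, here is a minimal C sketch (my own illustration with made-up names, not code from the question): if a hot loop only ever reads one small field, storing that field in its own packed array shrinks the hot working set enough to stay resident in L1/L2, whereas pulling it out of a large struct wastes most of every cache line fetched.

    #include <stddef.h>

    /* Array-of-structs: each element drags ~60 bytes of cold
     * data into the cache just to read one 4-byte field. */
    struct record {
        int  hot;          /* the field the loop actually uses */
        char payload[60];  /* rarely-used data, cached for nothing */
    };

    long sum_aos(const struct record *r, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += r[i].hot;  /* one useful int per 64-byte stride */
        return s;
    }

    /* Struct-of-arrays: the hot field lives in its own packed
     * array, so every cache line fetched is entirely useful
     * data and the hot working set is ~16x smaller. */
    long sum_soa(const int *hot, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += hot[i];
        return s;
    }

Run both versions under cachegrind with a large n and the D1 miss counts should tell the story.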

To fully understand how memory access patterns influence performance, I recommend reading "What Every Programmer Should Know About Memory", an excellent paper by Ulrich Drepper, the GNU libc maintainer.

Laurynas Biveinis
  • Thanks Kaustaurya, that was indeed an excellent article. I remember going through this article earlier, but I was able to appreciate the intricacies much better this time. – rajeshnair Nov 13 '09 at 14:23

If you're having trouble parsing the cachegrind output, look into KCacheGrind (it should be available in your distro of choice). I use it and find it quite helpful.
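
For reference, a typical workflow looks something like this (./myprog stands in for your binary, and <pid> is whatever process ID valgrind appends to the output file). Running callgrind with cache simulation turned on produces a profile that KCacheGrind can display alongside the call graph:

    valgrind --tool=cachegrind ./myprog      # simulate caches, write cachegrind.out.<pid>
    cg_annotate cachegrind.out.<pid>         # text report of miss counts per function
    valgrind --tool=callgrind --cache-sim=yes ./myprog
    kcachegrind callgrind.out.<pid>          # browse cache events plus call graph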

Stephen Newell

According to the Cachegrind documentation, what cachegrind gives you is the number of cache misses for a given part of your code. You need to know how caches work on the architecture you are targeting in order to know how to fix the code. In practice this means making data smaller or changing the access pattern of some data so that cached data stays in the cache (sketched below). You also need to understand your program's data and its access patterns before you can act on the information. As the manual says,

In short, Cachegrind can tell you where some of the bottlenecks in your code are, but it can't tell you how to fix them. You have to work that out for yourself. But at least you have the information!
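
To illustrate the "changing the access pattern" part, here is a small C sketch (my own; N and the function names are arbitrary). The two functions do identical work, but the traversal order makes one of them cache-hostile on a row-major C array.

    #include <stddef.h>

    #define N 1024

    /* Column-major traversal of a row-major array: consecutive
     * iterations touch addresses N*sizeof(double) apart, so nearly
     * every access misses once the matrix outgrows the cache. */
    double sum_cols(const double m[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    /* Same work with the loops interchanged: accesses are
     * sequential, so each 64-byte cache line serves 8
     * consecutive doubles before the next line is fetched. */
    double sum_rows(const double m[N][N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

In cg_annotate output, the column-order version should show roughly one D1 read miss per element, while the row-order version shows about one miss per cache line.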

Mr. Shiny and New 安宇

1.5x is a nice speedup. It means you found something that took 33% of the time and got rid of it. I bet you can do more, even before you get down to low-level issues like the data memory cache. Here is an example of how. Suppose there is some additional performance problem (and opportunity for speedup) that was not large before, say 25% of the run time. A 1.5x speedup cuts total time to about 67% of the original, so that 25% now accounts for 37.5% of what is left; it is "worth more" than it was. Often such a problem takes the form of a mid-stack function call requesting work that, once you know what it costs, you may decide isn't completely necessary. Since kcachegrind does not really pinpoint these calls, you may never realize they are a problem.

Mike Dunlavey
  • I mostly agree. However, I wouldn't consider the cache a low-level issue. Any platform you are likely to be targeting will have a cache (even modern CUDA cards). Optimizing for the cache is also likely to yield large improvements and can be done without looking at the assembly output from the compiler. – Jørgen Fogh Jun 14 '13 at 09:05
  • @JørgenFogh: Right. The processor developers have done everything possible in their domain to optimize processing time. We software developers don't always reciprocate by ensuring that our code is "lean and mean". I see it all the time. – Mike Dunlavey Jun 14 '13 at 13:13
  • That is definitely true. My point is that there is no way that a good processor can make up for an inefficient algorithm. This includes algorithms with poor cache performance. Cache efficiency can't be added as an afterthought. – Jørgen Fogh Jun 14 '13 at 14:26