
I have been using oprofile to try to discover why my program is spending so much time in the kernel. I now have the symbols from the kernel, but apparently no link between my program and the kernel that would tell me which parts of my program are taking all that time.

```
samples  %        image name               app name                 symbol name
-------------------------------------------------------------------------------
  201       0.8911  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic _raw_spin_lock_irq
  746       3.3073  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic rb_get_reader_page
  5000     22.1671  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic default_spin_lock_flags
  16575    73.4838  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic _raw_spin_lock
22469    11.1862  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic __ticket_spin_lock
  22469    99.6010  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic __ticket_spin_lock [self]
  26        0.1153  vmlinux-3.0.0-30-generic vmlinux-3.0.0-30-generic ret_from_intr
```

Where do I go from here? How do I discover the places in my program that are causing __ticket_spin_lock?

Dijkstra

2 Answers


Oprofile takes stack samples. What you need to do is not look at summaries of them, but actually examine the raw samples. If you are spending, say, 30% of time in the kernel, and you look at 10 stack samples chosen at random, you can expect 3 of them, more or less, to show you the full chain of calls by which you got into the kernel.

That way you will see things the summaries or call graph won't show you.

IN CASE IT ISN'T CLEAR: Since __ticket_spin_lock is on the stack 99.6% of the time, then on each and every stack sample you look at, the probability is 99.6% that you will see how you got into that routine. Then, if you don't really need to be doing that, you have possibly a 250x speedup (1/(1 - 0.996) = 250). That's like going from four minutes down to one second. Screw the "correct" or "automated" approach - get the results.
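Mike's broader method, as he spells out in the comments below and the linked posts, is to grab the random-time samples manually by halting the program in a debugger and reading the stacks. A minimal sketch of that idea, assuming the program is already running and its binary is named `myprog` (a hypothetical name), using bash and gdb:

```sh
# Rough sketch only: collect ~10 random-time stack samples from a running
# process. "myprog" is a placeholder for the actual program name.
for i in $(seq 1 10); do
    sleep $((RANDOM % 3 + 1))                      # pause for a pseudo-random interval (bash)
    gdb -batch -p "$(pidof myprog)" \
        -ex "thread apply all bt" >> samples.txt   # attach, dump all thread stacks, detach
done
# Now read samples.txt and ask, for each sample, what it was doing and why.
```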

ADDED: The thing about profilers is they are popular and some have very nice UIs, but sadly, I'm afraid, it's a case of "the emperor's new clothes". If such a tool doesn't find much to fix, you're going to like it, because it says (probably falsely) that your code, as written, is near-optimal.

There are lots of postings recommending this or that profiler, but I can't point to any claim of saving more than some percent of time, like 40%, using a profiler. Maybe there are some.

I have never heard of a profiler being used first to get a speedup, and then being used again to get a second speedup, and so on. That's how you get real speedup - multiple optimizations. Something that was just a small performance problem at the beginning is no longer small after you've removed a larger one. This picture shows how, by removing six problems, the speedup is nearly three orders of magnitude. You can't necessarily do that, but isn't it worth trying?

[Figure: six performance problems removed one after another, the per-fix speedups compounding to a total of nearly three orders of magnitude]
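The arithmetic behind that compounding, with illustrative fractions rather than the figure's actual numbers: if fix number i removes work that accounts for a fraction f_i of the run time remaining at that point, the individual speedups multiply:

```latex
S \;=\; \prod_{i=1}^{n} \frac{1}{1 - f_i},
\qquad\text{e.g. } n = 6,\ f_i \approx 0.68 \;\Rightarrow\; S \approx \left(\tfrac{1}{0.32}\right)^{6} \approx 930 .
```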

APOLOGIES for further editing. I just wanted to show how easy it is to fool a call graph. The red lines represent call stack samples. Here A1 spends all its time calling C2, and vice versa. Then suppose you keep the same behavior, but you put in a "dispatch" routine B. Now the call graph loses the information that A1 spends all its time in C2, and vice versa. You can easily extend this example to multiple levels.

[Figure: stack samples before and after inserting the dispatch routine B, which hides the A1-to-C2 relationship from the call graph]

You could say a call tree would have seen that. Well, here's how you can fool a call tree. A spends all its time in calls to C. Now if instead A calls B1, B2, ... Bn, and those call C, the "hot path" from A to C is broken up into pieces, so the relationship between A and C is hidden.

[Figure: the hot path from A to C split across the intermediate routines B1 ... Bn]

There are many other perfectly ordinary programming practices that will confuse these tools, especially when the samples are 10-30 levels deep and the functions are all little, but the relationships cannot hide from a programmer carefully examining a moderate number of samples.

Mike Dunlavey
  • Interesting post, thanks. Is there a way to use the data I've already collected with oprofile? This seems a bit too manual to be "the correct" way of doing this. – Dijkstra Apr 03 '13 at 09:22
  • Correct? Too manual? The >500 votes on [*this post*](http://stackoverflow.com/a/378024/23771) are not there because it's incorrect. If you trace back where the ideas behind performance profiling come from, they are *unfounded*. They are based on holistic concepts of measuring. OTOH, if you have an infinite loop, you don't find it by measuring, do you? No. You treat it as a bug to be found, manually, halting it in a debugger. Well, whenever a program spends more time than it could, the *same technique is best*. You grab random-time samples and examine them carefully. – Mike Dunlavey Apr 03 '13 at 13:00
  • @MikeDunlavey: Yes. First off, `oprofile` uses statistical sampling, which is equivalent to your suggested method (although it requires less manual intervention, and works in cases where it isn't possible to halt the program). In this specific case, sifting through a large quantity of samples by hand is infeasible. I'd have probably suggested checking the callgraph settings here, as they seem to be truncated, or using the `opreport` `--accumulated` option, or limiting the results to the image being profiled, and using these to identify the problematic call chains (a sketch of those options follows these comments). – Hasturkun Apr 03 '13 at 16:51
  • @Hasturkun: 1) The sampling is OK, but what's done with them is not. 2) It is entirely too easy for a problem to hide in the call graph, like if it depends on the data, or the code lines where the calls take place. 3) [*Here are the statistics*](http://scicomp.stackexchange.com/a/2719/1262) why the need for a large number of samples is unfounded. 4) Point 3 of [*this post*](http://stackoverflow.com/a/1779343/23771) shows the problem with call graphs. 5) What speedup factors have you heard of, using any profiler? I'm not saying there aren't any. I'm saying they are small. – Mike Dunlavey Apr 03 '13 at 17:31
  • @MikeDunlavey: Re: 2, in most cases that would still give you an area of code to focus on. (Also, IIRC, the dispatch-routine case shouldn't lose information; though I can't verify this right now, I think oprofile will give you instruction-level details on the calls.) In any case, the profile output should be examined, even looking at the individual samples if needed. It's possible that my opinion is tainted from working on threaded async code, so I'll remove my downvote, although I still feel this doesn't really answer the question. – Hasturkun Apr 03 '13 at 23:00
  • @Hasturkun: Thanks. Threaded async code is another "kettle of fish". I use another method for that. I guess the reason I'm so exercised about this is point 5. Like this little old lady exclaimed in a U.S. fast-food advertisement - "Where's The Beef?" :-) – Mike Dunlavey Apr 04 '13 at 00:03
  • @Hasturkun: BTW, I think I did answer the question, which was " How do I discover the places in my program that are causing __ticket_spin_lock?" Since it is on 99.6% of samples, examining ***one*** stack sample will show where with probability 0.996. If it isn't really necessary and can be removed, that's a possible 250x speedup factor. – Mike Dunlavey Apr 04 '13 at 15:28
  • `__ticket_spin_lock` is on the stack 11.2% of the time. I'm afraid you read the wrong line :( - Still, thank you for your detailed explanation. – Dijkstra Apr 04 '13 at 18:13
  • @Dijkstra: It's on 2 lines, each with >22k samples. So maybe it shows up on average once in 9 samples. Regardless, you'll see what the problem is. – Mike Dunlavey Apr 04 '13 at 18:51
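For reference, the `opreport` options mentioned in the comments above might be combined roughly like this (a sketch only; `./myprog` is a placeholder for the profiled binary, and the exact option combinations may need adjusting for your oprofile version):

```sh
# Per-symbol callgraph output, restricted to samples from the profiled binary:
opreport --callgraph --symbols image:./myprog

# Per-symbol output with accumulated sample counts:
opreport --symbols --accumulated image:./myprog
```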

I agree with Mike's answer: a callgraph is not the right way to inspect the source of the problem. What you really want is to look at the callchains of the hottest samples.

If you don't want to inspect the raw samples collected by oprofile "by hand", you could re-run your application under the `record` command of perf with the `-g` option in order to collect the stack traces. You can then display the samples annotated with their callchains using the `report` command of perf. Since perf does not aggregate the callchains of the individual samples into a global callgraph, you avoid some of the issues outlined in Mike's post.
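A minimal sketch of that workflow, assuming the program is a hypothetical `./myprog` and that perf has permission to profile it:

```sh
# Record samples together with their call stacks (-g):
perf record -g ./myprog

# Browse the report; expanding an entry shows the recorded call chains:
perf report

# Or dump the individual samples with their call chains as plain text:
perf script
```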

redblackbit