3

When profiling, we know that 'self time' means the time spent in a method itself, excluding the functions it calls, whereas 'total time' means the overall time of that method including the child functions it calls.

I have a simple function where I'm doing only computation on arrays and not calling any other function inside it. However, when profiled, the self and total times are quite different, both for the function itself and for each line in the function. Here is a sample profiled line from Zoom Profiler, run with 32 threads.

Timing of the line in the for loop:

Total        Self         Code
29.4 sec     16.9 sec     id=*(pid); 

What baffles me is that, based on the first paragraph, if a line of code doesn't call any method, its self time and total time should be identical. Any ideas on what is happening?

Ali Yekta
    SUGGESTIONS: 1) condense your test case into a simple function called directly from `main()`: no threads, "nothing extra". See if "Total" and "Self" are closer. 2) Try your "simplified test case" on a different compiler/different profiler than Zoom Profiler. For example, gcc/gprof. – paulsm4 Nov 27 '15 at 02:33
  • 1
  • I'll guess that despite this also being tagged as C, you're using C++. And it's close to impossible to write C++ without some function being called somewhere. But obviously, without the code, one can only guess... – Voo Nov 27 '15 at 07:31
  • The other two comments are right. Also, I would suggest stepping through the code at the disassembly level. That way, if the line calls a function, you will see it. I can easily imagine that it's doing some validation checking on `pid`. – Mike Dunlavey Nov 27 '15 at 13:01
  • Thank you all for your insightful comments and suggestions. @paulsm4 As the code is part of a bigger project, I could not reproduce this timing by moving it into main(); however, I did manage to run with 1 thread and saw that the self and total times get closer. ` 5.1sec 4.6sec 2,719 idi= *(pid); ` – Ali Yekta Nov 28 '15 at 04:39
  • @Voo actually it is C. Zoom says that this line of code is calling a thread-handler function in some header file; however, I'm not sure if it's the other way around or not. But yes, there is this call, so that's why it shows in the total. Although the disassembly shows no such call, I'm not sure why it is being called (to mitigate write conflicts?) – Ali Yekta Nov 28 '15 at 04:40
  • @MikeDunlavey Disassembly level for the above shows 3 lines, so I'm not sure where the mentioned thread-handler is being called: `331.5ms 322.7ms + 0xaf mov rax, [rbp-0x50] 525.6ms 34.9ms + 0xb3 mov rax, [rax] 4.3sec 4.2sec + 0xb6 mov [rbp-0x20], rax` Thank you all again – Ali Yekta Nov 28 '15 at 04:40
  • I find it a bit strange that it shows time rather than cycles. I don't know about Zoom; I use Valgrind for this kind of stuff. On the other hand, you actually can measure time, but when are the timers started or stopped when measuring that, or is it just divided by your FLOPS value? I would prefer seeing the consumed cycles for that reason... – Gerhard Stein Nov 28 '15 at 07:48

1 Answer

3

I wouldn't worry too much about that.

The statistics that profilers show, even good profilers, have to be taken "with a grain of salt". What those stats imply is that, on some stack samples containing that line of code, the line was not the last one on the stack, i.e. it appeared to be calling something, and that's hard to believe. It's easier to believe the profiler is mistaken. These days, with the way processors have pipelines that run way ahead, the attribution can get pretty confusing.

In any case, if the reason you are profiling is to find speedups, it's hard to see how to speed up that particular line of code. You're looking for stuff you can speed up, not stuff you can't.

BTW, I would recommend not looking at seconds or milliseconds or any absolute time measurement, but rather at fractional time: time divided by total running time. Real speedups tend to account for large fractions, from 5% to 99%. Furthermore, don't bother with self time: inclusive time includes it, and if the program is of any significant size the stack samples get deep, so the non-terminal lines on the stack samples (i.e. function calls) are a rich hunting ground for speedups. Finally, measurement precision is not necessary, so a large number of samples is not necessary either.

That's why many people use this technique.


EDIT: FWIW, here's what I mean. Here are 5 random-time stack samples:

---------------------------
... some calls above
... exp(4.02) ...
... some calls below
---------------------------
... some calls above
... exp(0.35) ...
... some calls below
---------------------------
... some calls above
... push_back(...) ...
...
... new ...
...
---------------------------
...
... myArray[...]
...
---------------------------
... some calls above
... exp(0.35) ...
... some calls below
---------------------------

Three of them have calls to exp on the stack, in your code or code you can edit, quite possibly from different places in the code. That implies you are spending roughly 60% of your time calling exp. You can't speed up exp itself, but could you call it less?

Now notice that the second and fifth samples call exp with the same numeric argument. (That's something no profiler could tell you.) What does that mean? It means roughly 40% of time is in exp calls with repeat arguments, which you could possibly avoid by remembering prior values, or by memoizing exp.

If you take samples until you see some pattern that you could do something about, and you see it two or more times, you've got a healthy speedup. The fewer samples you have to take to get to that point, the bigger the speedup. Then you can repeat the whole process, until you can't any more. That's a reliable way to get speed that's hard to beat.

Mike Dunlavey