
Does anyone know of a loop-level profiler for assembly?

I have been using gprof, but gprof hides loops and only profiles at the function level, and to optimize my code I want to go down to the loop level. I want it to be automated and just give me output like gprof does. I was recommended DTrace, yet I have no idea where to start. Can anyone direct me in any way? For example:

main:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $16, %esp
    movl    $5000000, -4(%ebp)
    movl    $0, -12(%ebp)
    movl    $0, -8(%ebp)
    jmp     .L2
.L3:
    movl    -8(%ebp), %eax
    addl    %eax, -12(%ebp)
    addl    $1, -8(%ebp)
.L2:
    movl    -8(%ebp), %eax
    cmpl    -4(%ebp), %eax
    jl      .L3
    movl    $0, %eax
    leave
    ret

For example, gprof would say main executed 1 time and foo executed 100 times. But I want to know whether .L2 or .L3 executed 1M times; then I would concentrate my optimization there. If my question is vague, please ask me to explain more. Thanks

Syntax_Error

4 Answers


It depends on what OS you are using, but for this kind of profiling you generally want to use a sampling profiler rather than an instrumented profiler.

Paul R
  • Thank you for your reply! Can you please briefly explain to me the difference between sampling profilers and instrumented profilers? Thanks :) – Syntax_Error Jan 04 '11 at 14:03
  • @Syntax_Error: instrumented profilers use special hooks inserted by the compiler at the start and end of each function. These hooks call a profiling library to generate call trees and timing information for each function. Their granularity is at the function level. Sampling profilers OTOH use an unmodified executable; instead they periodically generate an interrupt, e.g. every 100 µs or 1 ms, and take a sample of the program counter. This gives a statistical profile which can be used to analyse execution down to the instruction level (if there are sufficient samples). – Paul R Jan 04 '11 at 14:49
  • @Syntax_Error: They actually sample the whole function call stack, including the PC. In a little hotspot like this, the PC is about all there is, but in big software, a lot of the code to optimize is actually in the form of calls to lower levels. So any call on the stack X% of the time is responsible for X% of execution. The interesting thing is, the percent can be quite inaccurate, and you will still find them, so a large number of samples is nice, but not at all necessary. – Mike Dunlavey Jan 04 '11 at 19:25

I suggest using Callgrind (one of the Valgrind tools, usually installed with it). It can gather statistics at a much more fine-grained level, and the kcachegrind tool is very good for visualising the results.

psmears

If you're on Linux, Zoom is an excellent choice.

If you're on Windows, LTProf might be able to do it.

On any platform, the low-tech random-pausing method can be relied on.

Don't look for how many times instructions are executed. Look for where the program counter is found a large fraction of the time. (They're not the same thing.) That will tell you where to concentrate your optimization efforts.

Mike Dunlavey
  • I'm on Linux (Ubuntu), so I'll be using Zoom. Does it show the number of times an instruction executed and the PC time fraction, or do I have to deduce the PC fraction? – Syntax_Error Jan 04 '11 at 14:05
  • @Syntax_Error: It gives you the PC fraction (as a percent). As I said, don't even *dream* of instruction counts. It's very simple. It stops a bunch of times, and each time it captures the stack (which includes the PC). So if it is spending X% of its time between L2 and L3, that's what it will tell you, and that's how much total time that code is responsible for. Don't count, don't measure, just let the percents tell you what to optimize. A lot of people think this is complicated. It's not. It's that simple. – Mike Dunlavey Jan 04 '11 at 18:26

KCachegrind gives profiling information for each line of source code, including CPU time, cache misses, etc. It has saved my day a couple of times.

However running the code inside the profiler is extremely slow (tens of times slower than native).

Giuseppe Ottaviano