
I'm currently working on a project where I modify programs at the assembly level. The transformation is really simple: I just insert some masking operations at specific locations in the code.

I want to know how many times my masking instructions are executed, to get a precise idea of the transformation's cost.

Currently I'm using GDB: I set hardware breakpoints where my masking instructions are located, and afterwards I can get the hit count of each breakpoint with `info breakpoints`.

However, GDB is super slow: it does not finish even after a night of computation for programs that should normally take under 10 s.

I'm pretty sure I'm using hardware breakpoints; I always set fewer than 4 of them (I'm running on an Intel processor with 4 debug registers).

My version of GDB is 8.0.

I was thinking of using a profiler, and I've taken a quick look at valgrind, gcov and gprof, but none of them seems to suit my needs.

Does anyone know any tool that could help me? Or does someone know how to speed up my idea with GDB?

Thanks

EDIT: I run on Linux x86-64

  • Consider using something like [pprof](https://github.com/gperftools/gperftools). While it doesn't give you exact figures, it can give you a very good estimate of how much time is spent in your masking code. – fuz Sep 14 '17 at 15:36
  • The OS and processor are likely to matter here. Linux? x86_64? Something else? – Employed Russian Sep 14 '17 at 15:56
  • When inserting the masking instructions: Can you also insert a call to a function that counts how many times it has been called? – Terje D. Sep 14 '17 at 16:18
  • I didn't think of this since I'm not that confident in my assembly skills. But this should be doable, thanks for the tip! (A sketch of this counting approach appears right after these comments.) – NinjaSansBonnet Sep 14 '17 at 16:21
  • You can't get a *precise* measure of performance cost by simply counting instructions!! If the masking isn't on the critical path and there isn't a front-end bottleneck, executing a couple of extra instructions might be nearly free (e.g. if the program is bottlenecked on something else). See http://agner.org/optimize/ for more about x86 asm performance. (Also https://stackoverflow.com/tags/x86/info for more x86 performance links) – Peter Cordes Sep 15 '17 at 03:49
  • valgrind's [tool `callgrind`](http://valgrind.org/docs/manual/cl-manual.html) is able to *simulate* execution of your program (at the cost of a 10x-20x slowdown) and record execution counts for all instructions (in `--dump-instr=yes` mode: "*This specifies that event counting should be performed at per-instruction granularity. This allows for assembly code annotation. Currently the results can only be displayed by KCachegrind.*"). But in the modern world of out-of-order CPUs, instruction execution count is not the same thing as time taken, as Peter Cordes points out. – osgx Apr 17 '18 at 13:21
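
Following up on Terje D.'s suggestion above, here is a minimal sketch of the counting idea in C with GCC inline assembly. This is not the asker's actual transformation: the `xor` "masking op", the `mask_hits` counter name, and the loop are all made up for illustration. The counter bump is a single `incq` on a memory location; note that `incq` clobbers EFLAGS, so at the assembly level it has to be inserted at a point where the flags are dead.

```c
/* count_sketch.c -- a sketch of the counter idea from the comments.
 * Everything here (the xor "masking op", the mask_hits name, the loop)
 * is made up for illustration; the real transformation happens at the
 * assembly level. Build: gcc -O2 count_sketch.c -o count_sketch */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t mask_hits;          /* bumped once per executed masking op */

static void report(void)            /* print the total at normal exit */
{
    printf("masking instructions executed: %" PRIu64 "\n", mask_hits);
}

int main(void)
{
    atexit(report);

    for (unsigned i = 0; i < 1000000; i++) {
        unsigned v = i;
        /* Stand-in for one inserted masking instruction, followed by the
         * one-instruction counter bump. incq clobbers EFLAGS, so it must
         * go where the flags are dead; it is also not thread-safe (a
         * multithreaded program would need lock incq instead). */
        __asm__ volatile("xorl $0x5a, %0\n\t"
                         "incq %1"
                         : "+r"(v), "+m"(mask_hits)
                         :
                         : "cc");
        (void)v;
    }
    return 0;
}
```

At exit this prints `masking instructions executed: 1000000`. The per-hit cost is a single memory increment instead of a trap into the debugger, which is why it finishes in milliseconds where the breakpoint approach runs all night.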

3 Answers


Does anyone know any tool that could help me?

You can try `perf annotate`. There is an example of it in the perf tutorial:

It is possible to drill down to the instruction level with perf annotate. For that, you need to invoke perf annotate with the name of the command to annotate. All the functions with samples will be disassembled and each instruction will have its relative percentage of samples reported:

perf record ./noploop 5
perf annotate -d ./noploop

------------------------------------------------
 Percent |   Source code & Disassembly of noploop.noggdb
------------------------------------------------
         :
         :
         :
         :   Disassembly of section .text:
         :
         :   08048484 <main>:
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
[...]
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
   15.08 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.52 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
   14.27 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.13 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>
[...]

The first column reports the percentage of samples for function `noploop()` captured at that instruction. As explained earlier, you should interpret this information carefully.

ks1322

Every time the program hits one of those breakpoints, it has to stop and hand control to GDB, which goes off to do something. So naturally it takes forever.

I would use GDB, but not breakpoints. I would just halt it at random manually (or you could use a timer). Each time it halts you can see exactly which instruction it is at. If you do this 10 or 20 times, you can easily estimate the fraction of time spent at each instruction (in other words, how much of the total time that instruction is responsible for), and you can see how that changes with your masking.

You don't get really precise time fractions this way unless you take a lot of samples, but what you do get is very reliable.
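
This manual approach can also be automated. Below is a minimal sketch of the same idea done programmatically, assuming Linux x86-64 (per the question's EDIT); the handler name, the roughly 100 Hz rate, and the busy loop are all made up for illustration. A SIGPROF interval timer periodically interrupts the program, the handler records the interrupted instruction pointer from the signal context, and the collected addresses can be mapped back to instructions afterwards with addr2line or objdump.

```c
/* sample_rip.c -- a programmatic version of the halt-at-random idea.
 * Linux x86-64 only (per the question's EDIT); the handler name, the
 * 100 Hz rate and the busy loop are made up for illustration.
 * Build: gcc -O2 sample_rip.c -o sample_rip */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 4096

static volatile int nsamples;
static void *samples[MAX_SAMPLES];

/* Record the instruction pointer that SIGPROF interrupted. */
static void on_prof(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = (void *)uc->uc_mcontext.gregs[REG_RIP];
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* Fire roughly 100 times per second of consumed CPU time. */
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile unsigned x = 0;                /* stand-in workload */
    for (long i = 0; i < 200000000L; i++)
        x ^= (unsigned)i;

    memset(&it, 0, sizeof it);              /* stop sampling */
    setitimer(ITIMER_PROF, &it, NULL);

    for (int i = 0; i < nsamples; i++)      /* feed to addr2line/objdump */
        printf("%p\n", samples[i]);
    return 0;
}
```

The fraction of samples landing on the inserted masking instructions estimates their share of the run time, which, as Peter Cordes notes in the comments, is a more meaningful cost measure than a raw hit count.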

Then again, gprof should also do what you want, as much as I hate to admit it :)

Mike Dunlavey

gdb has nothing to do with it; you have to rely on what the hardware/chip offers. First off, you don't count how many times an assembly instruction is hit, because processors don't understand assembly directly: they understand machine code, and although it would be nice, not all assembly instructions map to one machine instruction, so you have to take it on a case-by-case basis. Terminology aside, then, you have to rely on the silicon.

Next, define "hit". It takes many clock cycles to process and execute an instruction, depending on the design, so many instructions reach the early stages of the pipe before the processor figures out that a branch is happening (say, the instructions in the shadow of a conditional branch). I assume you are not interested in those. (Breakpointing and stopping the flow of the program is not the same as running the program: it changes how the program runs, and you are forcing the instructions in the shadow of the breakpoint to be fetched at least twice as often as they normally would be, depending on what you mean by the term "hit".)

Breakpoints only work if the processor supports them; you can usually put an undefined instruction there and use an undefined-instruction handler, if the processor supports that. (This is all generic: which processor you are using specifically is not relevant, particularly since it appears to be x86, which means there are many different implementations at this point, and "x86-64" doesn't begin to describe the details needed for a complete answer.)

Some processors provide no debugging support at all (at best an undefined instruction, and hope it works), some have a lot, and the rest are somewhere in the middle; some offer exactly the feature you are asking for: watch for "execution" of a particular address, with a counter. Generally, though, breakpoint, resume, and count is your only option. Of course then there is the why: counting how many times one instruction is hit is not immediately relevant to the overall performance of a loop (sometimes it is, if it is the only instruction that touches memory), and the alignment of that instruction or loop sometimes plays a bigger role than the machine code itself. So I wonder how you got to the point where you need to count the "hits" on a specific instruction (which you should also be able to do just by analysis of the code).

old_timer