Quick way to count number of instructions executed in a C program

Question

Is there an easy way to quickly count the number of instructions executed (x86 instructions: which ones, and how many of each) while executing a C program?

I use gcc version 4.7.1 (GCC) on a x86_64 GNU/Linux machine.

I agree with Doness' answer that typically people want to profile execution time per function. However, if you really want to get exact counts of each instruction executed, then you need to run your code on an instruction set simulator, such as http://www.simplescalar.com/ — TJD, Nov 09 '12 at 18:18
Can you elaborate on what you are trying to accomplish? On x86, instruction execution performance depends far, far more on context than it does on the actual instruction -- virtually all instructions can optionally be loads or stores, for example. And purely register-to-register instructions are going to depend in complex ways on the pipeline state on modern CPUs. This doesn't sound like useful information to me. — Andy Ross, Nov 09 '12 at 19:08
Why do you ask? Usually *profiling* means something different... Eg compile with `gcc -pg -Wall -O` and use `gprof` or perhaps `oprofile` !! — Basile Starynkevitch, Nov 09 '12 at 19:17
I am implementing a complex mathematical algorithm and I wanted to count the number of multiplications (and divisions) that happen during its execution. I was looking for an easy way other than looking at the high-level code and inferring the numbers. Maybe I should use a custom multiply function and insert a counter in it. — Jean, Nov 09 '12 at 19:51
Memory accesses, notably with cache misses, cost much more than divisions. Arithmetic is essentially free on recent processors; what matters is memory accesses and cache misses. When the processor gets a cache miss and has to fetch data from your RAM modules, it loses many hundreds of clock cycles (enough to compute dozens of divisions with register operands). — Basile Starynkevitch, Nov 09 '12 at 20:19
I agree, but this application is eventually going to run on custom hardware with zero-wait memory, where 32-bit/64-bit multiplication/division is going to be costly. I wanted to get an estimate of the math overhead involved beforehand, during prototyping. The math operations are essentially going to remain the same when porting to the real platform. — Jean, Nov 09 '12 at 20:38
I'm not sure I believe "zero wait memory", even L1 cache on modern CPUs is 4 cycles! But regardless: look to tricks like building your app in C++ using a custom `operator*()` implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy to detect way (consider the classic tricks played with the `LEA` instruction). — Andy Ross, Nov 09 '12 at 20:47
Related [How do I determine the number of x86 instructions executed in a C program?](//stackoverflow.com/q/54355631) — Peter Cordes, Jan 25 '19 at 03:37
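
The counter idea raised in the comments (a custom multiply function with a counter in it) could look roughly like the sketch below. It is only an illustration in plain C: the names counted_mul, counted_div, mul_count, div_count and horner are hypothetical and not taken from the thread, and it counts source-level operations, not the machine instructions the compiler finally emits.

```c
#include <stdint.h>

/* Hypothetical global counters (illustration only, not thread-safe). */
static unsigned long mul_count = 0;
static unsigned long div_count = 0;

/* Wrap each arithmetic operation so that every call is counted. */
static inline int64_t counted_mul(int64_t a, int64_t b)
{
    mul_count++;
    return a * b;
}

static inline int64_t counted_div(int64_t a, int64_t b)
{
    div_count++;
    return a / b;
}

/* Example algorithm: Horner evaluation of
 * c[0] + c[1]*x + ... + c[n-1]*x^(n-1), one multiplication per coefficient. */
int64_t horner(const int64_t *c, int n, int64_t x)
{
    int64_t acc = 0;
    for (int i = n - 1; i >= 0; i--)
        acc = counted_mul(acc, x) + c[i];
    return acc;
}
```

After the algorithm has run, printing mul_count and div_count gives the operation counts for that input. Because the wrappers sit at the source level, the counts should stay the same when the code is ported to another platform, which is what the asker wanted to estimate.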

Ciro Santilli OurBigBook.com · Answer 1 · 2022-12-18T21:24:30.930

Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS

This Linux system call appears to be a cross-architecture wrapper for performance events, including both hardware performance counters from the CPU and software events from the kernel.

Here's an example adapted from the perf_event_open man page:

perf_event_open.c

#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("Used %lld instructions\n", count);

    close(fd);
}

Compile and run:

g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out

Output:

Used 20016 instructions

So we see that the result is pretty close to the expected value of 20000: 10k * two instructions per loop in the __asm__ block (sub, jne).

If I vary the argument, even to low values such as 100:

./perf_event_open.out 100

it gives:

Used 216 instructions

The constant overhead of 16 instructions is maintained, so accuracy seems pretty high. Those 16 are presumably the user-space instructions executed around the two ioctl calls that enable and disable the counter.

Now you might also be interested in:

prevent reordering of the syscalls: Enforcing statement order in C++
prevent the test loop from being optimized out: How to prevent GCC from optimizing out a busy wait loop?

Other events of interest that can be measured by this system call:

cycle counts: How to get the CPU cycle count in x86_64 from C++?

Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.
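
As a sketch of how the example above might be reused, the setup can be wrapped in a helper that counts the instructions of an arbitrary function and reports errno when the counter cannot be opened. The names count_instructions and spin are hypothetical, not part of any library, and the counter settings are assumed to be the same as in the example above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count user-space instructions retired while
 * fn(arg) runs. Returns the count, or -1 if the counter is unavailable. */
long long count_instructions(void (*fn)(void *), void *arg)
{
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    /* pid = 0, cpu = -1: measure the calling thread on any CPU. */
    int fd = (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0UL);
    if (fd == -1) {
        /* errno tells permission problems apart from missing hardware support. */
        fprintf(stderr, "perf_event_open: %s\n", strerror(errno));
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = -1;
    close(fd);
    return count;
}

/* Example workload: a countdown loop the compiler cannot remove. */
void spin(void *arg)
{
    volatile unsigned long i = *(unsigned long *)arg;
    while (i > 0)
        i--;
}
```

Calling count_instructions(spin, &n) with an unsigned long n returns a count that includes the call overhead as well as the loop. The printed error string should help tell a permissions problem from missing counter support.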

perf stat CLI utility

The perf CLI utility can print an instruction estimate. Ubuntu 22.04 setup:

sudo apt install linux-tools-common linux-tools-generic
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Usage:

perf stat <mycmd>

Let's test with the following Linux x86 program, which loops 1 million times. Each iteration has 2 instructions, inc and loop, so we expect about 2 million instructions.

main.S

.text
.global _start
_start:
    mov $0, %rax
    mov $1000000, %rcx
.Lloop_label:
    inc %rax
    loop .Lloop_label

    /* exit */
    mov $60, %rax   /* syscall number */
    mov $0, %rdi    /* exit status */
    syscall

Assemble and run:

as -o main.o main.S
ld -o main.out main.o
perf stat ./main.out

Sample output:

 Performance counter stats for './main.out':

              1.51 msec task-clock                #    0.802 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 2      page-faults               #    1.328 K/sec                  
         5,287,702      cycles                    #    3.511 GHz                    
         2,092,040      instructions              #    0.40  insn per cycle         
         1,017,489      branches                  #  675.654 M/sec                  
             1,156      branch-misses             #    0.11% of all branches        

       0.001878269 seconds time elapsed

       0.001922000 seconds user
       0.000000000 seconds sys

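So it reports about 2 million instructions, only about 92k (roughly 4.6%) more than expected. It is not absolutely precise, but it is good enough for many applications. We also get some other useful statistics, such as branch misses and page faults.

The extra instructions presumably come from work outside the loop that perf still attributes to the process, such as kernel code run on its behalf during startup and exit. Restricting the event to user space, e.g. with perf stat -e instructions:u ./main.out, should narrow the gap.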

perf can also do a bunch more advanced things, e.g. here I show how to use it to profile code: How do I profile C++ code running on Linux?

When I run this I get: "Error opening leader 1". Does this require root privilege? I checked the documentation for perf_event_open and this doesn't seem to be the case but I might be missing something. — Alex Spurling, May 31 '21 at 14:00
@AlexSpurling I have just re-run on Ubuntu 20.10 + same hardware as mentioned in the answer now and it worked without sudo. Therefore, either you're missing some kernel config, or there's some hardware support issue. What's your distro + exact CPU model? Dedicated discussion at: https://stackoverflow.com/questions/38442839/perf-event-open-always-returns-1 — Ciro Santilli OurBigBook.com, May 31 '21 at 14:07

lol · Answer 2 · 2021-05-20T17:03:01.713

Intel Pin's `inscount`

You can use the binary instrumentation tool 'Pin' by Intel. I would avoid using a simulator (they are often extremely slow). Pin does most of what you can do with a simulator, without recompiling the binary and at close to normal execution speed (depending on the Pin tool you are using).

To count the number of instructions with Pin:

Download the latest (or 3.10 if this answer gets old) pin kit from here.
Extract everything and go to the directory: cd pin-root/source/tools/ManualExample/
Make all the tools in the directory: make all
Run the tool called inscount0.so using the command: ../../../pin -t obj-intel64/inscount0.so -- your-binary-here
Read the instruction count from the file inscount.out: cat inscount.out

The output would be something like:

➜ ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp       itrace.cpp
buffer_windows.cpp     little_malloc.c
countreps.cpp          makefile
detach.cpp         makefile.rules
divide_by_zero_unix.c  malloc_mt.cpp
isampling.cpp          w_malloctrace.cpp
➜ cat inscount.out
Count 716372

score 2 · Answer 3 · answered Dec 18 '16 at 23:44

You can easily count the number of executed instructions using the hardware performance counters (HPC). To access the HPC, you need an interface to them. I recommend PAPI, the Performance API.

husin alhaj ahmade

Could you expand the answer? While a good pointer, for someone who does not know these technologies, it is difficult to know what exactly it is. – user2316602 Feb 16 '19 at 18:30
@user2316602, today's processors are equipped with special registers called hardware performance counters, or a hardware performance monitoring unit. These registers can be configured to count micro-architectural events such as cache misses, the number of store and load instructions, and the number of executed instructions, also called retired instructions. Some operating systems provide an interface to access these counters directly. I have run many experiments using these counters. The best way is to use the PAPI infrastructure. [PAPI](http://icl.cs.utk.edu/papi/docs/) – husin alhaj ahmade Feb 18 '19 at 07:06

score 1 · Answer 4 · edited May 23 '17 at 11:53

Probably a duplicate of this question

I say probably because you asked for the assembler instructions, but that question handles the C-level profiling of code.

My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.

I might miss something important here, though.

answered Nov 09 '12 at 18:08

Doness

Number of CPU instructions *executed* would be an easy way to compare algorithms without worrying about hiccups or competing for resources with other programs, independently of processing power although still dependent on instruction set. – mpen Mar 22 '16 at 21:53
@mpen: not necessarily, e.g. if you have one algorithm which uses large lookup tables, and another which does the same thing using a more computational approach, then the first may have a lot more load instructions, each of which could potentially stall for > 100 cycles due to cache misses. Similarly you might have one algorithm which uses a lot of expensive instructions, e.g. `FSQRT`, and another algorithm which avoids such expensive instructions and maybe uses a few more adds/multiplies - the second may well be faster even though it executes more instructions. – Paul R Dec 18 '16 at 23:58
you did not answer the question – Adham Zahran Apr 20 '22 at 15:58

score 1 · Answer 5 · answered Apr 18 '19 at 07:04

Although not "quick" depending on the program, this may have been answered in this question. Here, Mark Plotnick suggests using gdb to single-step the program and count the steps until the program counter reaches a stopping address:

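# instructioncount.gdb
set pagination off
# Start the program and stop at main, so that $pc is valid
break main
run
set $count=0
# Single-step one machine instruction at a time until the stopping address
while ($pc != 0xyourstoppingaddress)
    stepi
    set $count++
end
print $count
quit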

Then, start gdb on your program:

gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments

To get the end address 0xyourstoppingaddress, you can use the following script:

# stopaddress.gdb
break main
run
info frame
quit

which puts a breakpoint on the function main, and gives:

$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
 rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
 source language c.
 Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...

Here what is important is the saved rip 0x7ffff7a66d20 part. On my CPU, rip is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.

So in this case, the stopping address is 0x7ffff7a66d20, which is the return address of the main function. That is, the end of the program execution.
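
The same single-stepping idea can also be sketched in C with ptrace, without gdb. This is an illustration only, not part of the original answer: count_steps is a hypothetical name, and unlike the gdb script it starts counting at process start (including the dynamic loader), not at main, so it reports larger numbers. Like stepi, it is slow for long-running programs.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] under ptrace and count the
 * single-step requests issued until the program exits.
 * Returns the count, or -1 if tracing could not be set up. */
long count_steps(char *const argv[])
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced, then replace ourselves with the target. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status;
    /* A traced child stops with SIGTRAP right after a successful exec. */
    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1; /* tracing or exec failed in the child */

    long steps = 0;
    while (WIFSTOPPED(status)) {
        /* Let the child execute exactly one instruction. */
        if (ptrace(PTRACE_SINGLESTEP, child, NULL, NULL) == -1) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
            return -1;
        }
        steps++;
        if (waitpid(child, &status, 0) == -1)
            return -1;
    }
    return steps;
}
```

Passing an argument vector such as {"./yourexecutable", NULL} returns the number of instructions stepped. Signals sent to the traced program are not forwarded in this sketch, so it is only suitable for simple programs.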
