Is there an easy way to quickly count the number of instructions executed (x86 instructions: which ones, and how many of each) while executing a C program?
I use gcc version 4.7.1 (GCC) on an x86_64 GNU/Linux machine.
Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS

This Linux system call appears to be a cross-architecture wrapper for performance events, covering both hardware performance counters from the CPU and software events from the kernel.

Here's an example adapted from the man perf_event_open page:
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    read(fd, &count, sizeof(long long));
    printf("Used %lld instructions\n", count);
    close(fd);
}
Compile and run:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out
Output:
Used 20016 instructions
So we see that the result is pretty close to the expected value of 20000: 10k iterations times two instructions per loop in the __asm__ block (sub, jne).

If I vary the argument, even to low values such as 100:
./perf_event_open.out 100
it gives:
Used 216 instructions
still that constant of +16 instructions, so accuracy seems to be pretty high: those 16 must just be the ioctl setup instructions counted after our little loop.
Many other events of interest, both hardware and software, can be measured with the same system call; see man perf_event_open for the full list.
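For example, switching what the program above measures is just a matter of changing the pe.type/pe.config pair before the perf_event_open call. A fragment, assuming the surrounding code from the example above:

```c
/* Count CPU cycles instead of instructions: */
pe.type = PERF_TYPE_HARDWARE;
pe.config = PERF_COUNT_HW_CPU_CYCLES;

/* Or a kernel software event, e.g. page faults: */
pe.type = PERF_TYPE_SOFTWARE;
pe.config = PERF_COUNT_SW_PAGE_FAULTS;
```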
Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.
perf stat CLI utility

The perf CLI utility can print an instruction count estimate. Ubuntu 22.04 setup:
sudo apt install linux-tools-common linux-tools-generic
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Usage:
perf stat <mycmd>
Let's test it with the following Linux x86_64 program, which loops 1 million times. Each loop iteration runs 2 instructions, inc and loop, so we expect about 2 million instructions.
main.S
.text
.global _start
_start:
    mov $0, %rax
    mov $1000000, %rcx
.Lloop_label:
    inc %rax
    loop .Lloop_label

    /* exit */
    mov $60, %rax    /* syscall number */
    mov $0, %rdi     /* exit status */
    syscall
Assemble and run:
as -o main.o main.S
ld -o main.out main.o
perf stat ./main.out
Sample output:
Performance counter stats for './main.out':
1.51 msec task-clock # 0.802 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 1.328 K/sec
5,287,702 cycles # 3.511 GHz
2,092,040 instructions # 0.40 insn per cycle
1,017,489 branches # 675.654 M/sec
1,156 branch-misses # 0.11% of all branches
0.001878269 seconds time elapsed
0.001922000 seconds user
0.000000000 seconds sys
So it says about 2 million instructions, only about 92k off. So it is not absolutely precise, but good enough for many applications. And we also get some other fun statistics like branch misses and page faults.

The extra instructions presumably come from imprecise measurement boundaries that end up including kernel and other processes' instructions; counting only user-space instructions with perf stat -e instructions:u should remove most of that noise.
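As a sanity check on "good enough", the relative error implied by the sample output above can be computed directly (numbers taken from that run):

```shell
awk 'BEGIN {
    measured = 2092040; expected = 2000000
    printf "%.1f%% overhead\n", 100 * (measured - expected) / expected
}'
```

which prints 4.6% overhead.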
perf can also do a bunch of more advanced things; for example, here I show how to use it to profile code: How do I profile C++ code running on Linux?
Intel Pin (inscount)

You can use Intel's binary instrumentation tool Pin. I would avoid using a simulator (they are often extremely slow). Pin does most of what you can do with a simulator, without recompiling the binary, and at near-native execution speed (depending on the Pin tool you are using).
To count the number of instructions with Pin:
cd pin-root/source/tools/ManualExample/
make all
../../../pin -t obj-intel64/inscount0.so -- your-binary-here

The instruction count is written to inscount.out; view it with cat inscount.out. The output would look something like:
➜ ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp itrace.cpp
buffer_windows.cpp little_malloc.c
countreps.cpp makefile
detach.cpp makefile.rules
divide_by_zero_unix.c malloc_mt.cpp
isampling.cpp w_malloctrace.cpp
➜ cat inscount.out
Count 716372
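For reference, the core of a counting Pin tool is quite small. This is a sketch in the spirit of the bundled inscount0.cpp, not the exact bundled source; the API names (INS_AddInstrumentFunction, INS_InsertCall, PIN_AddFiniFunction) are real Pin calls, but the code only builds inside the Pin kit:

```cpp
#include <fstream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: runs before every executed instruction.
static VOID docount() { icount++; }

// Instrumentation routine: Pin calls this once per instruction
// when code is first encountered.
static VOID Instruction(INS ins, VOID *v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

// Called when the instrumented program exits.
static VOID Fini(INT32 code, VOID *v) {
    std::ofstream out("inscount.out");
    out << "Count " << icount << std::endl;
}

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();  // never returns
    return 0;
}
```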
You can easily count the number of executed instructions using hardware performance counters (HPCs). To access them you need an interface; I recommend using PAPI (the Performance API).
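A minimal sketch using PAPI's classic high-level counter API; note that PAPI_start_counters/PAPI_stop_counters were removed in PAPI 6.0 in favor of the PAPI_hl_* interface, and the program needs -lpapi and a kernel that exposes the counters:

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int events[1] = { PAPI_TOT_INS };   /* total instructions completed */
    long long counts[1];

    if (PAPI_start_counters(events, 1) != PAPI_OK)
        return 1;

    /* ... code to measure ... */
    volatile int x = 0;
    for (int i = 0; i < 10000; i++)
        x += i;

    if (PAPI_stop_counters(counts, 1) != PAPI_OK)
        return 1;

    printf("Used %lld instructions\n", counts[0]);
    return 0;
}
```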
Probably a duplicate of this question. I say probably because you asked about assembler instructions, whereas that question deals with C-level profiling of code.
My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.
I might miss something important here, though.
Although not "quick", depending on the program, this may have been answered in this question, where Mark Plotnick suggests using gdb to watch your program counter register change:
# instructioncount.gdb
set pagination off
set $count = 0
while ($pc != 0xyourstoppingaddress)
    stepi
    set $count++
end
print $count
quit
Then start gdb on your program:

gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments

To get the stopping address 0xyourstoppingaddress, you can use the following script:
# stopaddress.gdb
break main
run
info frame
quit
which puts a breakpoint on the main function and gives:
$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
source language c.
Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...
What is important here is the saved rip 0x7ffff7a66d20 part. On my CPU, rip is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.

So in this case, the stopping address is 0x7ffff7a66d20, the return address of the main function; that is, the end of the program execution.