How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop
can be used to monitor general CPU usage, but not specifically SIMD instruction usage.

vtune is usually your best bet. – robthebloke Feb 07 '20 at 00:54
1 Answer
I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).
See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions executed in a C program?, specifically sde64 -mix -- ./my_program to print the instruction mix for your program for that run. There is example output in libsvm compiled with AVX vs no AVX.
I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.
It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.
You'd need hardware instruction-counting support for this to be really efficient the way top is.
For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.
e.g. from perf list output:
fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element]
So it's not even counting uops, it's counting FLOPS. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)
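As a rough illustration of the "each count represents N computations" wording, here is a minimal sketch that turns raw counts into an approximate FLOP total. The function name and the sample counts are made up for the example. Only the factor of 4 for 128-bit packed single comes from the perf list text above; the other multipliers are an assumption based on elements per vector (128/64 = 2, 256/32 = 8, 256/64 = 4).

```shell
# Convert raw fp_arith_inst_retired.* counts into an approximate FLOP total.
# Arguments, in order (hypothetical inputs you would copy from perf stat):
#   128b_packed_single  128b_packed_double  256b_packed_single  256b_packed_double
packed_flops() {
    c128s=$1
    c128d=$2
    c256s=$3
    c256d=$4
    # Multipliers = elements per vector (assumed for all but 128b single).
    # FMA is already counted twice by the hardware event, so no extra factor.
    echo $(( c128s * 4 + c128d * 2 + c256s * 8 + c256d * 4 ))
}

packed_flops 1000 500 250 100   # prints 7400
```

Scalar events are left out here for the same reason they are left out of the question: this only totals packed (SIMD) math.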
You can use perf to count their execution globally across processes and on all cores. Or for a single process:
## count math instructions only, not SIMD integer, load/store, or anything else
## events are listed explicitly: shell brace expansion would separate them with spaces, but perf -e needs commas
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program
(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD, and scalar instructions on XMM registers don't count, IMO.)
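Since perf's -e option wants one comma-separated list, the event names for the perf stat command above can also be generated instead of typed out. This is a minimal sketch in plain POSIX sh; simd_fp_events is a hypothetical helper name, and the event names assume an Intel CPU whose perf list shows the fp_arith_inst_retired.* events.

```shell
# Build a comma-separated list of the packed FP events (user-space only, :u).
simd_fp_events() {
    list=""
    for width in 128 256; do
        for prec in double single; do
            # ${list:+$list,} adds a comma separator only after the first event
            list="${list:+$list,}fp_arith_inst_retired.${width}b_packed_${prec}:u"
        done
    done
    printf '%s\n' "$list"
}

simd_fp_events
# Usage (needs perf installed and a CPU that has these events):
#   perf stat -e "cycles:u,instructions:u,$(simd_fp_events)" ./my_program
```

Using loops rather than brace expansion avoids the space-vs-comma problem and also works in shells without brace expansion.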
(You can attach perf to a running process by using -p PID instead of a command. Or use perf top as suggested in Ubuntu - how to tell if AVX or SSE, is current being used by CPU app?)
You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.
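For something closer to a live view, perf stat can print count deltas periodically with -I (interval in milliseconds). The sketch below only builds and prints such a command line rather than running it, so it works on machines without perf; simd_monitor_cmd is a hypothetical helper, PID 1234 is a placeholder, and the event names again assume an Intel CPU that exposes them.

```shell
# Print a perf stat command line that reports packed-FP counts every
# INTERVAL_MS milliseconds. Remaining arguments select the target:
#   -a      all cores, system-wide
#   -p PID  attach to an already-running process
simd_monitor_cmd() {
    interval_ms=$1
    shift
    events="fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.256b_packed_double"
    echo "perf stat -I $interval_ms -e $events $*"
}

simd_monitor_cmd 1000 -a        # system-wide, one report per second
simd_monitor_cmd 1000 -p 1234   # a single process (placeholder PID)
```

Copy the printed line into a terminal to run it. System-wide counting with -a typically needs root or a relaxed perf_event_paranoid setting.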
