(x64) Where can I find CPU instructions usage statistics in contemporary programs?

Question

I'm looking for some statistics which would tell me/show how frequently each instruction from x64 instruction set is used overall in modern programs. I have done some google searches, but I can't find any phrase that would give me anything else than "instruction performance statistics", so I'm asking if, by any chance, someone here knows of something like what I'm trying to find.

I'm trying to find info like this because I'm working on my own 64-bit CPU (as an interesting exercise, no other ambitions, so don't worry), and beyond the obvious basic instructions that I know are necessary, I'm aware that x64 processors have a huge number of instructions, ranging from... say... exotic to downright (to me) absurdly weird operations. I would therefore like to know how often each one is used in actual programs, so that I can prioritize which ones to learn more about and possibly add to my own CPU, based on the assumption that the ones occurring most often in existing compiled code, even if they seem weird to me, are actually useful.

If nothing of that sort exists, could you at least point me to some kind of disassembler/analyzer which I can use myself, point it at a program/dll, and it would be able to show me instruction usage stats for it?

To count instruction mix, [How to characterize a workload by obtaining the instruction type breakdown?](https://stackoverflow.com/q/58243626) and [How do I determine the number of x86 machine instructions executed in a C program?](https://stackoverflow.com/q/54355631) — Peter Cordes, Aug 23 '22 at 14:36
Is that kind of information really useful to a project with such a limited scope? Many of the "non-basic" instructions come from various performance-enhancing extensions. Those seem entirely irrelevant if you're not *actually* trying to provide a competition to intel-style chips. — Joachim Sauer, Aug 23 '22 at 14:39
It might be more useful to look at instruction-mix on newer cleaner ISAs like AArch64, without x86's legacy baggage like `lahf`/`sahf` which used to be relevant for branching on FP compares before Pentium Pro, and is still occasionally useful for getting some FLAGS, but less useful than copying the whole FLAGS to a general-purpose register would be. Although to be fair, AArch64 is already a RISC, so you of course won't find programs using instructions that would have been useful but don't exist on AArch64. — Peter Cordes, Aug 23 '22 at 15:14
When you said exotic/weird, do you mean stuff like `cld` / `rep stosb` (memset in microcode)? Or like mov to/from segment registers, and `lgdt`? Or like AAM and other BCD instructions that were removed in 64-bit mode? Or SIMD stuff like `paddb`, or multi-threading (and sleep-state) stuff like `monitor`/`mwait`? Or VM instructions like `vmlaunch`? Or like `cqo` to sign-extend before `idiv`? x86-64 has a lot of old weird stuff, and many special-purpose new things. The old stuff doesn't get used if it's not fast (https://agner.org/optimize/ / https://uops.info/) — Peter Cordes, Aug 23 '22 at 15:20
@JoachimSauer since, as I said, I'm doing this "as an interesting exercise", meaning LEARNING about CPUs and assembly/machine languages is the point, and I want to learn more about this stuff, then YES, that kind of information is really useful. Necessary, even, for the point of the project, since, again, the point of the project is me learning more about this stuff. — sh code, Aug 24 '22 at 04:13
@PeterCordes when i said exotic/weird, i meant stuff like INVPCID / MINSD / MOVNTDQ (wait, this one actually is SIMD op, right?) — sh code, Aug 24 '22 at 04:20
@PeterCordes thanks for the links, and AArch64 tip, i'll look into it — sh code, Aug 24 '22 at 04:21
Scalar FP math in XMM regs exists because x87 was clunky and not a good compiler target. NT stores exist for performance, especially of writing to video RAM, but also for avoiding cache pollution and RFO (MESI Read For Ownership) in normal memory regions. INVPCID exists as part of process-context IDs to avoid TLB invalidation when changing page tables (CR3) in some cases, like between a small set of processes. These are all things you'd find on other ISAs. (Other ISAs wouldn't have x87 FP, they'd just have scalar FP math in the same regs they use for SIMD.) — Peter Cordes, Aug 24 '22 at 04:50

score 0 · Answer 1 · answered Jun 22 '23 at 04:11

One way to gather such information is to get a selection of relevant example programs, then compile them with options that produce an assembly listing with bytes and mnemonics. Unlike disassembling a finished binary, this cannot produce invalid results from decoding byte sequences at a wrong starting address.

From such code (no linking required unless doing full program optimization) the set of instructions actually used (but not the execution frequency of those instructions) can be parsed out and studied.
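As a rough illustration of the parsing step, here is a minimal Python sketch that tallies mnemonics in compiler-generated assembly text. It assumes GNU assembler AT&T syntax (labels end in `:`, directives start with `.`, comments start with `#`); the function name `count_mnemonics`, the `PREFIXES` set, and the `SAMPLE` listing are made up for this example, and a real listing with bytes and line numbers would need those columns stripped first.

```python
from collections import Counter

# Prefixes that are written as a separate word before the real mnemonic.
PREFIXES = {"rep", "repe", "repz", "repne", "repnz", "lock"}


def count_mnemonics(asm_text):
    """Static count: every instruction in the text is counted once."""
    counts = Counter()
    for raw in asm_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop '#' comments
        tokens = line.split()
        # Drop leading label definitions such as "sum:" or ".L3:".
        while tokens and tokens[0].endswith(":"):
            tokens = tokens[1:]
        if not tokens or tokens[0].startswith("."):
            continue  # blank line, label-only line, or assembler directive
        mnemonic = tokens[0].lower()
        if mnemonic in PREFIXES and len(tokens) > 1:
            mnemonic += " " + tokens[1].lower()  # e.g. "rep stosb"
        counts[mnemonic] += 1
    return counts


# Hypothetical hand-written sample in the style of compiler output.
SAMPLE = """
    .text
    .globl  sum
sum:
    xorl    %eax, %eax
    testl   %esi, %esi
    jle     .L1
.L3:
    addl    (%rdi), %eax
    addq    $4, %rdi
    subl    $1, %esi
    jne     .L3
.L1:
    ret
"""

for mnemonic, n in count_mnemonics(SAMPLE).most_common():
    print(f"{mnemonic:8} {n}")
```

Note that AT&T operand-size suffixes make `addl` and `addq` separate entries; whether to merge them depends on what you want the statistics for.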

For example, if an `if()` branch that executes only once in a billion runs uses an instruction, that instruction counts as just as important to implement as one in an inner loop executed a billion times per second for 23 hours a day, because both pieces of code are needed to make the examined program work.
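To make the difference between "present in the code" and "executed often" concrete, here is a small Python sketch. The block contents and execution counts in `BLOCKS` are invented for illustration; real execution counts would have to come from a profiler or instrumentation tool.

```python
from collections import Counter


def static_and_dynamic(blocks):
    """blocks: iterable of (mnemonics, times_executed) pairs.

    Returns two Counters: one that counts each instruction once per
    appearance in the code, and one weighted by how often it ran.
    """
    static, dynamic = Counter(), Counter()
    for mnemonics, times_executed in blocks:
        for m in mnemonics:
            static[m] += 1
            dynamic[m] += times_executed
    return static, dynamic


# Hypothetical program: one hot loop and one rarely taken branch.
BLOCKS = [
    (["mov", "add", "cmp", "jne"], 1_000_000),  # inner loop body
    (["bsr", "mov", "call"], 1),                # once-in-a-while branch
]

static, dynamic = static_and_dynamic(BLOCKS)
print("static :", dict(static))
print("dynamic:", dict(dynamic))
```

In the first tally `bsr` and `add` look equally important (one appearance each), while the weighted tally shows `add` running a million times for every `bsr`. Both views answer different questions: which instructions must exist at all, and which ones must be fast.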

Another thing to observe in x86 programs is that some instructions are (preferably smaller or faster) replacements for longer instruction sequences: things like `inc reg` instead of `add reg,1`; `inc [memvar]` instead of a load, inc, store sequence; the entire string instruction family; `enter`/`leave`; `pusha`/`popa`; etc. For your own CISC designs, you may want to choose other sequences to optimize, within the limitations of what current compiler architectures can use.

Note that some seemingly exotic instructions such as `LAHF`/`SAHF` were originally created to ease mechanical translation of 8080 code, not for their later uses in things like floating point.

Counting each asm instruction once is the *static* count. The alternative is the *dynamic* count, where instructions inside a loop are counted for each time the loop executes, but error-handling code for errors that don't happen in normal runs isn't counted at all. In terms of what a CPU architect wants to optimize CPUs to do efficiently, weighting by the dynamic counts in real-world programs is usually what you want. See my first comment under the question for tools to get dynamic instruction counts. — Peter Cordes, Jun 22 '23 at 04:50
Sorry, I missed that the particular comment was your own. For deciding which instructions to include in a design at all, statistics of usage/presence in program code (even if rarely executed) are useful, but for designing a faster CPU, statistics of usage/execution are a key measure for getting the most compute power per second or per Watt. A very rarely executed instruction may be relegated to calling a slow emulation handler placed in an OS or runtime library, if such emulation can be done with the other instructions at all. — jb_dk, Jun 22 '23 at 07:41
