Most frequently used instructions

Question

Every now and then I read somewhere that of all of the instruction a CPU has only very few are used most of the time, last time it was here where the author writes: "There are only a handful of different instructions that account for 90% of all operations executed".

What are the most frequently used instructions? I'm most interested in x86-64 and arm64.

I'm just guessing here, but instructions that deal with moving values between registers and the stack as well as branching have the highest usage. So maybe it would include RET, CALL, MOV, PUSH, POP, JE, JNE, and LEA. But I am probably missing a few really obvious ones. — Locke, Nov 18 '22 at 07:07
@Locke: `add` and `cmp` will also be high up there in terms of instructions executed, probably also `sub` and `jmp`. A lot of programs spend a lot of time branching as well as looping. You can check with a simulator or trace tool that can total up instructions by mnemonic, such as Intel's SDE. [How to characterize a workload by obtaining the instruction type breakdown?](https://stackoverflow.com/q/58243626) / [Scan binary for CPU feature usage](https://stackoverflow.com/a/72469653) (that quotes a different section of the SDE output, by extension not mnemonic) — Peter Cordes, Nov 18 '22 at 09:26
[libsvm compiled with AVX vs no AVX](https://stackoverflow.com/q/54689918) quotes an SDE log file for instruction mix for some random program. — Peter Cordes, Nov 18 '22 at 09:28
You can easily work this out for yourself! Decompile your chosen program, write a little sed script to extract just the instruction name, and then use uniq -c to count them (I might do it for you myself later today if I have a spare half an hour). — Tom V, Nov 18 '22 at 10:13
@TomV: That would give you the *static* instruction count (by appearance in the file), not *dynamic* (how many times executed, counting loop bodies N times, counting unexecuted error handling block not at all). It would probably be fairly similar for most programs, but anything with a bunch of configuration parsing and then a long computation phase might have differences. — Peter Cordes, Nov 18 '22 at 11:11
Yes you are correct. Counting how often they are run requires using a profiler, and is much more involved. — Tom V, Nov 18 '22 at 16:43

score 2 · Answer 1 · answered Nov 18 '22 at 10:35

This is going to vary by operating system, compiler, compiler version, programming language, and possibly architecture extensions of the target you're looking at. It further depends on where you draw the line between what is or is not "the same" instruction.

On arm64, if you consider movz w0, 0 and movz w0, 1 to be different, then you can just take any binary, chunk it up into 4-byte blocks, print in hexadecimal and run that through sort | uniq -c.
If you consider all movz to be the same instruction, then the question arises whether movz w0 and movz x0 are the same. And if that is a yes, then there are other instructions like add that have different forms where the third operand is either a register or an immediate. And there's things like ldr with pre- and post-indexing modes that introduce additiona effects. And on top of all of that, there's pseudo-instructions like mov that can be encoded as either movz, movn or orr, depending on what the operands are.

But let's look at a concrete example: the /bin/bash binary on macOS 12.6.1.
The __TEXT.__text section of its arm64e slice has 118587 instructions. If we disassemble the entirety of it and try to normalise it by replacing all immediates with ..., all general-purpose registers as well as (w|x)zr and (w)sp with rN and all floating-point registers with fN with this command:

perl -pe 's/b\.\w{2}/b.cond/g;s/\b(x[0-9]+|sp|xzr)\b/rN/g;s/\bw([0-9]+|sp|zr)/rN/g;s/\b[dq][0-9]+/fN/g;s/-?0x[0-9a-f]+/.../g;s/, -?\d+/, .../g;s/(lsl|lsr|asr|uxtw|sxtw) (\d+|0x[0-9a-f]+)/.../g;s/, \.\.\.\]/]/g'

And then sort the result by frequency with this:

sort | uniq -c | perl -pe 's/^\s+//g' | sort -V

And then choose an arbitrary cutoff at the ret instruction, we get:

265 ret
268 brk ...
269 autibsp
271 eor rN, rN, rN, ...
284 sub rN, rN, rN
299 cset rN, eq
312 cmn rN, ...
322 ldur rN, [rN]
332 add rN, rN, rN, ...
363 csel rN, rN, rN, eq
392 orr rN, rN, ...
432 paciza rN
450 ldr rN, [rN, rN]
476 strb rN, [rN]
507 ldp fN, fN, [rN]
578 sxtw rN, rN
745 and rN, rN, ...
780 tbnz rN, ..., ...
805 tbz rN, ..., ...
866 stp rN, rN, [rN]!
871 add rN, rN, rN
978 ldp rN, rN, [rN], ...
1028 invalid
1141 stp fN, fN, [rN]
1170 retab
1290 pacibsp
1457 sub rN, rN, ...
1520 cbnz rN, ...
1567 cmp rN, rN
1622 ldrb rN, [rN]
1695 adrp rN, ...
2548 ldr rN, ...
3504 stp rN, rN, [rN]
3537 ldp rN, rN, [rN]
3996 adr rN, ...
4644 cmp rN, ...
4664 add rN, rN, ...
4728 cbz rN, ...
4966 b ...
5294 str rN, [rN]
5601 b.cond ...
7451 mov rN, ...
7691 nop
8270 ldr rN, [rN]
10181 bl ...
11585 mov rN, rN

This amounts to 112015 instructions, or 94.5% of the total. A few things stand out here:

This is an "arm64e" binary, which is Apple's name for their ABI with pointer authentication. On a regular arm64 target, all retab would be ret, and pacibsp, autibsp and paciza would not be present at all.
Almost all occurrences of brk are following an autibsp, which is a compiler change Apple was forced to make in order to address a structural weakness of the FEAT_PAuth architecture extension. The remaining occurrences are very likely invocations of __builtin_trap() in the source.
nop is very common. This is due to PC-relative addressing, where the compiler emits two or three instructions like adrp x0, ...; add x0, x0, ..., which can then be optimised at link time to adr x0, ...; nop if the target address is close enough to the instruction. It is debatable whether nops "run" though, or whether they are just deleted from the instruction pipeline.
invalid is very common. These are jump tables found after the functions that use them. They are data, not code, and thus never actually run.
Floating-point registers only show up in two instructions: stp and ldp. The compiler commonly uses floating-point registers for memcpy and memset operations, for one because these registers can hold up to 128 bits, and for another because it avoids having to spill general-purpose registers, which are a lot more likely to have already been allocated than floating-point registers, at least in command-line binaries like bash. I would expect similar conditions in operating system kernels and embedded firmwares, but in graphics-related code or machine learning, this may look very different.
The vast majority of cmn instructions check against the value -1. This is exceedingly likely a result of API design in C, and not inherent to the architecture.

And now it depends whether you wanna reduce some of these instructions or not. But by and large, I think the pattern is clear:

Register moves
Immediate construction
Function calls
Unconditional branches
Conditional branches
Register compares
Memory loads
Memory stores

make up for the vast majority of instructions, and it makes sense, because with these building blocks, you can do essentially anything. Almost all other instructions are some kind of optimisation on those, which are more niche and thus occur less often. The remaining instructions are architecture-specific and expose things you couldn't otherwise get, like brk or msr, but these are going to be even rarer. And this is likely to be the case no matter what architecture you look at.

You're looking at *static* instruction counts, how often they appear in the binary. (Including the NOPs between functions that never execute even once. NOPs for alignment before a loop do execute as the CPU enters the loop, but the instructions inside the loop execute N times more, where N is the average trip count. The question asked for **dynamic** instruction counts, in terms of how often they're actually executed. Other than NOPs (and of course `invalid`!!), the mix might be fairly similar to static, although error-handling code runs rarely and might have a somewhat different mix. — Peter Cordes, Nov 18 '22 at 11:08
@PeterCordes I don't think OP wants dynamic instruction counts. They asked for the most "used" instructions. I take it to mean the most "written"/"generated" instructions. It makes no sense to me to say that `pause` is one of the most used instructions just because it's in a loop that runs billions of times... — Margaret Bloom, Nov 18 '22 at 13:27
@MargaretBloom: The question says they're trying to confirm the statement "There are only a handful of different instructions that account for 90% of all operations **executed**". It's a statement about making the common case fast for CPU architects, I think, probably a lead-up to some RISC philosophising. (I certainly hope `pause` isn't one of the most executed instructions, like in the top 10. That would be a horrible amount of contention spin-waiting vs. useful work, badly architected multi-threading with huge contention and lack of fallback to sleep instead of spinning.) — Peter Cordes, Nov 18 '22 at 13:40

Most frequently used instructions

1 Answers1