3

My objective is to do a comparative study of a few instruction set architectures (ISAs).
For each ISA, how can I find the most commonly used instructions?

These are the steps I am thinking of:

  • Find common ISAs for a chosen domain
  • Find popular programs for each such ISA
  • Disassemble the program instructions (.code) (which tool?)
  • Collect statistics on instruction format, opcode, type. (which tool?)

Here is a very good study on x86 machine code statistics: https://www.strchr.com/x86_machine_code_statistics

I tried the command below for disassembling, but it does not seem to disassemble properly. The disassembled code shows some das instructions, which should not be present in actual code.

ndisasm -b32 -a $(which which)
Peter Cordes
wolfram77
  • A disassembler does not know whether something is data or code. They may even be the same – data can be treated as code, code as data (the von Neumann architecture as implemented in all current CPUs). That is why you cannot point a generic disassembler to a random piece of an executable and say, "disassemble this!" You *can* but, as you found, it *will* disassemble whatever you are pointing it to. – Jongware Feb 20 '20 at 18:30
  • Ah, that makes sense, but thankfully `objdump` recognizes ELF files. `objdump --disassemble $(which ls) > ls.log` seems to do the right disassembly. – wolfram77 Feb 20 '20 at 18:48
  • 1
    You cannot be 100% sure about that; there may still be static data – and even unused code! – inside executable sections. – Jongware Feb 20 '20 at 18:52
  • Then I guess an appropriate way to know which instructions are being used would be to run the program in a debugger and somehow have it print the executed instructions to a file. What do you think? – wolfram77 Feb 20 '20 at 19:02
  • 1
    @wolfram77: I think there's a major difference between "generated by a compiler most often" and "executed by a CPU most often"; and you'll need to figure out which is better for your purposes. – Brendan Feb 20 '20 at 21:03
  • 2
    @usr2564301: Plain non-obfuscated compiler-generated x86 executables do disassemble easily. x86 compilers don't mix code and data; unlike ARM there's no benefit to literal pools near code (between functions) so compilers don't do it. Of course you have to use a disassembler like `objdump` or `objconv` that knows about ELF metadata, which `ndisasm` does not! ndisasm treats everything as a flat binary, including metadata and .data and .rodata – Peter Cordes Feb 21 '20 at 00:55
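
As a hedged illustration of the point about ELF metadata: `objdump -h` prints the section headers, and sections flagged CODE are the ones expected to hold instructions. The helper name `list_code_sections` is hypothetical, and the sketch assumes GNU objdump's usual layout of one header row followed by one flags row per section.

```shell
# list_code_sections (hypothetical helper): read `objdump -h` output on stdin
# and print the names of sections whose flags include CODE.
list_code_sections() {
    awk '
        # Header rows start with the section index, then the section name.
        $1 ~ /^[0-9]+$/ { name = $2; next }
        # The next row lists the flags, e.g. "CONTENTS, ALLOC, LOAD, READONLY, CODE".
        name != "" && /(^|[ ,])CODE([ ,]|$)/ { print name }
        # Any other row ends the current section entry.
        { name = "" }
    '
}

# Usage (assumes GNU binutils is installed):
#   objdump -h "$(command -v ls)" | list_code_sections
```

`objdump -d` restricts itself to sections like these, whereas `-D` disassembles every section, and a flat-binary tool such as `ndisasm` cannot make the distinction at all.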

1 Answer

4

You can try this to gather mnemonics from the .text section:

objdump --no-show-raw-insn \
        -M intel           \
        -sDj .text $(which *program name*) | # <-- disassemble .text section
             sed -n '/<\.text>/, $ p'      | # <-- skip raw hex
             awk '{$1 = ""; print}'        | # <-- remove offsets
             sed '1d'                        # <-- delete annoying <.text> in first line

After that, you can either extract only the mnemonic names by appending awk '{print $1}' to the previous command, or process the output in some other way.

Finally, append sort | uniq -c to the previous steps. My resulting command looked like this:

objdump --no-show-raw-insn \
        -M intel           \
        -sDj .text $(which *program name*) | 
             sed -n '/<\.text>/, $ p'      | 
             awk '{$1 = ""; print}'        |
             sed '1d'                      |
             awk '{print $1}' | sort | uniq -c

This prints the frequency of every mnemonic in the program's .text section.
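
A possible variant, offered as a sketch and not as the command used above: the `<.text>` label that the first `sed` looks for generally appears only when the binary has no symbols, so the filter below keys on the shape of instruction lines instead. It assumes GNU objdump run with `-d --no-show-raw-insn`, where each instruction line is an address, a colon, a tab, and then the instruction. The function name `count_mnemonics` is hypothetical.

```shell
# count_mnemonics (hypothetical helper): read `objdump -d --no-show-raw-insn`
# output on stdin and print "count mnemonic" pairs, most frequent first.
count_mnemonics() {
    # Instruction lines look like "    1139:<TAB>push   rbp". Section headers,
    # symbol labels, blank lines and the "..." zero-fill marker do not have a
    # hex address followed by a colon in the first tab-separated field.
    awk -F'\t' '$1 ~ /^ *[0-9a-f]+:$/ && NF >= 2 {
        split($2, words, " ")          # first word of the instruction text
        if (words[1] != "") print words[1]
    }' | sort | uniq -c | sort -rn
}

# Usage (any ELF program on PATH would do; ls is only an example):
#   objdump -d --no-show-raw-insn -M intel -j .text "$(command -v ls)" | count_mnemonics
```

Prefixes such as `rep` or `lock` are printed before the instruction they modify, so this sketch counts the prefix and not the instruction behind it; the original pipeline behaves the same way.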

  • Thanks for posting a nice solution. I removed the first `sed`, otherwise I don't get anything. Also, the contents of `.text` are not needed, so `-s` is not needed either. Now I do get a listing of instructions with their counts (along with some extras). – wolfram77 Feb 20 '20 at 19:06
  • 1
    @wolfram77: Note that this answer gives you the *static* instruction count. It counts every instruction once, whether it's inside a tight loop or in error-handling code that never executes at all in normal operation. More often you want the *dynamic* instruction count, e.g. with `sde64 -mix`: [How to characterize a workload by obtaining the instruction type breakdown?](//stackoverflow.com/q/58243626) and [How do I determine the number of x86 machine instructions executed in a C program?](//stackoverflow.com/q/54355631) – Peter Cordes Feb 21 '20 at 01:00
  • That looks very promising. The `sde-mix-out.txt` for `ls` lists a bunch of opcode types and their counts, and there are several of those blocks. When I repeat the run with `ls` on the same directory, all these counts seem to match. When I do a diff, the only changes I see are most likely due to thread IDs and memory addresses changing. Thanks, I will look into it further in order to understand the output. – wolfram77 Feb 21 '20 at 10:28
  • Maybe it would only be possible to do static instruction usage statistics for another architecture, like Atmel AVR, or maybe run a few example programs on a simulator. – wolfram77 Feb 21 '20 at 11:01
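
For the static route on another architecture, the counting pipeline from the answer should carry over if a matching cross-binutils `objdump` is installed (for AVR this is usually packaged as `avr-objdump`, which is an assumption about the local setup), with the x86-only `-M intel` option dropped. Raw counts from binaries of different sizes are hard to compare across ISAs, so the sketch below converts `uniq -c` style output into percentage shares. The function name `to_percent` is hypothetical.

```shell
# to_percent (hypothetical helper): read "count mnemonic" lines, as produced
# by `sort | uniq -c`, and print each mnemonic's share of the total,
# largest share first.
to_percent() {
    awk '
        { count[$2] += $1; total += $1 }
        END {
            for (m in count)
                printf "%6.2f%% %s\n", 100 * count[m] / total, m
        }
    ' | sort -rn
}

# Usage: append it to the end of the counting pipeline, for example
#   ... | awk "{print \$1}" | sort | uniq -c | to_percent
```

The shares describe how often a compiler emitted each instruction, not how often the CPU executes it, so they remain a static measure.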