
In the homework for day one of Xeno Kovah's Introduction to x86 Assembly, hosted on OpenSecurityTraining, he assigns:

Instructions we now know (24)

NOP PUSH/POP CALL/RET MOV/LEA ADD/SUB JMP/Jcc CMP/TEST AND/OR/XOR/NOT SHR/SHL IMUL/DIV REP STOS, REP MOV LEAVE

Write a program to find an instruction we havenʼt covered, and report the instruction tomorrow.

He further qualifies the assignment with:

  • Instructions to be covered later which donʼt count: SAL/SAR
  • Variations on jumps or the MUL/IDIV variants of IMUL/DIV also don't count
  • Additional off-limits instructions: anything floating point (since we're not covering those in this class.)
  • He says in the video that you cannot use inline assembly (mentioned when asked).

Rather than objdumping random executables, auditing them, and then creating the source, is it possible to find the list of x86 assembly instructions that GCC currently outputs?

The foundation for this question seems to be that there is a very small subset of instructions actually used that one needs to know to reverse engineer (which is the focus of the course). Xeno seems to be trying to find a fun, instructive way to make that point:

I think that knowing about 20-30 (not counting variations) is good enough that you will have to check the manual very infrequently

While I welcome everyone to join me in this awesome class at OpenSecurityTraining, the question is about my proposed method of figuring it out from GCC (if possible), not a request for people to actually do Xeno's assignment. ;)

Matteo Italia
Evan Carroll
  • Very interesting question! – fuz Feb 27 '18 at 19:59
  • So, objdump, and make a library/hash of what you find in the text files? Compare vs x86 master docs at intel and see what is missed. Objdump output is normal enough that you should be able to string parse pretty easily. – Michael Dorgan Feb 27 '18 at 20:00
  • *Write a program to find an instruction we havenʼt covered*... what does that exactly mean, to "find" an instruction? You wouldn't really "write a program to find an instruction". Do they mean "find an instruction we haven't covered" and write a program "using" that instruction? In that case, just write it in assembly language. Trying to find a C program that generates such an instruction using `gcc` doesn't sound productive. – lurker Feb 27 '18 at 20:02
  • @lurker yes, that's what it means. Writing in assembly language is outside of the purview of the assignment. It's a class focused on reverse engineering; please see my update. – Evan Carroll Feb 27 '18 at 20:03
  • I see. Maybe you could list the instructions you've covered, then someone here could come up with a C construct that `gcc` would generate some other instruction from. – lurker Feb 27 '18 at 20:05
  • @lurker I can *easily* do the task as assigned. I need help to go about the clever way -- I want to know what instructions GCC can output. Not go looking for which ones it does output. – Evan Carroll Feb 27 '18 at 20:07
  • `xlat` or possibly `hlt` come to mind as instructions that are rare, I would think. `aaa` would also be rare nowadays, as that is left over from 70s-style work. – Michael Dorgan Feb 27 '18 at 20:07
  • @MichaelDorgan Though `sahf` and `lahf` are generated as well as things like `jp`, `cbw`, and `xchg r,r` are not entirely off the cards. – fuz Feb 27 '18 at 20:14
  • Not sure whether I understand correctly. If you want to find all possible instructions, just check the architecture manual, like [x86](http://www.felixcloutier.com/x86/). – llllllllll Feb 27 '18 at 20:26
  • Interesting that you group `lea` with `mov`, instead of with `add/sub` and shifts. [It's an ALU shift-and-add instruction](https://stackoverflow.com/questions/46597055/address-computation-instruction-leaq/46597375#46597375), except for the special case of using it to read `RIP`. BTW, you mean `rep movs`; `rep mov` doesn't exist. – Peter Cordes Feb 27 '18 at 20:48
  • @liliscent OP wants to know which of these are actually ever generated by gcc. – fuz Feb 28 '18 at 10:38

2 Answers


The foundation for this question seems to be that there is a very small subset of instructions actually used that one needs to know to reverse engineer

Yes, that's generally true. There are some instructions gcc will never emit, like enter (because it's much slower than push rbp / mov rbp, rsp / sub rsp, some_constant on modern CPUs).

Other old / obscure stuff like xlat and loop will also be unused because they aren't faster, and gcc's -Os doesn't go all-out optimizing for size without caring about performance. (clang -Oz is more aggressive, but IDK if anyone's bothered to teach it about the loop instruction.)

And of course gcc will never emit privileged instructions like wrmsr. There are intrinsics (__builtin_... functions) for some unprivileged instructions like rdtsc or cpuid which aren't "normal".


is it possible to find the list of x86 assembly instructions that GCC currently outputs?

This would be the gcc machine-definition files. GCC, as a portable compiler, has its own text-based language for machine-definition files which describe the instruction set to the compiler. (What each instruction does, what addressing modes it can use, and some kind of "cost" the optimizer can minimize.)

See the gcc-internals documentation for them.


The other approach to this question would be to look at an x86 instruction reference manual (e.g. this HTML extract, and see other links in the tag wiki) and look for ones you haven't seen yet. Then write a function where gcc would find it useful.

e.g. if you haven't seen movsx (sign extension) yet, then write

long long foo(int x) { return x; }

and gcc -O3 will emit (from the Godbolt compiler explorer)

    movsx   rax, edi
    ret

Or to get cdqe (aka cltq in AT&T syntax) for sign-extension within rax, force gcc to do math before sign extending, so it can produce the result in eax first (with a copy-and-add lea).

long long bar(unsigned x) { return (int)(x+1); }

    lea     eax, [rdi+1]
    cdqe
    ret

   # clang chooses inc edi  /  movsxd rax, edi
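
Zero extension can be provoked the same way. As a sketch (the function name `baz` is made up for illustration, and the exact output can vary with gcc version and options), widening a narrow unsigned type typically yields movzx:

```c
/* Hypothetical example: widening an unsigned 8-bit value to 64 bits.
   gcc -O3 for x86-64 typically emits:  movzx eax, dil  /  ret
   (writing eax implicitly zeroes the upper half of rax). */
unsigned long long baz(unsigned char x) { return x; }
```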

See also Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”, and How to remove "noise" from GCC/clang assembly output?.


Getting gcc to emit rotate instructions is interesting; see Best practices for circular shift (rotate) operations in C++. You write it as shifts/OR that gcc can recognize as a rotate.
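
A minimal sketch of that shifts/OR idiom (the name `rotl32` is only illustrative; masking both counts keeps every shift in range, so there is no undefined behaviour even when the count is 0 or 32 or more):

```c
#include <stdint.h>

/* Rotate x left by n bits, written as two shifts and an OR.
   Both shift counts are masked to 0..31, so neither shift is ever
   out of range. gcc and clang typically recognize this pattern and
   emit a single rol instruction at -O2 on x86. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> (-n & 31));
}
```

Looking at the optimized output for a function that calls this with a variable count is an easy way to see rol show up without any inline assembly.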

Because C doesn't provide standard functions for lots of things modern CPUs can do (rotate, popcnt, count leading / trailing zeros), the only portable thing is to write an equivalent function and have the compiler recognize that pattern. gcc and clang can optimize a whole loop into a single popcnt instruction when compiling with -mpopcnt (enabled by -march=haswell, for example), if you're lucky. If not, you get a stupid slow loop. The reliable non-portable way is to use __builtin_popcount(), which compiles to a popcnt instruction if the target supports it, otherwise a table lookup. _mm_popcnt_u64 is popcnt or nothing: it doesn't compile if the target doesn't support the instruction.
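
To make the two options concrete, here is a sketch with made-up function names. Whether the loop version actually collapses to one popcnt depends on the compiler version and on -mpopcnt, so treat that as something to check on Godbolt and not as a guarantee:

```c
/* Portable version: clear the lowest set bit until none remain.
   Some gcc/clang versions recognize this loop and, with -mpopcnt,
   replace it with a single popcnt; otherwise it stays a loop. */
unsigned popcount_loop(unsigned long long x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Non-portable but reliable: the GNU C builtin (gcc and clang).
   Compiles to popcnt when the target supports it, otherwise to a
   fallback sequence or library call. */
unsigned popcount_builtin(unsigned long long x)
{
    return (unsigned)__builtin_popcountll(x);
}
```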


Of course, the catch-22 with this approach is that it only works if you already know the x86 instruction set and when any given instruction is the right choice for an optimizing compiler!

(And what gcc chooses to do, e.g. inline string compares to rep cmpsb in some cases for short strings, although I'm not sure this is optimal. Only rep movs / rep stos have "fast strings" support on modern CPUs. But I don't think gcc will ever use lods, or any of the "string" instructions without a rep prefix.)

Peter Cordes

Rather than objdumping random executables, auditing them, and then creating the source, is it possible to find the list of x86 assembly instructions that GCC currently outputs?

You can look at the machine description files that gcc uses. In its source tree, look under gcc/config/i386 and have a look at the .md files. The core one for x86 is i386.md; there are others for the various extensions to x86 (and possibly containing heuristics tunings to use when optimizing for different processors).

Be warned: it's definitely not an easy read.

I think that knowing about 20-30 (not counting variations) is good enough that you will have to check the manual very infrequently

It's quite true; in my experience doing reverse engineering, 99% of code is always the same stuff, instruction-wise; what is more useful than knowing the entire x86 instruction set is to get familiar with the assembly idioms, especially those frequently emitted by compilers.


That being said, off the top of my head, some very common instructions missing from that list (emitted quite often and without enabling extended instruction sets) are:

  • movzx/movsx
  • inc/dec (rare with gcc, common with VC++)
  • neg
  • cdq (before idiv)
  • jcxz/jecxz (rare with gcc, somewhat common with VC++)
  • setCC
  • cmpxchg (in synchronization code)
  • cmovCC
  • adc (when doing 64 bit arithmetic in 32 bit code)
  • int3 (often emitted on function boundaries and in general as a filler)
  • some other string instructions (scas/cmps), especially as canned sequences on older compilers
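
A few of these are easy to provoke from plain C. The functions below are illustrative names only, and the instructions in the comments are what gcc typically emits at -O2 on x86; the exact output depends on compiler version and flags:

```c
/* Comparison result used as a value: typically cmp + setl. */
int is_less(int a, int b) { return a < b; }

/* Branchless select: typically cmp + a cmovCC. */
int max_int(int a, int b) { return a > b ? a : b; }

/* Arithmetic negation: typically neg. */
int negate(int x) { return -x; }

/* Signed division: typically cdq (sign-extend eax into edx) + idiv. */
int quotient(int a, int b) { return a / b; }

/* 64-bit addition: in 32-bit code (-m32) typically an add / adc pair. */
long long add64(long long a, long long b) { return a + b; }
```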

And then there's the whole world of SSE & co...

Matteo Italia
  • gcc won't emit `inc` (except maybe with `-Os`). It always uses `add dst,1`, even for register destinations with `-march=skylake`, which should tell it that you don't care about Silvermont/KNL or Pentium 4 ([where `inc` is slower](https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter)), but gcc's tuning options aren't that well maintained. Ironically, `clang` uses `inc` with no tuning option, but uses `add reg,1` with `-march=skylake`. /facepalm. – Peter Cordes Feb 27 '18 at 20:52
  • I don't think gcc will ever emit `jecxz` / `jrcxz` either. It's not as slow as `loop`, but I don't think gcc knows how to optimize `adc` loops by branching without updating flags. (In general it only knows how to use `adc` well for `__int128` (or `int64_t` on 32-bit machines), not arbitrary precision) – Peter Cordes Feb 27 '18 at 20:54
  • I'm quite sure I've seen a good share of `inc` when debugging and reversing, but it must be said that (1) it wasn't always gcc code, and (2) many of them may have been `lock inc`... – Matteo Italia Feb 27 '18 at 20:55
  • Hmm, maybe it will use `inc` in some cases; I see some in the disassembly of `/bin/bash` on Arch Linux, and metadata indicates that it was compiled with gcc6.3.1, I think. IDK what options were used. – Peter Cordes Feb 27 '18 at 20:56
  • `jecxz` is stuff I discovered when doing reverse engineering over a VC++ 2010 (IIRC) binary; I distinctly remember seeing quite some of them, and I was indeed surprised because I never saw them in gcc code; `adc`: I see it all the time in my 32 bit binaries dealing with `int64_t`. – Matteo Italia Feb 27 '18 at 20:59
  • `int3`: MSVC pads between functions with `int3`, but gcc just emits the same `.p2align` directive as inside functions, so the padding is long NOPs. GNU `ld` also inserts `nop` padding when needed when linking an object file that ends at an unaligned location with another file that has an alignment requirement on the same section. gcc itself doesn't even know about instruction lengths of the code it's generating; it just uses labels and leaves everything up to the assembler. – Peter Cordes Feb 27 '18 at 21:28
  • TIL that `gcc` doesn't (easily) generate `inc`. I didn't totally believe it, but found no counter-examples in an admittedly quick search. – BeeOnRope Feb 28 '18 at 00:20
  • @BeeOnRope: GCC has been generally avoiding `inc` with `-mtune=generic` for a while now, because of Silvermont-family. IIRC it will `inc`/`dec` with `-mtune=znver2` or `-mtune=skylake` or pretty much anything except KNL or a `*mont` CPU. ([INC instruction vs ADD 1: Does it matter?](https://stackoverflow.com/q/36510095)). Now that mainstream Alder Lake has some Gracemont E-cores, that seems like it was probably not a bad decision, plus there are low-power servers and NAS, and low end laptops. IDK how much of a slowdown Gracemont would really have on `inc`. – Peter Cordes Nov 13 '22 at 07:23