How to determine x86 machine opcode values based on real mode offsets and addressing?

Question

I am trying to write raw machine code bytes as 0s and 1s into a text file, and execute it as-is through the BIOS.

However, I have trouble understanding how addressing, multiplication, offsets, operands, and instructions work in combination, e.g. the difference between MOV AL, 07 and MOV BL, AL.

It makes sense in assembly, but in machine code it becomes very difficult to see how the parameters are represented.

So what I want to know is this: How can I better understand this? There are no tutorials I've found that accurately explain/describe the 0s and 1s from instructions in combinatorial correlations or connections between data passing, MMIO, addressing modes, arithmetic, and the like.

This site, http://ref.x86asm.net/coder32.html#x00, tries to explain it, but I don't understand it.

EXAMPLE: Say I want to move 5 into AL ... would I specify the literal '5' in binary as part of the opcode, chained with the binary for the MOV/AL instruction, or would I have one fixed binary code for each instruction, regardless of value? That is what I want to know ... how to understand how machine code is written.

http://wiki.osdev.org/X86-64_Instruction_Encoding#ModR.2FM_and_SIB_bytes — Jens Björnhager, Sep 05 '13 at 21:56

score 5 · Answer 1 · edited May 19 '14 at 12:31

Unfortunately, x86 encoding is complex and irregular, and understanding it is hard work. The best "quick start" on the encoding is a set of HTML pages at sandpile.org (it's terse, but pretty thorough).

First: http://sandpile.org/x86/opc_enc.htm - the "instruction encodings" table shows the dozen or so ways in which instructions are coded. The white cells in each row represent the mandatory bytes in the instruction; the following grey cells are there (or not there) based on various fields appearing earlier in the opcode. You should look at the rows starting with a white "0Fh", as well as the first row. At the bottom of the same page are the bitfields appearing in various "extended" opcode fields - you can ignore all but the "modrm/sib" row (the first row).

Notice that for all but the first row (the 1-byte opcodes), a "mod r/m" byte must follow the opcode (for the 1-byte opcodes, it depends on the instruction). This byte encodes the arguments for most 2-argument instructions. The table at http://sandpile.org/x86/opc_rm.htm has the meanings: one of the arguments must be a register, the other argument can be a register or a memory indirection (the "reg" field encodes the register, the "mod" and "r/m" fields encode the other argument). There's usually also a "direction" bit elsewhere in the opcode indicating the order of the arguments. The opcode also indicates whether we're manipulating, e.g., AL, AX, EAX or RAX (i.e. different sizes), or one of the extended registers, which is why each 3-bit field is listed as referring to many different registers.

In modrm, if the "mod" bits are "11", then the "r/m" field also refers to a register. Otherwise it usually refers to a memory address constructed by adding the named register to an (optional) displacement appearing after the modrm byte (with 32-bit addressing this constant is 0, 1, or 4 bytes long depending on the "mod" bits; with 16-bit addressing, the default in real mode, it is 0, 1, or 2 bytes). The exception, with 32-bit addressing, is when the "r/m" bits are "100" (i.e. 0x4), which would usually name "SP" - in this case, the memory argument is described by an additional "sib" byte which immediately follows the modrm byte (any modrm displacement appears after the sib). 16-bit addressing has no sib byte. For the encoding of SIB, look at http://sandpile.org/x86/opc_sib.htm, or click through from the modrm page.
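To make the bitfields concrete, here is a minimal Python sketch that splits a modrm byte into "mod", "reg" and "r/m" and reports what follows it. It is not taken from sandpile.org: the names `split_modrm`, `describe_modrm` and `REG8` are made up for illustration, and it assumes 8-bit operands with 32-bit addressing (the case described above). It also does not look inside the sib byte, which can add a displacement of its own.

```python
# Illustrative sketch; the names below are hypothetical, not from any reference.
REG8 = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]  # 8-bit register numbers 0..7


def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bitfields."""
    mod = (byte >> 6) & 0b11    # bits 7-6
    reg = (byte >> 3) & 0b111   # bits 5-3
    rm = byte & 0b111           # bits 2-0
    return mod, reg, rm


def describe_modrm(byte):
    """Describe a ModR/M byte, assuming 8-bit operands and 32-bit addressing."""
    mod, reg, rm = split_modrm(byte)
    info = {"mod": mod, "reg": REG8[reg]}
    if mod == 0b11:
        # Register-direct form: r/m names a second register.
        info.update(rm=REG8[rm], needs_sib=False, disp_bytes=0)
    else:
        # Memory form: the displacement size is selected by mod.
        if mod == 0b00:
            disp = 4 if rm == 0b101 else 0  # mod=00, r/m=101 is a bare 4-byte address
        elif mod == 0b01:
            disp = 1
        else:
            disp = 4
        # r/m = 100 means a SIB byte follows the ModR/M byte.
        info.update(rm=None, needs_sib=(rm == 0b100), disp_bytes=disp)
    return info


print(describe_modrm(0xC3))  # register-to-register form
print(describe_modrm(0x44))  # memory form that needs a SIB byte
```

For example, 0xC3 is 11 000 011 in binary, so mod is 11, "reg" is 000 (AL) and "r/m" is 011 (BL).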

Finally, to understand where the direction and size come from, look at some opcodes: http://sandpile.org/x86/opc_1.htm. The first four entries are all "ADD", with the arguments in two different orders, and being of two different widths. So in this case, the bottom bits of the instruction are encoding the direction and width.
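The direction and width bits for those four ADD opcodes (00h to 03h) can be sketched in Python as follows. The helper names are hypothetical, and the bit layout shown (bit 0 = width, bit 1 = direction) is an assumption that holds for this family of opcodes and for MOV 88h to 8Bh, not for every x86 opcode.

```python
def direction_and_width(opcode):
    """Return (d, w) from the two low bits of a classic two-operand opcode."""
    w = opcode & 1          # bit 0: 0 = 8-bit operands, 1 = 16/32-bit operands
    d = (opcode >> 1) & 1   # bit 1: 0 = r/m is the destination, 1 = reg is the destination
    return d, w


def operand_form(mnemonic, opcode):
    """Render the operand order and size implied by the d and w bits."""
    d, w = direction_and_width(opcode)
    size = "16/32" if w else "8"
    reg, rm = "r" + size, "r/m" + size
    dst, src = (reg, rm) if d else (rm, reg)
    return f"{opcode:02X}: {mnemonic} {dst}, {src}"


for opcode in range(0x00, 0x04):
    print(operand_form("ADD", opcode))
```

The same two bits distinguish `MOV r/m8, r8` (88h) from `MOV r8, r/m8` (8Ah), which is why a register-to-register move can be encoded in two ways.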

One more thing: if you're C-literate, you might look at one of the (several) open source assemblers or disassemblers, which have all the opcode information organized in tables. For example, the (GPL'd) GNU binutils x86 table is in a file called "i386-opc.c" or "i386-opc.tbl" depending on the version (google the filename); one copy is here: https://github.com/adobe-flash/crossbridge/blob/master/gdb-7.3/opcodes/i386-opc.tbl — mike, Sep 15 '13 at 23:23
Those links you gave are monstrously complex, and I can't even begin to decipher how everything correlates with one instruction. I'd need to have to get an expert to help me step through the binary encoding. Thanks though. — Jump if not Equal, Dec 03 '13 at 22:56
you can see a way to decode the tbl here https://cygwin.com/ml/binutils/2010-09/msg00277.html the table is based on the header here https://github.com/arrogantpenguin/PenguinoOS/blob/73608ed45a03cd3b013303e60accc5dee473ec53/sources/binutils/opcodes/i386-opc.h#L660 — h4ck3rm1k3, May 23 '15 at 11:30
re: two ways to encode `mov al, bl` with either the `r/m` source or destination form - [x86 XOR opcode differences](https://stackoverflow.com/q/50336269) — Peter Cordes, Aug 22 '21 at 17:38

Carl Norum · Answer 2 · 2013-09-05T22:37:08.430

1

There is (mostly) a one-to-one mapping between assembler mnemonics and machine instructions. You can find these mappings in the Intel Software Developer's Manual, Volume 2, which contains the complete x86 16-, 32- and 64-bit instruction sets. You'll probably want to start with Chapter 2: Instruction Format, which describes the translations you're trying to come up with.

In the case of mov al, 5, it's just as you say: you put the literal there. The instruction in machine code is:

b0 05

That's the MOV r8, imm8 form of the MOV instruction: the opcode byte b0 selects AL, and the immediate byte 05 follows it. For mov bl, al, you'd want the MOV r/m8, r8 form, which in your case would encode to:

88 c3

You can look up the c3 in Table 2-2, "32-Bit Addressing Forms with the ModR/M Byte", where you'll see it at the intersection of the BL row and the AL column. (There's a 16-bit table too, if that's the mode you're in - the value in this case is the same.)
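As a rough check of the two encodings above, here is a small Python sketch that assembles them from their fields. The function names and the `REG8` table are invented for this example; only the opcodes (b0 plus the register number for MOV r8, imm8, and 88 for MOV r/m8, r8) and the register numbering come from the Intel manual.

```python
# 8-bit register numbers as used in the opcode and ModR/M fields.
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}


def mov_r8_imm8(dst, imm):
    """MOV r8, imm8: opcode is B0 plus the register number, then the immediate byte."""
    return bytes([0xB0 + REG8[dst], imm & 0xFF])


def mov_r8_r8(dst, src):
    """MOV r/m8, r8 (opcode 88) with a register destination.

    ModR/M: mod = 11 (register), reg = source, r/m = destination.
    """
    modrm = (0b11 << 6) | (REG8[src] << 3) | REG8[dst]
    return bytes([0x88, modrm])


print(mov_r8_imm8("al", 5).hex(" "))    # b0 05
print(mov_r8_r8("bl", "al").hex(" "))   # 88 c3
```

So the immediate value is part of the instruction bytes, while the opcode byte itself changes only with the destination register, not with the value.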

edited Sep 05 '13 at 22:37

answered Sep 05 '13 at 21:59


But machine code is in binary; you quoted hexadecimal. b0 is 176, or 10110000, and 05 is five, or 101. Should I enter the binary equivalents of those hex values directly, and would the encoding be the same? – Jump if not Equal Sep 09 '13 at 21:38
So what's your point? `10110000 00000101` and `10001000 11000011`, then, if it makes you feel better. – Carl Norum Sep 09 '13 at 21:39
Numbers are just numbers. What base you write them in is just a convenience. – Carl Norum Sep 09 '13 at 21:41
The CPU's firmware only supposedly parses binary though, so I was just wondering if writing directly in hex made all the difference. – Jump if not Equal Sep 09 '13 at 21:43
Hex is just notation. You have to put the right bytes in the right place. Whether you write a byte down as a 2 digit hex number or an 8 digit binary number doesn't matter to anyone except you. The computer parses the bytes you put in memory, which are just voltage levels on transistors. You can think of it as binary or hex or whatever you want, as long as the right transistors get the right values. – Carl Norum Sep 09 '13 at 22:11
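To illustrate the point made in these comments, here is a minimal Python sketch that turns a text string of 0s and 1s into raw bytes and compares the result with the same instructions written in hex. The helper name `bits_to_bytes` is hypothetical; writing the result to a file with `open(path, "wb")` is left out.

```python
def bits_to_bytes(text):
    """Convert a string of '0'/'1' characters (whitespace ignored) to raw bytes."""
    bits = "".join(text.split())
    if set(bits) - {"0", "1"}:
        raise ValueError("only 0, 1 and whitespace are allowed")
    if len(bits) % 8 != 0:
        raise ValueError("the number of bits must be a multiple of 8")
    # Take the bits eight at a time and parse each group as a base-2 number.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


from_binary = bits_to_bytes("10110000 00000101 10001000 11000011")
from_hex = bytes.fromhex("b0 05 88 c3")

# Both notations describe the same four bytes.
assert from_binary == from_hex
print(from_binary.hex(" "))
```

A text file containing the characters "0" and "1" is not itself machine code: each character is stored as a whole byte (0x30 or 0x31 in ASCII), so a conversion step like this is needed before the bytes can be executed.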
