
For the sake of example, imagine I am building a virtual machine. I have a byte array and a while loop. How do I know how many bytes to read from the byte array for the next instruction when interpreting an Intel-8086-like instruction set?

EDIT: (commented)

The CPU reads the opcode at the instruction pointer. With the 8086 and CISC in general, you have one-byte and two-byte instructions. How do I know if the next instruction is F or FF?

EDIT:

I found an answer myself in this piece of text from http://www.swansontec.com/sintel.html:

The operation code, or opcode, comes after any optional prefixes. The opcode tells the processor which instruction to execute. In addition, opcodes contain bit fields describing the size and type of operands to expect. The NOT instruction, for example, has the opcode 1111011w. In this opcode, the w bit determines whether the operand is a byte or a word. The OR instruction has the opcode 000010dw. In this opcode, the d bit determines which operands are the source and destination, and the w bit determines the size again. Some instructions have several different opcodes. For example, when OR is used with the accumulator register (AX or EAX) and a constant, it has the special space-saving opcode 0000110w, which eliminates the need for a separate ModR/M byte. From a size-coding perspective, memorizing exact opcode bits is not necessary. Having a general idea of what type of opcodes are available for a particular instruction is more important.
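As a rough illustration of the bit fields the quoted text describes, here is a minimal Python sketch that pulls the d and w bits out of an OR opcode of the form 000010dw. The function name `split_or_opcode` is made up for this example, and the sketch only looks at the opcode byte itself; it does not decode prefixes or the ModR/M byte that follows.

```python
def split_or_opcode(opcode):
    """Return (d, w) for an OR opcode of the form 000010dw.

    Hypothetical helper for illustration only; it handles just the
    opcode pattern quoted above.
    """
    # The top six bits must match the fixed part of the pattern, 000010.
    if opcode & 0b11111100 != 0b00001000:
        raise ValueError("not an OR opcode of the form 000010dw")
    d = (opcode >> 1) & 1  # direction bit: which operand is the destination
    w = opcode & 1         # width bit: 0 = byte operand, 1 = word operand
    return d, w


print(split_or_opcode(0x09))  # 00001001
print(split_or_opcode(0x0A))  # 00001010
```

An interpreter loop could use the w bit in the same way to decide whether an immediate operand is one byte or two.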

Community
Ashley Meah
  • What do you mean by "to interrupt an instruction"? Do you know what the word "interrupt" means? As to the original question: the CPU knows how long an instruction is because that information is either burnt into its hardware or programmed into its microcode. – The Paramagnetic Croissant Aug 03 '14 at 05:42
  • The CPU reads the opcode at the instruction pointer. With the 8086 and CISC you have one-byte and two-byte instructions. How do I know if the next instruction is F or FF? EDIT: sorry, I meant "interpret". – Ashley Meah Aug 03 '14 at 05:49
  • It knows because every instruction has a fixed length. If the instruction is a long jump, it knows that it has to read the next 4 bytes in addition. If it's an add, it knows it has to read the next 1 byte only, etc. – The Paramagnetic Croissant Aug 03 '14 at 06:01
  • Every instruction is not a fixed length? How do I know if the next instruction is F or FF? EDIT: Edited the question, found an answer. – Ashley Meah Aug 03 '14 at 06:38
  • Sorry, I can't explain it better than that. I didn't write that every instruction has the same length. I wrote that they have a fixed length. E.g. a long jump is always 5 bytes long, an add is maybe 2 bytes long, etc. One can know from the opcode how many additional bytes one should read. – The Paramagnetic Croissant Aug 03 '14 at 06:43
  • possible duplicate of [Get size of assembly instructions](http://stackoverflow.com/questions/23788236/get-size-of-assembly-instructions) – Alexis Wilke Aug 03 '14 at 07:08
  • [With variable length instructions how does the computer know the length of the instruction being fetched?](https://stackoverflow.com/q/24269368/995714), [Instruction decoding when instructions are length-variable](https://stackoverflow.com/q/8204086/995714) – phuclv Sep 20 '18 at 01:19
  • Possible duplicate of [Instruction decoding when instructions are length-variable](https://stackoverflow.com/questions/8204086/instruction-decoding-when-instructions-are-length-variable) – phuclv Sep 20 '18 at 01:19
  • Does this answer your question? [How does the CPU know how many bytes it should read for the next instruction, considering instructions have different lenghts?](https://stackoverflow.com/questions/56385995/how-does-the-cpu-know-how-many-bytes-it-should-read-for-the-next-instruction-co) – phuclv Apr 29 '20 at 16:00

2 Answers


The CPU simply decodes the instruction. In the case of the 8086, the first byte tells the processor how much more to fetch. The first byte does not have to determine the full length on its own, but it does have to indicate in some way that more bytes are needed, and those bytes can in turn indicate that even more are needed. With byte-oriented instruction sets like the x86 family, where you start with one byte and then see how much more you need, and where instructions are unaligned, you have to treat the instruction stream as a byte stream in order to decode it.
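A minimal sketch of that idea in Python, assuming only the four 8086 opcodes mentioned in the comments under this answer (0x24 and 0x34 take an 8-bit immediate, 0x25 and 0x35 take a 16-bit immediate). The names `INSTRUCTION_LENGTH` and `split_instructions` are invented for the example, and a real decoder would also have to handle prefixes, ModR/M bytes and every other opcode.

```python
# Total instruction length in bytes, keyed by the first (opcode) byte.
# Only four opcodes are covered; this is an illustrative subset.
INSTRUCTION_LENGTH = {
    0x24: 2,  # and al, imm8
    0x25: 3,  # and ax, imm16
    0x34: 2,  # xor al, imm8
    0x35: 3,  # xor ax, imm16
}


def split_instructions(stream):
    """Walk a byte stream and cut it into whole instructions."""
    instructions = []
    ip = 0  # plays the role of the instruction pointer
    while ip < len(stream):
        length = INSTRUCTION_LENGTH[stream[ip]]  # the opcode decides the length
        instructions.append(stream[ip:ip + length])
        ip += length  # the next opcode starts right after this instruction
    return instructions


# and al,0x17 ; and ax,0x1234 ; xor al,0xff
stream = bytes([0x24, 0x17, 0x25, 0x34, 0x12, 0x34, 0xFF])
for instruction in split_instructions(stream):
    print(instruction.hex())
```

Note that the byte 0x34 appears twice in the stream: once as immediate data inside the second instruction and once as an opcode. Only its position relative to a known instruction start tells the two apart.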

You should write yourself a very simple instruction set simulator with only a handful of instructions, maybe enough to load a register, add something to it and then loop. It is extremely educational for what you are trying to understand, and takes maybe half an hour to write, if that.
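Something along these lines, as a rough sketch in Python. The instruction set here is entirely made up for the example (it is not 8086 encoding): the opcode values, the two registers `a` and `c`, and the name `run` are all assumptions. The point is only that each opcode fixes how many operand bytes follow it.

```python
def run(program):
    """Interpret a toy variable-length instruction set and return register a.

    Invented encoding (not x86):
      0x01 imm8    load a with imm8                    (2 bytes)
      0x02 imm8    add imm8 to a                       (2 bytes)
      0x03 imm8    load counter c with imm8            (2 bytes)
      0x04 addr16  decrement c, jump to addr if c != 0 (3 bytes)
      0xFF         halt                                (1 byte)
    """
    a = 0
    c = 0
    ip = 0
    while ip < len(program):
        opcode = program[ip]
        if opcode == 0x01:
            a = program[ip + 1]
            ip += 2
        elif opcode == 0x02:
            a = (a + program[ip + 1]) & 0xFF  # assume 8-bit registers
            ip += 2
        elif opcode == 0x03:
            c = program[ip + 1]
            ip += 2
        elif opcode == 0x04:
            # 16-bit little-endian jump target
            target = program[ip + 1] | (program[ip + 2] << 8)
            c = (c - 1) & 0xFF
            ip = target if c != 0 else ip + 3
        elif opcode == 0xFF:
            break
        else:
            raise ValueError("unknown opcode 0x%02x at offset %d" % (opcode, ip))
    return a


PROGRAM = bytes([
    0x01, 0x00,        # offset 0: load a, 0
    0x03, 0x03,        # offset 2: load c, 3
    0x02, 0x05,        # offset 4: add a, 5
    0x04, 0x04, 0x00,  # offset 6: decrement c, loop back to offset 4
    0xFF,              # offset 9: halt
])
print(run(PROGRAM))  # the loop body runs three times, so this prints 15
```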

old_timer
  • This is not an answer, but I already solved it myself. I edited the question before you posted, so people, please actually read before posting. – Ashley Meah Aug 04 '14 at 05:24
  • Since this has been asked and answered so many times now, perhaps you should just delete this question or ask a moderator to delete it. – old_timer Aug 04 '14 at 18:11
  • The other comments are about getting the size of an assembly instruction where you know the full instruction, e.g. mov ax, al – Ashley Meah Aug 05 '14 at 20:23
  • My question is about getting the size of the next instruction in memory without knowing what the full instruction is. The opcode in the first instruction byte is only 7 bits, and the last bit tells the CPU whether it is a word or a byte. I'm guessing the same thing is used for the next byte of longer instructions, if supported in the respective instruction set. This is Intel, so AMD might be different. http://umcs.maine.edu/~cmeadow/courses/cos335/8086-instformat.pdf – Ashley Meah Aug 05 '14 at 20:29
  • I just upvoted your comment since it was more along the lines of my question than everyone else's. I have built one or two basic emulators; the first one had no stack, the second did. Both had fixed-length instruction sets. – Ashley Meah Aug 05 '14 at 20:37
  • @old_timer, "the case of 8086 the first byte tells the processor how much more to get", I'm probably late but where did you get that from? – Trey Nov 18 '17 at 21:06
  • The first byte is the opcode; from that opcode the processor knows the next thing to decode. Where I get that from is the Intel documentation. So if it sees 0x24, it knows the next byte is an 8-bit immediate to AND with AL. If it sees 0x25, it knows the next two bytes are a 16-bit immediate to AND with AX. With 0x34, the next byte is an immediate to XOR with AL (a two-byte instruction); 0x35 is a three-byte instruction. All shown in the documentation... – old_timer Nov 18 '17 at 22:05
  • Sometimes it has to look at the second byte to determine if there are more bytes and how many. Nothing special here; this is how processors work, in particular variable-length CISC types. You can replace 8086 in a statement like that with 6502 or Z80 or a long list of others, and how to decode, or how the processor decodes, is in the various vendor documentation. – old_timer Nov 18 '17 at 22:07
  • Reading the title of this question: the CPU knows based on its design. The assembler (which is a program that takes assembly language and turns it into machine code) knows because it is written based on the documentation; from the syntax (which is defined by the assembler) it determines the instruction, then encodes it in however many bytes it needs per the documentation. If I read the docs right, and al,17h should be 0x24,0x17 and and ax,0x1234 should be 0x25,0x34,0x12. The assembler is written to know that; the CPU is designed to decode that. – old_timer Nov 18 '17 at 22:12

TLDR:

The solution is more complex than a fixed size array.


It's all about context; this is why disassemblers like IDA use complex algorithms to do this.

Instructions are variable length for x86. But if you know the start of an instruction, you know where THAT INSTRUCTION ends. Because of that, you MAY know where the next one begins. I will explain the exceptions soon. But first, here's an example:

ASM:
mov eax, 0
xor eax, eax

Machine:
b8 00 00 00 00
31 c0

Explanation:

Moving an immediate into eax is opcode B8, followed by a 32-bit (4-byte) value to move into eax (as eax is 32 bits wide). In other words, in 32-bit code with no prefixes, mov eax, immediate will always be 5 bytes. So if you know you are starting on an instruction (not always a safe assumption), and the byte is B8, you know it is a 5-byte instruction, and that the next instruction SHOULD start 5 bytes later.

Note that both instructions (mov eax, 0 and xor eax, eax) effectively do the same thing: they clear eax to 0.

Exception:

Things can get tricky with jumps/calls. It is possible to jump to an address that is in the "middle of an instruction" and still execute valid code.

Let's look at:

mov eax, 0x90909090

machine code:

b8 90 90 90 90

If we later had a jmp instruction that jumped into the address of the 3rd byte of the above instruction (in the middle of it somewhere), it would just do 3 NOPs (no operation) and fall to the next instruction after it (not setting eax to 0x90909090). This is because a NOP is a 1-byte instruction made up of 0x90.
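To make this concrete, here is a small Python sketch of a linear sweep over the bytes from the examples above. It assumes 32-bit code with no prefixes and understands only three encodings (nop, mov r32, imm32, and the register-to-register form of xor r/m32, r32); the names `instruction_length` and `linear_sweep` are invented for the example.

```python
def instruction_length(code, offset):
    """Length of the instruction starting at offset (tiny x86 subset only)."""
    opcode = code[offset]
    if opcode == 0x90:            # nop
        return 1
    if 0xB8 <= opcode <= 0xBF:    # mov r32, imm32: opcode + 4 immediate bytes
        return 5
    if opcode == 0x31:            # xor r/m32, r32: length depends on ModR/M
        modrm = code[offset + 1]
        if modrm >> 6 == 0b11:    # mod = 11: register operand, no extra bytes
            return 2
        raise NotImplementedError("memory operands are not handled in this sketch")
    raise NotImplementedError("opcode 0x%02x is not handled in this sketch" % opcode)


def linear_sweep(code, start=0):
    """Return the offsets at which instructions begin when decoding from start."""
    offsets = []
    offset = start
    while offset < len(code):
        offsets.append(offset)
        offset += instruction_length(code, offset)
    return offsets


# mov eax, 0x90909090 ; xor eax, eax
CODE = bytes.fromhex("b8 90 90 90 90 31 c0")

print(linear_sweep(CODE, 0))  # decoding from the real start: [0, 5]
print(linear_sweep(CODE, 2))  # decoding from the 3rd byte: [2, 3, 4, 5]
```

Starting at offset 0 gives two instructions, while starting at offset 2 gives three NOPs followed by the xor. The same bytes decode differently depending on where decoding begins.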

phuclv
XlogicX
  • I already solved it myself, and you're wrong. You're overthinking and missing the key reason the CPU knows how many bytes to read. For one-byte instructions, there is a certain bit to tell the CPU to read the next byte as part of the instruction. – Ashley Meah Aug 04 '14 at 05:20
  • I may have been overthinking in the context of your application, but some of the pitfalls I described are not 'wrong.' Jumping to an address in the middle of an instruction is a tactic that malware authors and obfuscators will use. This absolutely will throw a linear analysis engine off. Here is a discussion on algorithms to do this: http://resources.infosecinstitute.com/linear-sweep-vs-recursive-disassembling-algorithm/. That aside, I read your reference (http://www.swansontec.com/sintel.html), it's quite good, and as you said, answered/solved your VM application issues. – XlogicX Aug 04 '14 at 10:47
  • Quote "TLDR: The solution is more complex than a fixed size array. It's all about context, this is why disassembler like IDA have complex algorithms to do this. Instructions are variable length for x86. But if you know the start of an instruction, you know where THAT INSTRUCTION ends. Because of that, you MAY know where the next one begins" I am a newb, I have excuses. You, however, do not. Figuring out the size is very important and not hard; you have lived in the software too long. You may know the software, but have no clue how the hardware understands it, which is as important or more. – Ashley Meah Aug 05 '14 at 20:36