3

I am trying to write a disassembler, and I was wondering how the processor differentiates opcodes from data bytes.

For example, this is the byte representation of "Hello World": 0x48 0x65 0x6c 0x6c 0x6f 0x20 0x57 0x6f 0x72 0x6c 0x64 0x00

But how does the processor "know" that it is saying "Hello World" and not actually this: _ _ INS INS OUTS AND _ OUTS JB INS _ ADD

An explanation is very welcome.

Ian Rehwinkel
  • 7
    It does not know. If the instruction pointer ever points to those bytes then the program is going to crash. Won't happen in practice, but you don't know the practice unless you can simulate the program's execution. In general, being able to distinguish code from data in a disassembler is *very* difficult and never completely reliable when you don't know enough about the compiler that generated the code. – Hans Passant Aug 14 '18 at 13:22
  • It doesn't even know whether some code is 16, 32 or 64 bit; it just loads whatever bytes are at the current instruction pointer and decodes them as instructions in the current mode. [Why are disassembled data becoming instructions?](https://stackoverflow.com/q/36830255/995714). Note: x86 instructions are often longer than 1 byte, with a ModRM byte after the opcode byte. For example, with the above sequence you'll get a 3-byte AND instead of `AND _ OUTS` as you think – phuclv Aug 14 '18 at 15:50
  • 2
    There was an Apple ][ expansion card which was so tight on code space (256 bytes I think) that the coder made a dodge to save space, by branching into the middle of an instruction to use it as a *different* instruction and squeeze a `SEC; RTS;` from it. A disassembler could be confused by the branch. – Weather Vane Aug 14 '18 at 18:12

2 Answers

4

It cannot. In a Harvard architecture, the explicit separation of data and code prevents this problem; in a von Neumann architecture, code is data.

It's up to the programmer to not make the CPU execute unwanted code/data.
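To make "code is data" concrete, here is a minimal Python sketch that views the twelve bytes from the question in both ways. The `NAIVE_MNEMONICS` table is a hypothetical helper that only covers the single-byte readings named in the question; it is not a real x86 decoder, which would also have to handle prefixes, ModRM bytes, and operands.

```python
# The bytes from the question: "Hello World" followed by a NUL terminator.
data = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20,
              0x57, 0x6F, 0x72, 0x6C, 0x64, 0x00])

# View 1: treat the bytes as ASCII text (dropping the trailing NUL).
as_text = data[:-1].decode("ascii")

# View 2: treat every byte as a one-byte opcode, as the question does.
# Hypothetical, deliberately incomplete table; anything else prints "_".
NAIVE_MNEMONICS = {
    0x00: "ADD",
    0x20: "AND",
    0x6C: "INS",
    0x6F: "OUTS",
    0x72: "JB",
}
as_opcodes = " ".join(NAIVE_MNEMONICS.get(b, "_") for b in data)

print(as_text)
print(as_opcodes)
```

Nothing in the bytes themselves selects one view over the other; which one applies depends only on whether the instruction pointer ever reaches them.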

Margaret Bloom
3

The processor knows because the entry points are known. The processor decodes in execution order, which is also how you should disassemble a variable-length instruction set. With a fixed-length instruction set you can just walk through memory linearly from the entry point, but with variable length you need to follow execution order. This is not foolproof, of course; it is pretty easy to trip up a disassembler, so be aware that this can happen and keep track of what you have decoded. I generally make a table marking which bytes are the entry point of an instruction (the opcode, in some ISAs) and which are non-entry bytes, so that if a branch lands in the middle of an already-decoded instruction I can stop that path of the disassembler there. Naturally, you have to cover all the possible paths.
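The execution-order approach with a table of entry and non-entry bytes can be sketched roughly as follows. This is only an illustration under stated assumptions: the five-opcode ISA, the `OPCODES` table, and the `disassemble` function are all made up for the example, and branch targets are assumed to be absolute one-byte addresses.

```python
# Hypothetical toy ISA: opcode -> (mnemonic, total length in bytes).
OPCODES = {
    0x00: ("NOP", 1),
    0x01: ("LOAD", 2),   # LOAD imm8
    0x02: ("JMP", 2),    # JMP addr8, unconditional
    0x03: ("JZ", 2),     # JZ addr8, conditional
    0xFF: ("HALT", 1),
}

def disassemble(mem, entry):
    starts = {}       # address -> (mnemonic, operand) for instruction entry bytes
    interior = set()  # addresses of non-entry (operand) bytes
    conflicts = []    # addresses where a path had to be abandoned
    work = [entry]    # pending entry points still to be followed
    while work:
        pc = work.pop()
        while pc not in starts:  # stop once we rejoin already-decoded code
            if pc in interior or pc >= len(mem) or mem[pc] not in OPCODES:
                conflicts.append(pc)  # mid-instruction, out of range, or unknown
                break
            name, length = OPCODES[mem[pc]]
            tail = range(pc + 1, pc + length)
            if pc + length > len(mem) or any(a in starts for a in tail):
                conflicts.append(pc)  # truncated or overlapping instruction
                break
            operand = mem[pc + 1] if length == 2 else None
            starts[pc] = (name, operand)
            interior.update(tail)
            if name == "HALT":
                break                 # this path ends here
            if name == "JMP":
                pc = operand          # only the target is reachable
                continue
            if name == "JZ":
                work.append(operand)  # follow the taken branch later
            pc += length              # fall through to the next instruction
    return starts, interior, conflicts

# LOAD 5; JZ 9; JMP 9; the data bytes "Hi\0"; HALT at address 9.
mem = bytes([0x01, 0x05, 0x03, 0x09, 0x02, 0x09, 0x48, 0x69, 0x00, 0xFF])
starts, interior, conflicts = disassemble(mem, 0)
for addr in sorted(starts):
    print(addr, *[x for x in starts[addr] if x is not None])
data_addrs = [a for a in range(len(mem)) if a not in starts and a not in interior]
```

In this example the bytes at addresses 6 to 8 are never reached by any path, so they are left as data, even though 0x00 would decode as `NOP` in a linear sweep. Indirect jumps and self-modifying code are not handled; a real tool needs extra heuristics there.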

With respect to opcodes vs. data: so long as the toolchain and the programmer did their job right, each instruction hands off to the next, jumping over data areas as needed.

Processors are very dumb. They don't have a lot of real functions: some ALU operations, reading from and writing to addresses, moving data in and out of registers. Half the job is feeding them programs that follow the rules.

old_timer