Recently I have been playing around a little with an Altair 8800 emulator to understand the basics of computing, and I was wondering: how does the processor "know" whether a byte is an instruction or data?

pedro santos

2 Answers

6

The processor doesn't know whether a byte is an instruction. It just executes whatever the program counter points to.

If the program counter lands in a data zone (because of a programming error, a corrupt stack or whatever), the CPU interprets the bytes there as instructions until it stumbles on an invalid opcode, at which point it invokes a special "illegal instruction" (or similar) handler and the program crashes, the OS reboots, or whatever recovery behavior is defined takes over.

EDIT: as Ross mentioned, the Altair does not really have illegal instructions, so there the "program" would run on endlessly in chaos, reading and writing random locations, until someone pulls the plug.
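To make the "data executed as code" case concrete, here is a minimal sketch in Python (not an emulator) that decodes the bytes of a text string the way an 8080 would if the program counter landed on them. It relies on the fact that the 0x40-0x7F opcode block on the 8080 is `MOV dst,src` (with 0x76 being `HLT`), so uppercase ASCII letters all decode to register moves. The function name `decode_8080_mov` and the sample string are made up for illustration.

```python
# Register encoding order used by the 8080 in its 3-bit register fields.
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]

def decode_8080_mov(byte):
    """Decode one byte from the 0x40-0x7F block as an 8080 instruction.

    Hypothetical helper: it only covers this one opcode block.
    """
    if not 0x40 <= byte <= 0x7F:
        raise ValueError(f"{byte:#04x} is outside the block this sketch handles")
    if byte == 0x76:
        return "HLT"  # the one slot in the block that is not a MOV
    dst = REGS[(byte >> 3) & 0b111]  # bits 5-3 select the destination
    src = REGS[byte & 0b111]         # bits 2-0 select the source
    return f"MOV {dst},{src}"

message = b"HELLO"  # plain data, e.g. part of an error string
listing = [decode_8080_mov(b) for b in message]
for b, text in zip(message, listing):
    print(f"{b:02X}  {chr(b)!r}  {text}")
```

Nothing in the bytes themselves says "text"; the same five bytes are a string when printed and five register moves when executed.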

Conversely, you can load an address register with the address of a zone containing code, and when the CPU reads through that register it will only see data.

That said, one of the tough challenges of disassembling/reverse engineering is to tell the zones that the original programmers assembled as code from the ones that merely define data, like error messages, graphics... They are sometimes very close together or entwined, especially when assembly is used.

Jump tables, dynamically computed entry points, on-the-fly decryption and self-modifying code can make the task harder, even for a good disassembler like IDA Pro. You often have to help the disassembler decide, based on the fact that the data looks a lot like code (recognizable opcodes, e.g. for 68k: 0x4E75 means RTS, not likely to be data), or that the code looks a lot like data because it does not add up (incoherent, unrelated asm code lines).
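As a rough illustration of the "looks a lot like code" heuristic, the sketch below walks a blob in big-endian 16-bit words (68k instructions are word-aligned) and labels the two opcodes it knows: 0x4E75 (`RTS`) and 0x4E71 (`NOP`). Everything else is emitted as a data word. The names `KNOWN_68K_WORDS` and `label_68k_words` are invented for this example, and a real disassembler does far more than match two constants.

```python
# Only two opcodes are recognized here; this is a toy pattern matcher,
# not a disassembler.
KNOWN_68K_WORDS = {0x4E71: "NOP", 0x4E75: "RTS"}

def label_68k_words(blob):
    """Label each aligned 16-bit word as a known opcode or as raw data."""
    labels = []
    for offset in range(0, len(blob) - 1, 2):
        word = int.from_bytes(blob[offset:offset + 2], "big")
        labels.append(KNOWN_68K_WORDS.get(word, f"DC.W ${word:04X}"))
    return labels

sample = b"NqNqNqNqNqNu"  # looks like text, also decodes cleanly as code
labels = label_68k_words(sample)
print(labels)
```

A run of words that all decode to plausible opcodes and end in `RTS` is a hint, not proof: the same bytes are also a perfectly valid ASCII string.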

Jean-François Fabre
  • 4
    Very good, but still ... "*actually* code and *actually* data .." makes John von Neumann spin in his grave like a hard disk powering up. ([Related answer](http://stackoverflow.com/a/26826328/2564301)) – Jongware Sep 20 '16 at 19:35
  • 1
    Sorry - I already upvoted & was just having a bit of fun at your expense :P The crux of the key of the core of the answer is that you can even compile some assembly code (making it *binary* or, if you will, "real" code) and *still print it as an error string*. A CPU does not know nor care. – Jongware Sep 20 '16 at 19:40
  • 2
    I see what you mean. I'm more familiar with the 68000 instruction set, and when I see `NqNqNqNqNqNu` I know this is a series of `NOP`s followed by an `RTS`. What I wanted to outline is that if you reverse engineer some code, mistakenly declare data as code, and re-assemble it with the optimizer on, it will break the original program because the data will probably be changed. If you successfully identify all the "data" and "code" zones, then you can assemble the new executable into a faster one :) – Jean-François Fabre Sep 20 '16 at 19:48
  • 3
    The Altair's Intel 8080 didn't support any sort of illegal instruction handler. Every opcode did something, but undocumented ones mostly duplicated what some other documented opcode did. – Ross Ridge Sep 21 '16 at 02:14
4

What the processor sees as an instruction is the byte at the address given by the program counter. Depending on the instruction, the bytes that follow are either the next instruction or data belonging to the current instruction (an "immediate operand"). In the latter case, the program counter is advanced past the data.
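The way the program counter steps over operands can be sketched as a simple forward walk. The Python below is only an illustration: it knows the lengths of four real 8080 opcodes (`NOP` 0x00, `MVI A,d8` 0x3E, `HLT` 0x76, `JMP a16` 0xC3), walks memory from a start address without following jumps, and marks each byte as opcode or operand. The names `LENGTHS` and `classify` are assumptions made for the example.

```python
# Instruction lengths in bytes for a tiny subset of 8080 opcodes:
# NOP, MVI A,d8, HLT, JMP a16.
LENGTHS = {0x00: 1, 0x3E: 2, 0x76: 1, 0xC3: 3}

def classify(memory, pc=0):
    """Walk forward from pc and mark each byte as 'opcode' or 'operand'.

    Hypothetical helper: jumps are not followed, and any opcode outside
    the subset above raises KeyError.
    """
    roles = ["?"] * len(memory)
    while pc < len(memory):
        length = LENGTHS[memory[pc]]
        roles[pc] = "opcode"
        for i in range(pc + 1, min(pc + length, len(memory))):
            roles[i] = "operand"  # data belonging to the current instruction
        pc += length              # step over the operand bytes
    return roles

# MVI A,0x76 ; JMP 0x0005 ; HLT
memory = [0x3E, 0x76, 0xC3, 0x05, 0x00, 0x76]
roles = classify(memory)
for addr, (byte, role) in enumerate(zip(memory, roles)):
    print(f"{addr:04X}  {byte:02X}  {role}")
```

The value 0x76 appears twice: at address 1 it is the operand of `MVI`, at address 5 it is a `HLT`. Only its position relative to the program counter decides which.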

Some instructions use the contents of a register pair (HL or DE in the case of an 8080) to determine which memory address to access. Whatever is fetched that way is treated as data by the processor.
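A minimal sketch of that register-indirect case, again in Python and covering only three real 8080 opcodes (`LXI H,d16` 0x21, `MOV A,M` 0x7E, `HLT` 0x76). The `run` function and the sample program are made up for illustration. The program points HL at its own first byte, so the byte that was just executed as an opcode is then read back as plain data.

```python
def run(memory, max_steps=100):
    """Execute a tiny 8080 subset and return the accumulator at HLT.

    Hypothetical sketch: only LXI H, MOV A,M and HLT are handled.
    """
    pc, a, hl = 0, 0, 0
    for _ in range(max_steps):
        opcode = memory[pc]          # fetched via PC: treated as an instruction
        pc += 1
        if opcode == 0x21:           # LXI H,d16 (low byte first)
            hl = memory[pc] | (memory[pc + 1] << 8)
            pc += 2
        elif opcode == 0x7E:         # MOV A,M: read the byte HL points to
            a = memory[hl]           # fetched via HL: treated as data
        elif opcode == 0x76:         # HLT
            return a
        else:
            raise ValueError(f"opcode {opcode:#04x} is not in this sketch")
    raise RuntimeError("step limit reached without HLT")

# LXI H,0x0000 ; MOV A,M ; HLT
program = [0x21, 0x00, 0x00, 0x7E, 0x76]
result = run(program)
print(hex(result))  # 0x21
```

The accumulator ends up holding 0x21, the opcode byte of `LXI H` itself, because the read went through HL rather than through the program counter.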

Wim Ton