For example, at the address 0x762C51, there is the instruction call sub_E91E50. In bytes, this is E8 FA F1 72 00.

Next, at the address 0x762C56, there is the instruction push 0. In bytes, this is 6A 00.

Now, when a C++ program reads a function like this, all it has is the raw bytes: E8 FA F1 72 00 6A 00

How can I determine where the first instruction ends and the next begins?

Peter Cordes
  • Short answer: you don't, unless you run the bytes through a parser that knows how to convert them back into human-readable instructions, or you memorize the byte patterns of the various instructions. – Remy Lebeau Jan 31 '20 at 00:32
  • @RemyLebeau By that are you saying you could but it would require a full disassembler to determine this? – Luke Mitchell Jan 31 '20 at 00:34
  • If you are just reading the raw bytes as-is, then yes. You will need a tool to parse the bytes and show you what instructions they represent, or at least the separations between the instructions. – Remy Lebeau Jan 31 '20 at 00:35
  • @RemyLebeau So if there is a tool that can separate the instructions, isn't that what I'm asking? How is it possible to show the separations? – Luke Mitchell Jan 31 '20 at 00:41
  • You will have to find such a tool, or make your own. Asking for tool recommendations is off-topic for StackOverflow, though. If you want to write such a tool, you can ask new questions about that separately. – Remy Lebeau Jan 31 '20 at 00:44
  • You should thank the designers at Intel for the variable length instructions. Other processors have mostly fixed length. – Thomas Matthews Jan 31 '20 at 00:45
  • I recommend obtaining a reference on the assembly language, especially one that shows the bit representation of the assembly instructions. – Thomas Matthews Jan 31 '20 at 00:46
  • It is not correct that x86 is the only variable-length ISA, or even the only one in use: MIPS and ARM, to name a couple, RISC-V most definitely, plus a slew of others. I'm struggling to think of one that doesn't have this problem. (Yes, you can make projects in MIPS, ARM and RISC-V that do not have this problem, but in general if you examine their instruction sets there are problems; not as bad as the old 8-bit-instruction-based ISAs like x86, but still a problem.) – old_timer Jan 31 '20 at 02:48
  • [Instruction Lengths](https://stackoverflow.com/q/4567903/995714), [How to calculate instruction opcodes length in bytes](https://stackoverflow.com/q/45801447/995714), [Get size of x86-64 instruction](https://stackoverflow.com/q/44228285/995714), [Get size of assembly instructions](https://stackoverflow.com/q/23788236/995714), [How to tell how many symbolic instructions are represented in hex machine code?](https://stackoverflow.com/q/55146134/995714) – phuclv Jan 31 '20 at 03:14
  • The problem of knowing whether you are pointing at the start of an instruction or the middle, or data, is logically prior to the problem of variable-length instructions, and there is no general solution. You have to start disassembling at a known entry point or label. Too broad. – user207421 Jan 31 '20 at 03:37
  • Broadly speaking, you cannot even assume that all in .text is code as some can be constants, even with fixed length instructions. – Erik Eidt Jan 31 '20 at 05:17

1 Answer

With a variable-length instruction set you can't do this reliably in the general case. Many tools try, but it is often trivial to fool them if you set out to. Compiler-generated code tends to disassemble more cleanly.
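
To make that concrete, here is a minimal Python sketch of a front-to-back (linear sweep) splitter. The names `LENGTHS`, `linear_sweep`, and `call_target` are made up for this illustration, and the table knows only five one-byte opcodes; a real x86 decoder also has to handle prefixes, ModRM/SIB bytes, and multi-byte opcodes. It splits the seven bytes from the question correctly, then gets fooled by a single junk byte placed after a jump.

```python
# Hypothetical, deliberately tiny table: opcode byte -> total instruction length.
# This is an assumption for illustration, not a real x86 length decoder.
LENGTHS = {
    0x6A: 2,  # push imm8
    0x90: 1,  # nop
    0xC3: 1,  # ret
    0xE8: 5,  # call rel32
    0xEB: 2,  # jmp rel8
}


def linear_sweep(code):
    """Split code front to back, assuming offset 0 starts an instruction."""
    out = []
    pos = 0
    while pos < len(code):
        length = LENGTHS.get(code[pos])
        if length is None or pos + length > len(code):
            break  # unknown opcode or truncated instruction: stop, don't guess
        out.append((pos, code[pos:pos + length]))
        pos += length
    return out


def call_target(addr, insn):
    """call rel32 target: address of the next instruction plus the signed offset."""
    rel = int.from_bytes(insn[1:5], "little", signed=True)
    return addr + len(insn) + rel


# The bytes from the question: call sub_E91E50 ; push 0
question = bytes.fromhex("E8 FA F1 72 00 6A 00")
for offset, insn in linear_sweep(question):
    print(hex(0x762C51 + offset), insn.hex(" "))
print(hex(call_target(0x762C51, question[:5])))  # 0xe91e50

# jmp +1 ; one junk byte (E8) ; push 0 ; ret ; nop
tricky = bytes.fromhex("EB 01 E8 6A 00 C3 90")
for offset, insn in linear_sweep(tricky):
    print(offset, insn.hex(" "))
```

On `tricky`, the sweep reports a `call` at offset 2 that swallows the rest of the buffer, while the CPU would jump over that `E8` byte and execute `push 0` at offset 3 followed by `ret`.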

The best approach, which still won't necessarily produce a complete disassembly, is to go in execution order. You need a known-correct entry point; from there you follow the execution paths, watch for collisions (a path that lands in the middle of bytes already decoded as another instruction), and set those aside for a human to figure out.
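
A rough Python sketch of that execution-order approach, under loud assumptions: `OPCODES` and `traverse` are hypothetical names, the table covers six opcodes only, `call` is treated as if the callee always returns, and indirect jumps are not modelled at all.

```python
from collections import deque

# Hypothetical mini-decoder: opcode byte -> (total length, control-flow kind).
OPCODES = {
    0x6A: (2, "plain"),   # push imm8: continue with the next instruction
    0x90: (1, "plain"),   # nop
    0xC3: (1, "stop"),    # ret: this path ends here
    0x74: (2, "branch"),  # jz rel8: fall-through and target
    0xE8: (5, "branch"),  # call rel32: assume the callee returns
    0xEB: (2, "jump"),    # jmp rel8: target only
}


def traverse(code, entry=0):
    """Decode in execution order from a trusted entry offset."""
    starts = {}      # offset -> length of every instruction reached
    owner = {}       # byte offset -> start of the instruction covering it
    collisions = []  # (offset, start of the instruction it overlaps)
    unknown = []     # offsets whose opcode the table cannot decode
    work = deque([entry])
    while work:
        pos = work.popleft()
        if not 0 <= pos < len(code) or pos in starts:
            continue  # outside the buffer, or already decoded
        info = OPCODES.get(code[pos])
        if info is None or pos + info[0] > len(code):
            unknown.append(pos)
            continue
        length, kind = info
        nxt = pos + length
        clash = [owner[i] for i in range(pos, nxt) if i in owner]
        if clash:
            collisions.append((pos, clash[0]))  # set aside for a human
            continue
        starts[pos] = length
        for i in range(pos, nxt):
            owner[i] = pos
        if kind in ("plain", "branch"):
            work.append(nxt)  # fall-through path
        if kind in ("branch", "jump"):
            rel = int.from_bytes(code[pos + 1:nxt], "little", signed=True)
            work.append(nxt + rel)  # taken path
    return starts, collisions, unknown


# jmp +1 ; junk byte E8 ; push 0 ; ret ; nop
print(traverse(bytes.fromhex("EB 01 E8 6A 00 C3 90")))
# jz +1 ; E8 ... : the fall-through and the branch target disagree
print(traverse(bytes.fromhex("74 01 E8 6A 00 C3 90")))
```

In the first buffer the junk byte at offset 2 and the trailing `nop` are never reached, which is the "not necessarily complete" part. In the second, the fall-through decodes a `call` at offset 2 and the branch target at offset 3 lands inside it, so offset 3 is reported as a collision instead of being guessed at.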

Simulation is even better and may give you better coverage in some areas, but it will also leave gaps where an execution-order disassembler wouldn't.

Even the GNU disassembler (objdump) gets tripped up by variable-length instruction sets (next time, please specify the target/ISA, as the assembly tag asks). So whatever a "full disassembler" is, it can't do this in general either.

If someone has told you where the code starts (a class assignment, say), or you compiled a function to an object file and the label in the disassembly makes you comfortable that it is a valid entry point, then you can start disassembling there in execution order. Bear in mind that an unlinked object contains incomplete instructions whose addresses get filled in later; once it is linked, you can treat a function label's address as an entry point and disassemble in execution order from there.

C++ really has nothing to do with this. If you have a sequence of bytes and you are sure of the entry point, it doesn't much matter how that code was created, unless it intentionally contains anti-disassembly tricks (handwritten or generated by a compiler or tool). Generally this is not a problem, but technically a tool could do it, and it is trivial for a human to do by hand.

halfer
old_timer