Is there a way to programmatically read a given number of instructions from a binary executable file on the x86 architecture?

If I had a binary of a simple C program hello.c:

#include <stdio.h>

int main(){
    printf("Hello world\n");
    return 0;
}

After compilation with gcc, the disassembled function main looks like this:

000000000000063a <main>:
 63a:   55                      push   %rbp
 63b:   48 89 e5                mov    %rsp,%rbp
 63e:   48 8d 3d 9f 00 00 00    lea    0x9f(%rip),%rdi        # 6e4 <_IO_stdin_used+0x4>
 645:   e8 c6 fe ff ff          callq  510 <puts@plt>
 64a:   b8 00 00 00 00          mov    $0x0,%eax
 64f:   5d                      pop    %rbp
 650:   c3                      retq   
 651:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 658:   00 00 00 
 65b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Is there an easy way in C to read, for example, the first three instructions (meaning the bytes 55, 48, 89, e5, 48, 8d, 3d, 9f, 00, 00, 00) from main? It is not guaranteed that the function looks like this: the first instructions may have entirely different opcodes and sizes.

Jean-François Fabre
Topper Harley
  • Using a framework that performs disassembly would probably be the easiest path. I would recommend capstone (http://www.capstone-engine.org/), it is easy to use, understands multiple file formats and is usable on multiple platforms. – thurizas Mar 07 '18 at 14:25
  • it is just a file, you can read it like any other file. main the label/symbol means nothing to the processor, so you have to have a file format as you have above that contains symbols and their offsets in the file/memory space. Parse through that based on the file format then find that offset/address in the loadable binary portion of the file and there are your bytes. just a matter of doing your research – old_timer Mar 07 '18 at 15:11
  • @old_timer I know how to read binary file and parse ELF headers. My problem is, that I need to read a number of _whole_ instructions on an architecture with a variable instruction length. – Topper Harley Mar 07 '18 at 15:43
  • that is a completely different question, for which neither the answer you checked nor the answer i just added are relevant. – old_timer Mar 07 '18 at 18:20
  • you have to disassemble, and it has to be in execution order if you want to have some chance of success. – old_timer Mar 07 '18 at 18:20
  • As your example shows, you start at a known-good entry point. We see 0x55; you look that up and disassemble it as a single-byte instruction, then look at the next opcode, which may require examining the next byte or few before you can determine how big that instruction is. When you hit a conditional branch, make a note of the destination address if possible; when you hit an unconditional branch, stop that disassembly there and go back and follow any unresolved code fragments based on branch destinations. Repeat until done, and assume the rest is either data or code, but you can't tell which. – old_timer Mar 07 '18 at 18:25
  • When I write a disassembler for instruction sets like this, I keep a table with an entry for every byte, and as I go through I mark each byte as the start of an instruction or as part of an instruction but not the start. If a branch destination later lands in the middle of an instruction, declare an error and don't add that to the list. For today's compiler-generated code this is rarely a problem, but disassemble some old ROM (a stand-up arcade game) or code from someone who intentionally doesn't want you to disassemble it, and you can fall into this issue. – old_timer Mar 07 '18 at 18:27
  • @old_timer: modern x86 is so convoluted that even that isn't sufficient. In 32-bit mode, C4 and C5 could be the opcode for LDS or LES, or they could be the first byte of a VEX prefix for an AVX instruction. You have to check some bits in the ModR/M byte to tell the difference. (https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions#comment-1919527) – Peter Cordes Mar 08 '18 at 03:04
  • @PeterCordes certainly, assuming you know what mode/variant/add-ons, etc, THEN you have to go execution order. The processor certainly isnt confused when it hits one of these opcodes, nor should the disassembler be. So long as you are parsing using the instruction set/mode/whatever_other_term_you_prefer that matches the code. – old_timer Mar 08 '18 at 04:15
  • @PeterCordes the first instructions shouldn't be an issue, though. but yes, the only way to properly disassemble is "formal execution", else you can mistake data for code... annoying issue for years in my retrogaming/retrocomputing area :) – Jean-François Fabre Mar 08 '18 at 06:49
  • @old_timer: I meant that a table lookup based on bytes isn't sufficient. The same leading byte can mean different things depending on following bits. But you already need to decode the ModR/M to find instruction lengths, so maybe my point wasn't as interesting as I thought. – Peter Cordes Mar 08 '18 at 06:56
  • @PeterCordes which is why I said after looking at the opcode you have to examine the next byte or few, cant tell with x86 from just the first byte. Can be from one to N number of bytes before a specific instruction becomes clear and you know how many more bytes after are immediate if any, one to n bytes before you know the full length of the instruction. – old_timer Mar 08 '18 at 13:01

2 Answers

This prints the first 10 bytes of the main function by taking the address of the function, converting it to a pointer to unsigned char, and printing each byte in hex.

This small snippet doesn't count the instructions. For that you would need an instruction size table (not very difficult, just tedious unless you find one already made; see "What is the size of each asm instruction?") to determine the size of each instruction from its leading bytes. On x86 the first byte alone is not always enough.

(unless, of course, the processor you're targeting has a fixed instruction size, which makes the problem trivial to solve)

Debuggers have to decode operands as well, but in some cases like step or trace, I suspect they have a table handy to compute the next breakpoint address.

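#include <stdio.h>

int main(){
    printf("Hello world\n");
    /* View the code of main as raw bytes. Converting a function pointer to an
       object pointer is not guaranteed by ISO C, but works on common x86 targets. */
    const unsigned char *start = (const unsigned char *)&main;
    int i;
    for (i=0;i<10;i++)
    {
       printf("%02x\n",start[i]);   /* one byte per line, two hex digits */
    }
    return 0;
}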

output:

Hello world
55
89
e5
83
e4
f0
83
ec
20
e8

seems to match the disassembly :)

00401630 <_main>:
  401630:   55                      push   %ebp
  401631:   89 e5                   mov    %esp,%ebp
  401633:   83 e4 f0                and    $0xfffffff0,%esp
  401636:   83 ec 20                sub    $0x20,%esp
  401639:   e8 a2 01 00 00          call   4017e0 <___main>
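
The "instruction size table" idea can be sketched for a handful of opcodes. This is only an illustration, assuming 64-bit mode and covering just the opcodes that appear in the question's main (push/pop, mov, lea, call, ret); the names insn_length and insns_bytes are made up for this example, and a real decoder needs the full opcode maps, legacy prefixes, VEX and so on.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the ModRM byte plus any SIB byte and displacement that follow it.
   Returns 0 if the buffer is too short. 64-bit addressing is assumed. */
static size_t modrm_length(const uint8_t *p, size_t avail)
{
    if (avail < 1) return 0;
    uint8_t mod = p[0] >> 6, rm = p[0] & 7;
    size_t len = 1;                               /* the ModRM byte itself */

    if (mod == 3) return len;                     /* register operand only */
    if (rm == 4) {                                /* a SIB byte follows */
        if (avail < 2) return 0;
        len += 1;
        if (mod == 0 && (p[1] & 7) == 5) len += 4; /* no base register: disp32 */
    } else if (mod == 0 && rm == 5) {
        len += 4;                                 /* RIP-relative: disp32 */
    }
    if (mod == 1) len += 1;                       /* disp8 */
    if (mod == 2) len += 4;                       /* disp32 */
    return len <= avail ? len : 0;
}

/* Length in bytes of the instruction at p, or 0 if it is not one of the few
   opcodes this sketch knows about. */
size_t insn_length(const uint8_t *p, size_t avail)
{
    size_t i = 0;
    int rex_w = 0;

    if (i < avail && (p[i] & 0xF0) == 0x40) {     /* REX prefix (64-bit mode) */
        rex_w = (p[i] & 0x08) != 0;
        i++;
    }
    if (i >= avail) return 0;
    uint8_t op = p[i++];

    if ((op & 0xF0) == 0x50 || op == 0xC3)        /* push/pop reg, ret */
        return i;
    if (op == 0xE8 || op == 0xE9)                 /* call/jmp rel32 */
        return i + 4 <= avail ? i + 4 : 0;
    if ((op & 0xF8) == 0xB8) {                    /* mov reg, imm32 (imm64 with REX.W) */
        size_t imm = rex_w ? 8 : 4;
        return i + imm <= avail ? i + imm : 0;
    }
    if (op == 0x89 || op == 0x8B || op == 0x8D) { /* mov r/m,r; mov r,r/m; lea */
        size_t m = modrm_length(p + i, avail - i);
        return m ? i + m : 0;
    }
    return 0;                                     /* unknown opcode */
}

/* Number of bytes occupied by the first `count` instructions, or 0 on failure. */
size_t insns_bytes(const uint8_t *p, size_t avail, unsigned count)
{
    size_t total = 0;
    while (count--) {
        size_t n = insn_length(p + total, avail - total);
        if (n == 0) return 0;
        total += n;
    }
    return total;
}
```

Fed the bytes of main from the question, insns_bytes(code, size, 3) yields 11, i.e. the bytes 55, 48 89 e5 and 48 8d 3d 9f 00 00 00. Anything outside this tiny subset makes the sketch give up, which is why a ready-made disassembler library is the practical choice.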
Jean-François Fabre
  • Why the downvote? It looks like the right answer to the OP's question... – andreee Mar 07 '18 at 14:00
  • @andreee it partially answer, as I now see, because OP needs the 3 first instructions. An answer like this would need a table to get the size of the instructions. I'm _not_ doing this :) – Jean-François Fabre Mar 07 '18 at 14:04
  • Okay, agreed :-) A rationale from the downvoter would have been helpful though. – andreee Mar 07 '18 at 14:10
  • Yes, but people are afraid of revenge downvotes. Not from me anyway!! After editing my answer, I feel I'm fully answering. Finding an instruction size table for OP CPU is off-topic. – Jean-François Fabre Mar 07 '18 at 14:16
  • variable instruction length like x86 you really have to disassemble in execution order, not linearly through memory, adding to your comment, not contradicting it. – old_timer Mar 07 '18 at 15:09
  • the OP said to read bytes from a file, not runtime from memory. – old_timer Mar 07 '18 at 15:09
  • @old_timer: "is there a way to read given amount of instructions from a binary executable file on x86 architecture programmatically?" yes, I supposed that you could hack into the executable you want to read to. If you cannot why not using `objdump` then? – Jean-François Fabre Mar 07 '18 at 15:25
  • Thank you very much. I'll try the way with the table. I asked this question because I wondered whether there was a library to help me with this (like there is a library `elf.h` for parsing ELF files), but it seems there's not. – Topper Harley Mar 07 '18 at 15:26
  • @Jean-FrançoisFabre good question, but that is what the OP was asking: whether it is possible. It certainly is, and there happen to be tools to do it. Why they didn't want to use the tool, I don't know. It's not hacking in any way; it's a file format, and you read the format, just like a .jpg or .wav or .mpg or whatever. It's just a file. – old_timer Mar 07 '18 at 17:58
  • @TopperHarley maybe there are but they're architecture-dependent and probably not very easy to build (I tried to build a compiler toolchain once, only half succeeded, but it took me 3 days with a zillion packages to download, configure & compile). I wouldn't bother with that, and would use `objdump` + some post-processing. – Jean-François Fabre Mar 07 '18 at 19:01
  • Note that ISO C doesn't guarantee this is portable. A Harvard architecture might have a different address space for function pointers vs. data pointers. IIRC, the standard doesn't talk about functions as objects having an object representation. It does of course work in normal cases on normal architectures. **See [C++ How To Read First Couple Bytes Of Function? (32 bit Machine)](https://stackoverflow.com/questions/49022765/c-how-to-read-first-couple-bytes-of-function-32-bit-machine/49022975#49022975) and comments on that for discussion of the caveats.** – Peter Cordes Mar 07 '18 at 23:57
  • @PeterCordes interesting. this slightly looks like a duplicate of that one too... – Jean-François Fabre Mar 08 '18 at 06:41
  • The answers should maybe be swapped, because the other one just wants to read bytes, but this one wants a number of instructions. I gave up without deciding to close either as a dup of the other, because it's a bit of a mess. – Peter Cordes Mar 08 '18 at 06:44
.globl _start
_start:
    bl main
    b .

.globl main
main:
    add r1,#1
    add r2,#1
    add r3,#1
    add r4,#1
    b main

I intentionally used the wrong architecture (ARM); the architecture doesn't matter here, the file format does. I built this into an ELF file, which is a very popular format and is simply a file format. That is what I understood your question to be about: reading a file, not modifying the binary so that the program reads itself from memory at run time.

ELF is very popular, and there are tools that read it, which you appear to know how to run.

Disassembly of section .text:

00001000 <_start>:
    1000:   eb000000    bl  1008 <main>
    1004:   eafffffe    b   1004 <_start+0x4>

00001008 <main>:
    1008:   e2811001    add r1, r1, #1
    100c:   e2822001    add r2, r2, #1
    1010:   e2833001    add r3, r3, #1
    1014:   e2844001    add r4, r4, #1
    1018:   eafffffa    b   1008 <main>

If I hexdump the file, though:

00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 28 00 01 00 00 00  00 10 00 00 34 00 00 00  |..(.........4...|
00000020  c0 11 00 00 00 02 00 05  34 00 20 00 01 00 28 00  |........4. ...(.|
00000030  06 00 05 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 1c 10 00 00  1c 10 00 00 05 00 00 00  |................|
00000050  00 00 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2  |............. ..|
00001010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11 00 00  |.0...@......A...|
00001020  00 61 65 61 62 69 00 01  07 00 00 00 08 01 00 00  |.aeabi..........|
00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001040  00 00 00 00 00 10 00 00  00 00 00 00 03 00 01 00  |................|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 03 00 02 00  |................|
00001060  01 00 00 00 00 00 00 00  00 00 00 00 04 00 f1 ff  |................|
00001070  06 00 00 00 00 10 00 00  00 00 00 00 00 00 01 00  |................|
00001080  18 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
00001090  09 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010a0  17 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010b0  55 00 00 00 00 10 00 00  00 00 00 00 10 00 01 00  |U...............|
000010c0  23 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |#...............|
000010d0  2f 00 00 00 08 10 00 00  00 00 00 00 10 00 01 00  |/...............|
000010e0  34 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |4...............|
000010f0  3c 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |<...............|
00001100  43 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |C...............|
00001110  48 00 00 00 00 00 08 00  00 00 00 00 10 00 01 00  |H...............|
00001120  4f 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |O...............|
00001130  00 73 6f 2e 6f 00 24 61  00 5f 5f 62 73 73 5f 73  |.so.o.$a.__bss_s|
00001140  74 61 72 74 5f 5f 00 5f  5f 62 73 73 5f 65 6e 64  |tart__.__bss_end|
00001150  5f 5f 00 5f 5f 62 73 73  5f 73 74 61 72 74 00 6d  |__.__bss_start.m|
00001160  61 69 6e 00 5f 5f 65 6e  64 5f 5f 00 5f 65 64 61  |ain.__end__._eda|
00001170  74 61 00 5f 65 6e 64 00  5f 73 74 61 63 6b 00 5f  |ta._end._stack._|
00001180  5f 64 61 74 61 5f 73 74  61 72 74 00 00 2e 73 79  |_data_start...sy|
00001190  6d 74 61 62 00 2e 73 74  72 74 61 62 00 2e 73 68  |mtab..strtab..sh|
000011a0  73 74 72 74 61 62 00 2e  74 65 78 74 00 2e 41 52  |strtab..text..AR|
000011b0  4d 2e 61 74 74 72 69 62  75 74 65 73 00 00 00 00  |M.attributes....|
000011c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000011e0  00 00 00 00 00 00 00 00  1b 00 00 00 01 00 00 00  |................|
000011f0  06 00 00 00 00 10 00 00  00 10 00 00 1c 00 00 00  |................|
00001200  00 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00001210  21 00 00 00 03 00 00 70  00 00 00 00 00 00 00 00  |!......p........|
00001220  1c 10 00 00 12 00 00 00  00 00 00 00 00 00 00 00  |................|
00001230  01 00 00 00 00 00 00 00  01 00 00 00 02 00 00 00  |................|
00001240  00 00 00 00 00 00 00 00  30 10 00 00 00 01 00 00  |........0.......|
00001250  04 00 00 00 05 00 00 00  04 00 00 00 10 00 00 00  |................|
00001260  09 00 00 00 03 00 00 00  00 00 00 00 00 00 00 00  |................|
00001270  30 11 00 00 5c 00 00 00  00 00 00 00 00 00 00 00  |0...\...........|
00001280  01 00 00 00 00 00 00 00  11 00 00 00 03 00 00 00  |................|
00001290  00 00 00 00 00 00 00 00  8c 11 00 00 31 00 00 00  |............1...|
000012a0  00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
000012b0

You can search for the file format and find a lot of information on Wikipedia, with a bit more at one of the pages it links to.

useful header information

00 10 00 00 entry
34 00 00 00 phoff
c0 11 00 00 shoff
00 02 00 05 flags
34 00 ehsize
20 00 phentsize
01 00 phnum
28 00 shentsize
06 00 shnum
05 00 shstrndx
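
Reading these header fields from the raw bytes can be sketched as follows, assuming a little-endian 32-bit ELF file like the one dumped above. The struct elf32_info and the helper names are invented for this example; on Linux the same layout is available as Elf32_Ehdr in <elf.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The ELF32 header fields used in this walk-through. */
struct elf32_info {
    uint32_t entry, phoff, shoff, flags;
    uint16_t ehsize, phentsize, phnum, shentsize, shnum, shstrndx;
};

/* Little-endian readers, so the code does not depend on host byte order. */
static uint16_t rd16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 0 on success, -1 if buf is not a little-endian 32-bit ELF header. */
int parse_elf32_header(const uint8_t *buf, size_t len, struct elf32_info *out)
{
    if (len < 52 || memcmp(buf, "\x7f" "ELF", 4) != 0) return -1;
    if (buf[4] != 1 || buf[5] != 1) return -1;   /* ELFCLASS32, ELFDATA2LSB */

    out->entry     = rd32le(buf + 24);
    out->phoff     = rd32le(buf + 28);
    out->shoff     = rd32le(buf + 32);
    out->flags     = rd32le(buf + 36);
    out->ehsize    = rd16le(buf + 40);
    out->phentsize = rd16le(buf + 42);
    out->phnum     = rd16le(buf + 44);
    out->shentsize = rd16le(buf + 46);
    out->shnum     = rd16le(buf + 48);
    out->shstrndx  = rd16le(buf + 50);
    return 0;
}
```

For the first 52 bytes of the hexdump this gives entry = 0x1000, shoff = 0x11c0, shentsize = 0x28 and shnum = 6, so section header i starts at file offset shoff + i * shentsize.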

So if I look at the start of each section header (there are shnum of them, beginning at shoff, each shentsize bytes long), the first 20 bytes of each are:

0x11C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
0x1210 21 00 00 00 03 00 00 70 00 00 00 00 00 00 00 00 1c 10 00 00
0x1238 01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 30 10 00 00
0x1260 09 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 30 11 00 00
0x1288 11 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 8c 11 00 00

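The section header at 0x1260 has type strtab and file offset 0x1130. The string table is broken into null-terminated strings until you hit a double null (its length is also given by the size field of the section header, 0x5c here):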

[0] 00
[1] 73 6f 2e 6f 00 so.o
[2] 24 61 00 $a
[3] 5f 5f 62 73 73 5f 73 74 61 72 74 5f 5f 00 __bss_start__
[4] 5f 5f 62 73 73 5f 65 6e 64 5f 5f 00 __bss_end__
[5] 5f 5f 62 73 73  5f 73 74 61 72 74 00 __bss_start
[6] 6d 61 69 6e 00 main
...

main is at offset 0x115F in the file, which is offset 0x2F within the strtab.
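
Finding that offset by walking the string table might look like the sketch below. The function name strtab_find is made up for this example, and the walk assumes the name starts its own string; linkers may also let one symbol name point into the tail of another string, which this simple scan would not find.

```c
#include <stddef.h>
#include <string.h>

/* Walk a table of null-terminated strings and return the offset of `name`
   inside it, or -1 if no string in the table equals `name`. */
long strtab_find(const char *tab, size_t size, const char *name)
{
    size_t off = 0;
    while (off < size) {
        const char *end = memchr(tab + off, '\0', size - off);
        if (end == NULL) break;                 /* unterminated tail: stop */
        if (strcmp(tab + off, name) == 0) return (long)off;
        off = (size_t)(end - tab) + 1;          /* step past the terminator */
    }
    return -1;
}
```

With the table contents listed above, looking up "main" returns 0x2f, which is the value to search for in the name field of the symbol table entries.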

The section header at 0x1238 is the symtab; it starts at file offset 0x1030, with 0x10 (16) bytes per entry.

00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001040  00 00 00 00 00 10 00 00  00 00 00 00 03 00 01 00  |................|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 03 00 02 00  |................|
00001060  01 00 00 00 00 00 00 00  00 00 00 00 04 00 f1 ff  |................|
00001070  06 00 00 00 00 10 00 00  00 00 00 00 00 00 01 00  |................|
00001080  18 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
00001090  09 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010a0  17 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |................|
000010b0  55 00 00 00 00 10 00 00  00 00 00 00 10 00 01 00  |U...............|
000010c0  23 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |#...............|
000010d0  2f 00 00 00 08 10 00 00  00 00 00 00 10 00 01 00  |/...............|
000010e0  34 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |4...............|
000010f0  3c 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |<...............|
00001100  43 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |C...............|
00001110  48 00 00 00 00 00 08 00  00 00 00 00 10 00 01 00  |H...............|
00001120  4f 00 00 00 1c 10 01 00  00 00 00 00 10 00 01 00  |O...............|

The entry at 000010d0 starts with 2f 00 00 00, the 0x2f offset into the string table, so this is main. From this entry, the address is 08 10 00 00, or 0x1008 in the processor's memory. Unfortunately, due to the values I chose, it happens to also be the file offset; don't confuse the two.
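
Decoding a 16-byte symbol entry like this one can be sketched as follows, again assuming little-endian ELF32. The struct sym32 and the function names are invented for this example; the field order mirrors Elf32_Sym.

```c
#include <stddef.h>
#include <stdint.h>

/* One ELF32 symbol table entry (same field order as Elf32_Sym). */
struct sym32 {
    uint32_t name;    /* offset of the symbol name in the string table */
    uint32_t value;   /* address of the symbol in target memory */
    uint32_t size;
    uint8_t  info, other;
    uint16_t shndx;   /* index of the section the symbol lives in */
};

static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode the 16-byte little-endian symbol entry at p. */
struct sym32 decode_sym32(const uint8_t *p)
{
    struct sym32 s;
    s.name  = rd32le(p);
    s.value = rd32le(p + 4);
    s.size  = rd32le(p + 8);
    s.info  = p[12];
    s.other = p[13];
    s.shndx = (uint16_t)(p[14] | (p[15] << 8));
    return s;
}

/* Index of the first entry whose name field equals name_off, or -1. */
long find_sym_by_name_off(const uint8_t *symtab, size_t size, uint32_t name_off)
{
    for (size_t i = 0; i + 16 <= size; i += 16)
        if (rd32le(symtab + i) == name_off)
            return (long)(i / 16);
    return -1;
}
```

For the entry at 000010d0 this yields name = 0x2f, value = 0x1008 and shndx = 1, so main lives in section 1 at address 0x1008.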

The symbol belongs to section 1 (the 01 00 at the end of its entry), and that section is of type 00000001, PROGBITS:

0x11E8 1b 00 00 00 01 00 00 00 06 00 00 00 00 10 00 00 00 10 00 00
offset 0x1000 in the file, 0x1C bytes long

here is the program, the machine code.

00001000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2
00001010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11

So main is at memory address 0x1008, which is 8 bytes past the section's address 0x1000 (also the entry point here; unfortunately I picked a bad address to use), and we need to go 0x8 bytes into this data:

01 10 81 e2 01 20 82 e2

00001008 <main>:
    1008:   e2811001    add r1, r1, #1
    100c:   e2822001    add r2, r2, #1
    1010:   e2833001    add r3, r3, #1
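
The address-to-file-offset step is simple arithmetic on three fields of the containing section's header. A minimal sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Translate an address in target memory into a file offset, given the
   address, file offset and size of the section that contains it.
   Returns -1 if the address lies outside the section. */
int64_t vaddr_to_file_offset(uint32_t vaddr, uint32_t sh_addr,
                             uint32_t sh_offset, uint32_t sh_size)
{
    if (vaddr < sh_addr || vaddr - sh_addr >= sh_size)
        return -1;
    return (int64_t)sh_offset + (vaddr - sh_addr);
}
```

For this file, vaddr_to_file_offset(0x1008, 0x1000, 0x1000, 0x1c) gives 0x1008, the same number as the address only because the section's address and file offset happen to coincide. With a section loaded at 0x20000000 and stored at an assumed file offset of 0x10000, the same symbol offset would land at 0x10008.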

This is all very file-format dependent. The CPU couldn't care less about labels; main only means something to humans, not the CPU.

If I convert the elf into other formats which are perfectly executable:

motorola s record:

S00A0000736F2E7372656338
S1131000000000EBFEFFFFEA011081E2012082E212
S10F1010013083E2014084E2FAFFFFEAB1
S9031000EC

raw binary image

hexdump -C so.bin
00000000  00 00 00 eb fe ff ff ea  01 10 81 e2 01 20 82 e2  |............. ..|
00000010  01 30 83 e2 01 40 84 e2  fa ff ff ea              |.0...@......|
0000001c

The instruction bytes of interest are of course there, but the symbol information isn't. Whether you can 1) find "main" and then 2) print out the first few bytes at that address depends on the file format you are interested in.

Hmm, a bit disturbing: if you link for 0x2000, GNU ld burns some disk space and puts the code at file offset 0x2000; choose 0x20000000 and it burns more disk space, but not nearly as much as that address would suggest:

000100d0  2f 00 00 00 08 00 00 20  00 00 00 00 10 00 01 00 

This shows that the address in target space is 0x20000008, while the code sits near file offset 0x010010 (the dump below starts at that offset, with the add r3 instruction):

00010010  01 30 83 e2 01 40 84 e2  fa ff ff ea 41 11 00 00
00010020  00 61 65 61 62 69 00 01  07 00 00 00 08 01

This is just to demonstrate and reinforce that the file offset and the target memory address are two different things.

This is a very nice format for what you want to do:

arm-none-eabi-objcopy -O symbolsrec so.elf so.srec
cat so.srec
$$ so.srec
  $a $20000000
  _bss_end__ $2001001c
  __bss_start__ $2001001c
  __bss_end__ $2001001c
  _start $20000000
  __bss_start $2001001c
  main $20000008
  __end__ $2001001c
  _edata $2001001c
  _end $2001001c
  _stack $80000
  __data_start $2001001c
$$ 
S0090000736F2E686578A1
S31520000000000000EBFEFFFFEA011081E2012082E200
S31120000010013083E2014084E2FAFFFFEA9F
S70520000000DA
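
Pulling a symbol's address out of the symbol lines of such a file can be sketched with sscanf. The function name symbolsrec_lookup is made up for this example, and the line format "  name $hexaddr" is taken from the listing above rather than from a format specification.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of the form "  name $hexaddr".
   Returns 1 and stores the address in *addr if the line defines `want`;
   returns 0 for any other line (other symbols, "$$" markers, S-records). */
int symbolsrec_lookup(const char *line, const char *want, unsigned long *addr)
{
    char name[128];
    unsigned long value;

    if (sscanf(line, " %127s $%lx", name, &value) != 2)
        return 0;
    if (strcmp(name, want) != 0)
        return 0;
    *addr = value;
    return 1;
}
```

Calling it on each line read with fgets until it returns 1 for "main" gives 0x20000008 for the file above. The S3 data records that follow carry their own load addresses, so the bytes at that address can then be located without any ELF parsing.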
old_timer
  • Sorry, this question was not about offsets or labels; it was about disassembly. For fixed-length instruction sets like these 32-bit ARM instructions you can disassemble linearly, but for variable-length ones, even Thumb with Thumb-2 extensions, you have to go in EXECUTION order, not linear address order, following multiple paths. – old_timer Mar 07 '18 at 18:26