
So I've started learning about machine language today. I wrote a basic "Hello World" program in C which prints "Hello, world!" ten times using a for loop. I then used the GNU Debugger to disassemble main and look at the code in machine language (my computer has an x86 processor and I've set gdb up to use Intel syntax):

user@PC:~/Path/To/Code$ gdb -q ./a.out
Reading symbols from ./a.out...done.
(gdb) list
1      #include <stdio.h>
2
3      int main()
4      {
5           int i;
6           for(i = 0; i < 10; i++) {
7                printf("Hello, world!\n");
8           }
9           return 0;
10      } 
(gdb) disassemble main
Dump of assembler code for function main:
  0x0804841d <+0>:     push    ebp
  0x0804841e <+1>:     mov     ebp,esp
  0x08048420 <+3>:     and     esp,0xfffffff0
  0x08048423 <+6>:     sub     esp,0x20
  0x08048426 <+9>:     mov     DWORD PTR [esp+0x1c],0x0
  0x0804842e <+17>:    jmp     0x8048441 <main+36>
  0x08048430 <+19>:    mov     DWORD PTR [esp],0x80484e0
  0x08048437 <+26>:    call    0x80482f0 <puts@plt>
  0x0804843c <+31>:    add     DWORD PTR [esp+0x1c],0x1
  0x08048441 <+36>:    cmp     DWORD PTR [esp+0x1c],0x9
  0x08048446 <+41>:    jle     0x8048430 <main+19>
  0x08048448 <+43>:    mov     eax,0x0
  0x0804844d <+48>:    leave
  0x0804844e <+49>:    ret
End of assembler dump.
(gdb) x/s 0x80484e0
0x80484e0: "Hello, world!"

I understand most of the machine code and what each of the commands does. If I understood it correctly, the address "0x80484e0" is loaded into the esp register so that the program can use the memory at this address. I examined the address, and to no surprise it contained the desired string. My question now is: how did that string get there in the first place? I can't find a part in the program that sets the string up at this location.

I also don't understand something else: when I first start the program, eip points to the instruction where the variable i is initialized at [esp+0x1c]. However, the address that esp points to is changed later on in the program (to 0x80484e0), but [esp+0x1c] is still used for "i" after that change. Shouldn't the address [esp+0x1c] change when the address esp points to changes?

  • It's in your binary; when the OS started your program it got loaded into memory, just like the machine code for your program. Note that `[esp]` is not the same as `esp`: the former accesses memory and does not change `esp` itself. – Jester Mar 06 '17 at 20:04
  • It loads into the memory. I'm presuming the memory address of that string starts at `0x8048441` and ends at `0x80484e0`. Keep in mind a string is a list of integers. – Xorifelse Mar 06 '17 at 20:05
  • @Jester Ah, so "mov DWORD PTR [esp],0x80484e0" doesn't actually make esp point to a new address, but just writes 0x80484e0 to the address it currently points to? – Keno Goertz Mar 06 '17 at 20:14
  • Yes, that's exactly what it does. The arguments are passed on the stack in this calling convention. That's gonna be the argument to `puts`. – Jester Mar 06 '17 at 20:15
  • The program does not set up the string at that location, it is the compile/link process which puts `"Hello, world!\n"` into the "read-only-data" section, being a *string literal*. You might find this [possibly duplicate question](http://stackoverflow.com/questions/2589949/c-string-literals-where-do-they-go) of interest. – Weather Vane Mar 06 '17 at 20:18
  • You can use `objdump -d -M intel -s a.out` to see also the ".data" section where the string is stored (edit: NOT in your case, goes to ".text" with code). This part of executable is then loaded into memory as binary block before first instruction of executable is run, so the code will find that part of memory set up with values from the source. (actually the string is in code segment, so that objdump will be a bit more funny, than when you do that with simple human-like ASM "hello world" example... try it anyway :) ) – Ped7g Mar 07 '17 at 03:29

2 Answers

A binary, or program, is made up of both machine code and data. In this case the compiler took the string you put in the source code, which is just bytes, and because of how it was used treated it as read-only data. Depending on the compiler, that might land in .rodata, .text, or a section with some other name; gcc would probably call it .rodata. The program itself is in .text.

The linker comes along and, when it links things, finds a place for .text, .data, .bss, .rodata, and any other items you may have, and then connects the dots. In the case of your call to printf (which the compiler turned into a call to puts here), the linker knows where it put the string, the array of bytes, and it was told what its name was (some internal temporary name, no doubt). The code making the call was told about that name too, so the linker patches up the instruction that grabs the address of the string before the call.

Disassembly of section .text:

0000000000400430 <main>:
  400430:   53                      push   %rbx
  400431:   bb 0a 00 00 00          mov    $0xa,%ebx
  400436:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40043d:   00 00 00 
  400440:   bf e4 05 40 00          mov    $0x4005e4,%edi
  400445:   e8 b6 ff ff ff          callq  400400 <puts@plt>
  40044a:   83 eb 01                sub    $0x1,%ebx
  40044d:   75 f1                   jne    400440 <main+0x10>
  40044f:   31 c0                   xor    %eax,%eax
  400451:   5b                      pop    %rbx
  400452:   c3                      retq   
  400453:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40045a:   00 00 00 
  40045d:   0f 1f 00                nopl   (%rax)



Disassembly of section .rodata:

00000000004005e0 <_IO_stdin_used>:
  4005e0:   01 00                   add    %eax,(%rax)
  4005e2:   02 00                   add    (%rax),%al
  4005e4:   48                      rex.W
  4005e5:   65 6c                   gs insb (%dx),%es:(%rdi)
  4005e7:   6c                      insb   (%dx),%es:(%rdi)
  4005e8:   6f                      outsl  %ds:(%rsi),(%dx)
  4005e9:   2c 20                   sub    $0x20,%al
  4005eb:   77 6f                   ja     40065c <__GNU_EH_FRAME_HDR+0x68>
  4005ed:   72 6c                   jb     40065b <__GNU_EH_FRAME_HDR+0x67>
  4005ef:   64 21 00                and    %eax,%fs:(%rax)

The compiler will have encoded this instruction but probably left the address as zeros, or some other fill value

  400440:   bf e4 05 40 00          mov    $0x4005e4,%edi

so that the linker could fill it in later. The GNU disassembler also attempts to disassemble the .rodata (and .data, etc.) blocks, which doesn't make sense, so ignore the instructions it shows there: it is trying to interpret your string, which starts at address 0x4005e4, as code.

Before linking, a disassembly of the object file shows the same two kinds of section, code and read-only data (named .text.startup and .rodata.str1.1 here):

Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   53                      push   %rbx
   1:   bb 0a 00 00 00          mov    $0xa,%ebx
   6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   d:   00 00 00 
  10:   bf 00 00 00 00          mov    $0x0,%edi
  15:   e8 00 00 00 00          callq  1a <main+0x1a>
  1a:   83 eb 01                sub    $0x1,%ebx
  1d:   75 f1                   jne    10 <main+0x10>
  1f:   31 c0                   xor    %eax,%eax
  21:   5b                      pop    %rbx
  22:   c3                      retq   

0000000000000000 <.rodata.str1.1>:
   0:   48                      rex.W
   1:   65 6c                   gs insb (%dx),%es:(%rdi)
   3:   6c                      insb   (%dx),%es:(%rdi)
   4:   6f                      outsl  %ds:(%rsi),(%dx)
   5:   2c 20                   sub    $0x20,%al
   7:   77 6f                   ja     78 <main+0x78>
   9:   72 6c                   jb     77 <main+0x77>
   b:   64 21 00                and    %eax,%fs:(%rax)

Unlinked, the instruction just has a placeholder in this address/offset field for the linker to fill in later:

  10:   bf 00 00 00 00          mov    $0x0,%edi

Also note that the object contains only the string in .rodata. Linking with libraries and other items to make a complete program clearly added more .rodata, but the linker manages all of that.

Perhaps this is easier to see with this example:

void more_fun ( unsigned int, unsigned int, unsigned int );

unsigned int a;
unsigned int b=5;
const unsigned int c=7;

void fun ( void )
{
    more_fun(a,b,c);
}

Disassembled as an object:

Disassembly of section .text:

0000000000000000 <fun>:
   0:   8b 35 00 00 00 00       mov    0x0(%rip),%esi        # 6 <fun+0x6>
   6:   8b 3d 00 00 00 00       mov    0x0(%rip),%edi        # c <fun+0xc>
   c:   ba 07 00 00 00          mov    $0x7,%edx
  11:   e9 00 00 00 00          jmpq   16 <fun+0x16>

Disassembly of section .data:

0000000000000000 <b>:
   0:   05                      .byte 0x5
   1:   00 00                   add    %al,(%rax)
    ...

Disassembly of section .rodata:

0000000000000000 <c>:
   0:   07                      (bad)  
   1:   00 00                   add    %al,(%rax)
    ...

For whatever reason you have to link it to see the .bss section. The point of the example is that the machine code for the function is in .text, the uninitialized global is in .bss, the initialized global is in .data, and the const initialized global is in .rodata. The compiler was smart enough to know that a const, even if it is global, won't change, so it can hardcode that value into the instruction and not need to read it from RAM. The other two variables it has to read from RAM, so it generates instructions with the address field zeroed, to be filled in by the linker at link time.

In your case the read-only/const data was a collection of bytes, and it wasn't a math operation, so the bytes as defined in your source file were placed in memory so they could be pointed at as the first parameter to printf.

There is more to a binary than just machine code. The compiler and linker can have things placed in memory for the machine code to use; the machine code itself does not have to write every value that will be used by the rest of the machine code.

old_timer
  • It would be nicer to display object section rodata as data, `dw 1,2` `db "Hello, World!"`, instead of using disassembly of text which produced odd ball sequence of instructions. – rcgldr Mar 07 '17 at 16:01
  • Sure, but then the code is not as clean to read (-S instead of -c on the compile). It is important to understand it is just bytes, an array of bytes... the processor doesn't see ASCII... – old_timer Mar 07 '17 at 16:16
  • I assume someone above already mentioned -S to see what was going on... if not, then compile with -S or use -save-temps and the assembly that is fed to the assembler won't get deleted. – old_timer Mar 07 '17 at 16:17

The compiler 'hard wires' the string into the object code and the linker then 'hard wires' it into the machine code.

Note that the string is embedded alongside the code, in read-only memory, and not stored in a writable data area, meaning that if you took a pointer to the string and attempted to change it you would get an exception (typically a segmentation fault).

Quandon