
I am trying to learn about assembly, compilers (LLVM), and lifters.

So far I can only write plain assembly code, using NASM.

Below is my assembly code.

section .data
hello_string db "Hello World!", 0x0d, 0x0a
hello_string_len equ $ - hello_string

section .text
global _start

_start:
    mov eax, 4                ; eax <- 4, syscall number (write)
    mov ebx, 1                ; ebx <- 1, syscall argument 1 (stdout)
    mov ecx, hello_string     ; ecx <- hello_string, syscall argument 2 (string ptr)
    mov edx, hello_string_len ; edx <- hello_string_len, syscall argument 3 (string len)
    int 0x80                  ; syscall
    mov eax, 1                ; eax <- 1, syscall number (exit)
    mov ebx, 0                ; ebx <- 0, syscall argument 1 (return value)
    int 0x80                  ; syscall

;nasm -felf32 hello.x86.s -o hello.o
;ld -m elf_i386 hello.o -o hello.out

Then I checked the object file.

Here I can't find any function. I also understand that the call and ret instructions are just combinations of simpler operations.

$readelf -s hello.o
Symbol table '.symtab' contains 7 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 00000000     0 FILE    LOCAL  DEFAULT  ABS hello.x86.s
     2: 00000000     0 SECTION LOCAL  DEFAULT    1 
     3: 00000000     0 SECTION LOCAL  DEFAULT    2 
     4: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 hello_string
     5: 0000000e     0 NOTYPE  LOCAL  DEFAULT  ABS hello_string_len
     6: 00000000     0 NOTYPE  GLOBAL DEFAULT    2 _start

But if I compile a C program and check that object file with readelf, then I can find a "function".

P.S

$readelf -s function.o | grep FUNC
     3: 0000000000000000    18 FUNC    GLOBAL DEFAULT    2 add
     4: 0000000000000020    43 FUNC    GLOBAL DEFAULT    2 main

Here I can see what a function is.

How is a function different from a NOTYPE label?

JaeIL Ryu
  • 3
    Right, assembly language doesn't truly have functions, just the tools to implement that concept (e.g. jump and store a return address somewhere = call, indirect jump to a return address = ret). The model of execution is purely sequential and local, one instruction at a time (on most ISAs, but some ISAs are VLIW and execute 3 at a time for example, but still local in scope), with each instruction just making a well-defined change to the architectural state. Is that what you're asking? – Peter Cordes Jun 29 '20 at 09:25
  • You're about to do the course. They will teach you. It is beyond absurd to expect an answer to a question this broad in a single SO question. – user207421 Jun 29 '20 at 10:08
  • @PeterCordes if i write c codes, then my everything instructions compiled from code are in function? it is my question. – JaeIL Ryu Jun 29 '20 at 10:17
  • @MarquisofLorne i think it is just Yes or No questions.. and my guess, it is yes in C. every instructions will be in function and executed by call instruction. – JaeIL Ryu Jun 29 '20 at 10:17
  • 1
    Right, a C compiler won't put any instructions outside of functions. (Technically the `_start` entry point that indirectly calls `main` isn't a function; it can't return and has to make an `exit` system call, but that's written in asm and is part of libc. It's not generated by the C compiler proper.) – Peter Cordes Jun 29 '20 at 10:25
  • 1
    I can see at least three questions in your post, and none of them have simple yes/no answers. – user207421 Jun 29 '20 at 11:59
  • 1
    ELF symbol metadata can be set by some assemblers, e.g. in NASM, `global main:function` to mark the symbol type as FUNC. (https://nasm.us/doc/nasmdoc8.html#section-8.9.5). The GAS syntax equivalent (which C compilers emit) is `.type main, function`. e.g. put some code on https://godbolt.org/ and disable filtering to see asm directives in compiler output. But note this is just metadata for linkers and debuggers to use; the CPU doesn't see that when executing. That's why nobody bothers with it for NASM examples. – Peter Cordes Jun 30 '20 at 08:51
  • @PeterCordes very very very very thanks. it is very helpful to me. – JaeIL Ryu Jun 30 '20 at 09:00

2 Answers

3

ELF symbol metadata can be set by some assemblers, e.g. in NASM, `global main:function` to mark the symbol type as FUNC. (https://nasm.us/doc/nasmdoc8.html#section-8.9.5).

The GAS syntax equivalent (which C compilers emit) is `.type main, @function` (on ARM, where `@` starts a comment, it is spelled `%function`). E.g. put some code on https://godbolt.org and disable filtering to see asm directives in compiler output.

But note this is just metadata for linkers and debuggers to use; the CPU doesn't see that when executing. That's why nobody bothers with it for NASM examples.
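To make the "just metadata" point concrete: in an ELF symbol table entry, the type that readelf prints as NOTYPE or FUNC is only the low four bits of the one-byte `st_info` field, and the binding (LOCAL, GLOBAL) is the high four bits. Below is a minimal sketch of that decoding. The helper names `sym_type_name` and `sym_bind_name` are made up for illustration; the numeric constants are the standard ELF values.

```c
/* Standard ELF symbol type values (low 4 bits of st_info). */
enum { STT_NOTYPE = 0, STT_OBJECT = 1, STT_FUNC = 2, STT_SECTION = 3, STT_FILE = 4 };
/* Standard ELF symbol binding values (high 4 bits of st_info). */
enum { STB_LOCAL = 0, STB_GLOBAL = 1, STB_WEAK = 2 };

/* Hypothetical helper: printable name of the symbol type in st_info. */
const char *sym_type_name(unsigned char st_info)
{
    switch (st_info & 0x0f) {
    case STT_NOTYPE:  return "NOTYPE";
    case STT_OBJECT:  return "OBJECT";
    case STT_FUNC:    return "FUNC";
    case STT_SECTION: return "SECTION";
    case STT_FILE:    return "FILE";
    default:          return "OTHER";
    }
}

/* Hypothetical helper: printable name of the symbol binding in st_info. */
const char *sym_bind_name(unsigned char st_info)
{
    switch (st_info >> 4) {
    case STB_LOCAL:  return "LOCAL";
    case STB_GLOBAL: return "GLOBAL";
    case STB_WEAK:   return "WEAK";
    default:         return "OTHER";
    }
}
```

With this, an `st_info` byte of 0x10 decodes as GLOBAL NOTYPE (like `_start` in the question's hello.o), and 0x12 decodes as GLOBAL FUNC (like `add` and `main` in function.o). The instruction bytes the symbol points at are the same either way.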


Assembly language doesn't truly have functions, just the tools to implement that concept, e.g. jump and store a return address somewhere = call, indirect jump to a return address = ret. On x86, return addresses are pushed and popped on the stack.

The model of execution is purely sequential and local, one instruction at a time (on most ISAs, but some ISAs are VLIW and execute 3 at a time for example, but still local in scope), with each instruction just making a well-defined change to the architectural state. The CPU itself doesn't know or care that it's "in a function" or anything about nesting, other than the return-address predictor stack which optimistically assumes that ret will actually use a return address pushed by a corresponding call. But that's a performance optimization; you do sometimes get mismatched call/ret if code is doing something weird (e.g. a context switch).
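As a rough illustration of "call = store a return address and jump, ret = jump to the stored address", here is a toy interpreter in C. It is only a sketch: the four-opcode instruction set, the names `run_toy_machine` and `demo_call_ret`, and the fixed 16-slot stack are all invented for this example and do not model any real ISA.

```c
#include <stddef.h>

enum opcode { OP_ADD, OP_CALL, OP_RET, OP_HALT };

struct insn {
    enum opcode op;
    int arg;              /* immediate for OP_ADD, target index for OP_CALL */
};

#define STACK_SLOTS 16

/* Runs prog (n instructions) one instruction at a time. Returns the
 * accumulator at OP_HALT, or -1 if the program is malformed. */
int run_toy_machine(const struct insn *prog, size_t n)
{
    size_t stack[STACK_SLOTS];   /* holds return addresses only */
    size_t sp = 0;
    size_t pc = 0;
    int acc = 0;

    while (pc < n) {
        struct insn i = prog[pc++];   /* pc now names the next instruction */
        switch (i.op) {
        case OP_ADD:
            acc += i.arg;
            break;
        case OP_CALL:                 /* "call" = push return address + jump */
            if (sp == STACK_SLOTS || i.arg < 0)
                return -1;
            stack[sp++] = pc;
            pc = (size_t)i.arg;
            break;
        case OP_RET:                  /* "ret" = pop an address + jump to it */
            if (sp == 0)
                return -1;
            pc = stack[--sp];
            break;
        case OP_HALT:
            return acc;
        }
    }
    return -1;                        /* ran off the end without OP_HALT */
}

/* Example program: index 3 plays the role of a "function" that adds 10. */
int demo_call_ret(void)
{
    static const struct insn prog[] = {
        { OP_CALL, 3 },   /* 0: call the code at index 3 */
        { OP_ADD,  1 },   /* 1: execution resumes here after the ret */
        { OP_HALT, 0 },   /* 2 */
        { OP_ADD, 10 },   /* 3: the "function body" */
        { OP_RET,  0 },   /* 4 */
    };
    return run_toy_machine(prog, sizeof prog / sizeof prog[0]);
}
```

Nothing in the interpreter knows where the "function" at index 3 begins or ends. The machine only ever sees the current instruction and the state it changes, which is the point made above about real CPUs.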

A C compiler won't put any instructions outside of functions.

(Technically, the `_start` entry point that indirectly calls `main` isn't a function: it can't return and has to make an `exit` system call. But it's written in asm and is part of libc; it's not generated by the C compiler proper, only linked with the C compiler's output to make a working program.) See "Linux x86 Program Start Up or - How the heck do we get to main()?" for example.

Peter Cordes
1

First off, assembly language is specific to the assembler, the tool that reads it, not to the target (arm, x86, mips, etc).

Function names are basically labels, which means addresses. Outside high-level languages there is no real notion of functions, variable types (unsigned int, float, boolean, etc), or address vs data vs instructions. Assembly generally has no real notion of these concepts, because they don't exist at that level. When computing an offset into a struct in order to access some item, the base address and the offset are just numbers when the add happens; they are neither addresses nor offsets. The result is only an address for the brief moment that the accessing instruction executes, the one clock cycle when the address is latched and sent through logic toward the bus. Otherwise it's just bits.
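The "base plus offset is just an add of two numbers" idea can be sketched in C by doing the member access by hand. The struct `item` and the helpers `load_u32` and `demo_struct_offset` are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct item {
    uint32_t id;
    uint32_t count;
};

/* Reads 4 bytes at base + offset. The add itself knows nothing about
 * structs or members; it only combines two numbers. */
uint32_t load_u32(const void *base, size_t offset)
{
    uint32_t value;
    memcpy(&value, (const unsigned char *)base + offset, sizeof value);
    return value;
}

/* Equivalent to reading it.count, spelled out as base + offset. */
uint32_t demo_struct_offset(void)
{
    struct item it = { 0x12345678u, 42u };
    return load_u32(&it, offsetof(struct item, count));
}
```

The compiler does the same arithmetic for `it.count`; the type information that made the offset meaningful is gone by the time the load instruction exists.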

That said, some assembly languages have declarations that use words like FUNCTION or PROCEDURE, but these are not necessarily like high-level languages, where you have clearly divided boundaries.

And then there is compiler-generated code vs hand-written code; with hand-written code there is no expectation of these notions of boundaries.

unsigned int fun0 ( void )
{
    return(0x12345678);
}
void fun1 ( unsigned int y )
{
    static unsigned int x=5;
    x=x+y;
}

A particular compiler and command line produces this (disassembly of the compiled and assembled output):

Disassembly of section .text:

00000000 <fun0>:
   0:   4800        ldr r0, [pc, #0]    ; (4 <fun0+0x4>)
   2:   4770        bx  lr
   4:   12345678

00000008 <fun1>:
   8:   4902        ldr r1, [pc, #8]    ; (14 <fun1+0xc>)
   a:   680a        ldr r2, [r1, #0]
   c:   1810        adds    r0, r2, r0
   e:   6008        str r0, [r1, #0]
  10:   4770        bx  lr
  12:   46c0        nop         ; (mov r8, r8)
  14:   00000000 

Disassembly of section .data:

00000000 <fun1.x>:
   0:   00000005

The function names are simply labels, which means they are just addresses. From the processor's perspective there is no notion of a label, much less a function.

So from this view, what is your definition of the boundary of the function? Does it end at the return? If so, then there are items belonging to the function that sit after its return (the literal-pool words at offsets 4 and 0x14 above). And the static local (a global with local scope) is clearly in the .data section, which is well outside the function.
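The static local is easy to observe from C as well. The sketch below is a variant of `fun1` that returns the updated value (the name `accumulate` and the return value are additions for illustration, not part of the original code), so the storage that lives outside the function becomes visible across calls.

```c
/* Variant of fun1 above that also returns x. The variable x is
 * initialized once and keeps its value between calls, because its
 * storage is in .data rather than in the function's stack frame. */
unsigned int accumulate(unsigned int y)
{
    static unsigned int x = 5;
    x = x + y;
    return x;
}
```

Calling `accumulate(1)` and then `accumulate(2)` yields 6 and then 8: the second call sees the result of the first, even though nothing was passed between them.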

    .globl  fun0
    .p2align    2
    .type   fun0,%function
    .code   16
    .thumb_func
fun0:
    .fnstart
    ldr r0, .LCPI0_0
    bx  lr
    .p2align    2
.LCPI0_0:
    .long   305419896
.Lfunc_end0:
    .size   fun0, .Lfunc_end0-fun0
    .cantunwind
    .fnend

If you look at clang's output, which is aimed basically at the gnu assembler (so this is gnu assembler assembly language), you see the notion of a function, likely for debugger purposes. None of it means anything to the processor; there is no such notion there, nor really in the assembler.

    .type   fun0,%function

Because this is arm, this serves perhaps as a function definition for high-level concepts, but it is also crucial for arm/thumb interworking, so that the linker generates the right addresses. It basically tells the assembler to tell the linker that this label is a function label. In this context, a thumb function label's address is the address ORRed with 1, and an arm function label's address is ORRed with zero, i.e. unmodified.

They double dipped here because

    .thumb_func
fun0:

also takes care of the ORRed-with-one thing. The .type , function directive likely adds debugger info for users who want the illusion of debugging a function when they think they are using a debugger on high-level code.
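The ORRed-with-one convention described above can be modeled with a few lines of C. This is only an arithmetic sketch of the address tagging, with made-up helper names (`thumb_symbol_value`, `targets_thumb`, `instruction_address`); it is not what any linker literally runs.

```c
#include <stdint.h>

/* Value a linker would hand out for a thumb function label:
 * the label's address with bit 0 set. */
uint32_t thumb_symbol_value(uint32_t label_addr)
{
    return label_addr | 1u;
}

/* An interworking branch such as bx uses bit 0 of the target to
 * choose thumb (1) or arm (0) state. */
int targets_thumb(uint32_t branch_target)
{
    return (branch_target & 1u) != 0;
}

/* Where the instructions actually start: bit 0 is a mode flag,
 * not part of the fetch address. */
uint32_t instruction_address(uint32_t branch_target)
{
    return branch_target & ~UINT32_C(1);
}
```

For `fun1` at offset 0x8 in the disassembly above, the thumb symbol value would be 0x9, while the instructions are still fetched from 0x8.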

If you remove

.fnstart
.fnend

nothing bad happens

And for thumb you can remove the .type , function directive too. Nobody notices, other than perhaps folks using tools related to the high-level language (debuggers, etc); the generated code is fine and works fine. (Arm mode does not have an .arm_func equivalent, so you have to use .type , function to get the linker to work right.)

Outside arm and maybe mips (which has a 32/16-bit instruction set mix as well), I don't know if you would even need to care about those kinds of directives when producing working code.

Here again, assembly is specific to the assembler. A compiler that generates assembly (gnu and others do; it is the sane model to use a toolchain) obviously needs to generate it for a specific assembler, and is bound by the features there. Users have developed expectations, like the illusion of single stepping through high-level language code and other debugging at the high-level language level, rather than reality. The tools have evolved to provide more debug information within the toolchain (compile, assemble, link), so that the final binary, depending on build options, can carry that debug info (and so that, where needed, the code can be left unoptimized so that the debug view works).

On your other questions: inline assembly is a compiler-specific feature, not necessarily part of a high-level language standard. And it is not real assembly, or let's say it is a new assembly language: the compiler is the tool, so it can and does vary from the assembly language of the toolchain's assembler. But many compilers, depending on the language, support some form of inline assembly (with no reason to expect it to be compatible across compilers), so in that context you can put instructions into your C code. It's an act of desperation, but technically possible.

LLVM's IR or bytecode is its own instruction set and language, completely separate from both the target and the high-level language. Sane compiler designs have some form of internal structures/code to keep track of the compiled code on its way toward the target output (often assembly language or machine code); it's a whole other beast.

My understanding of llvm is that you use the compiler (clang) as your "assembler", which is disturbing, but that is how they did it. In that view I don't see it as inline assembly but as real assembly. In my experience the linker isn't built by default, so gnu's linker is used. At least with bare-metal work, the objects are compatible between llvm and gnu binutils, the assembly output from clang or llc is compatible with binutils assembly language (gnu assembler), etc. And the gnu disassembler is superior to llvm's for debugging the compiler/assembler output. llvm has made strides to do things internally and not need binutils, and if you build in the linker then you don't need binutils (for projects where you are doing the steps separately and not just clang hello.c -o hello).

halfer
old_timer