
I generated the assembly code for a C program like this:

emacs hello.c
clang -S -O hello.c -o hello.s
cat hello.s

Function names are prefixed with an underscore (e.g. callq _printf). Why is this done and what advantages does it have?


Example:

hello.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


int main() {
  char *myString = malloc(strlen("Hello, World!") + 1);
  memcpy(myString, "Hello, World!", strlen("Hello, World!") + 1);
  printf("%s", myString);
  return 0;
}

hello.s

_main:                       ; Here
Leh_func_begin0:
    pushq   %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    movl    $14, %edi
    callq   _malloc          ; Here
    movabsq $6278066737626506568, %rcx
    movq    %rcx, (%rax)
    movw    $33, 12(%rax)
    movl    $1684828783, 8(%rax)
    leaq    L_.str1(%rip), %rdi
    movq    %rax, %rsi
    xorb    %al, %al
    callq   _printf          ; Here
    xorl    %eax, %eax
    popq    %rbp
    ret
Leh_func_end0:

3 Answers


From Linkers and Loaders:

At the time that UNIX was rewritten in C in about 1974, its authors already had extensive assembler language libraries, and it was easier to mangle the names of new C and C-compatible code than to go back and fix all the existing code. Now, 20 years later, the assembler code has all been rewritten five times, and UNIX C compilers, particularly ones that create COFF and ELF object files, no longer prepend the underscore.

Prepending an underscore in the assembly results of C compilation is just a name-mangling convention that arose as a workaround. It stuck around for (as far as I know) no particular reason, and has now made its way into Clang.

Outside of assembly, the C standard library often has implementation-defined functions prefixed with an underscore, to signal "this is magic, don't touch it" to the ordinary programmers who stumble across them.

Jon Purdy
  • As for leading underscores in C source code: That's a name spacing issue, cf. section 7.1.3 of the C standard. To put it more bluntly: If your C code defines an identifier starting with two underscores or with an underscore followed by a capital letter, it is *broken.* If it defines identifiers with file scope or larger that start with an underscore, it is *broken.* Those are reserved for the compiler and standard library implementation. – Christopher Creutzig May 06 '11 at 08:33
  • Not broken, if you're writing C code for, say, your standard library implementation, though. In other words, you **really** need to know what you're doing, and if you can't explain why what you're doing is OK, you're doing it wrong. But in case you are writing the library, the compiler does not prevent you from breaking those rules, just in case anyone was wondering why. –  May 06 '11 at 10:04
  • @Lars: Unfortunately a lot of bloated egos in semi-system-level but *not standard library* code, like legacy X libs, sound libs, graphics libs, etc. think they're entitled to use underscores as if they were part of the standard library... And then some people blindly import code from various implementations of the standard library without understanding it, and keep the underscores... These usages are all definitely *broken*. – R.. GitHub STOP HELPING ICE May 06 '11 at 10:55
  • I am a fresher and not that knowledgeable about C. Can someone please explain what a broken identifier is? I am seeing these underscores everywhere and cannot understand why they are used. – Parth Shah Oct 25 '12 at 11:08
  • @ChristopherCreutzig To which version of the standard are you referring? ANSI C88 doesn't have a 7.1.3. http://flash-gordon.me.uk/ansi.c.txt Never mind, I found the C99 ISO standard 9899. – jrwren Nov 16 '12 at 17:47
  • @ParthShah: Simply don't use identifiers as described in my comment above. You're not allowed to; the compiler won't catch you if you do; it's quite likely that it will (seem to) work ok for a long time, but when it stops working (compiler upgrade or whatever), you've got a lot of work you could have avoided right away and no-one to blame except yourself. – Christopher Creutzig Nov 23 '12 at 07:17
  • @R..: If the author of a mid-level library needs to add an identifier for its own internal use, then a name which does not begin with an underscore will risk collisions with existing client code that uses that library; a name which does begin with an underscore will risk collisions with present or future versions of system libraries. Depending upon how the variety of clients compares with the variety of systems where the code will be used, the latter risk may be viewed as being less important than the former. – supercat Jan 19 '14 at 17:58

A lot of compilers used to translate C to assembly language, and then run an assembler on that to generate an object file. It's a lot easier than generating binary code directly. (AFAIK GCC still does this. But it also has its own assembler.) During this translation, function names become labels in the assembly source. If you have a function called (for example) ret, though, some assemblers can get confused and think it's an instruction rather than a label. (YASM does, for example, mostly because labels can appear pretty much anywhere and don't require colons. You have to prepend a $ if you want a label called ret.)

Prepending a character (like, say, an underscore) to the C-generated labels was a whole lot easier than writing one's own C-friendly assembler or worrying about labels clashing with assembly instructions/directives.

These days, assemblers and compilers have evolved a bit, and most people work at the C level or higher anyway. So the original need to mangle names in C is largely gone.

cHao

At first glance, the operating system is a Unix or Unix-like system running on a PC. In my view, it is not surprising to find _printf in the generated assembly. The C printf is a function that performs I/O, so it is the responsibility of the kernel and driver to carry out the requested I/O.

The path of machine instructions taken on any Unix or Unix-like OS is the following:

printf (C code)-> _printf (libc) -> trap -> kernel + driver work -> return from trap -> return from _printf (libc) -> printf completion and return -> next machine instruction in C code

In the case of this assembly code extract, it looks like the C printf was inlined by the compiler, which caused the _printf entry point to be visible in the assembly code.

To make sure the C printf does not get decorated with a prefix (an underscore in this case), it is best to search all C headers for _printf with a command like:

find /usr/include -name '*.h' -exec grep _printf {} \; -print

Bill the Lizard