All the extra time in the C version is dynamic linker and CRT startup overhead. The asm program is statically linked, and just calls exit(2) (the syscall directly, not the glibc wrapper). Of course it's faster, but that's just startup overhead; it doesn't tell you anything about how fast compiler-emitted code that actually does something will run.
i.e. if you wrote some code to actually do something in C, and compiled it with gcc -O3 -march=native, you'd expect it to be ~0.001 seconds slower than an equivalent statically linked binary with no CRT overhead. (That's assuming your hand-written asm and the compiler output were both near-optimal, e.g. if you used the compiler output as a starting point for a hand-optimized version, but didn't find anything major. It's usually possible to make some improvements to compiler output, but often only to code size, with probably not much effect on speed.)
If you want to call malloc or printf, then the startup overhead is not useless; it's actually necessary to initialize glibc's internal data structures so that library functions don't have to pay for checking that everything is initialized every time they're called.
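To illustrate the trade-off, here is a minimal sketch (not glibc's actual code) contrasting a function that checks for initialization on every call with one that relies on startup code having run already. The table, the function names, and the use of the GCC/Clang constructor attribute as a stand-in for CRT startup are all assumptions made for illustration.

```c
#include <stdbool.h>

#define TABLE_SIZE 16

/* Hypothetical tables standing in for "library internal state". */
static int lazy_squares[TABLE_SIZE];
static int eager_squares[TABLE_SIZE];

static void fill_table(int *table) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = i * i;
}

/* Lazy version: every call pays for a check and a branch. */
static bool lazy_ready = false;

int square_lazy(int i) {          /* caller must pass 0 <= i < TABLE_SIZE */
    if (!lazy_ready) {            /* this check runs on every call */
        fill_table(lazy_squares);
        lazy_ready = true;
    }
    return lazy_squares[i];
}

/* Eager version: a GCC/Clang constructor runs before main, similar in
   spirit to init functions run by CRT startup code, so the hot path
   needs no check at all. */
__attribute__((constructor))
static void init_at_startup(void) {
    fill_table(eager_squares);
}

int square_eager(int i) {         /* caller must pass 0 <= i < TABLE_SIZE */
    return eager_squares[i];      /* no initialization check */
}
```

Both functions return the same values; the difference is only where the cost is paid: once at startup, or as a small check on every call.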
From a statically linked hand-written asm program that links glibc, you need to call __libc_init_first, __dl_tls_setup, and __libc_csu_init, in that order, before you can safely use all libc functions.
Anyway, ideally you can expect a constant time difference from the startup overhead, not a factor of 2 difference.
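As a quick back-of-the-envelope check, this sketch shows why a fixed startup cost only looks like a large ratio when the program does almost nothing. The function name is hypothetical, and the ~0.001 s figure is just the rough estimate used above.

```c
/* Ratio of total run time (startup + work) to the work time alone.
   Both arguments are in seconds; work_s must be greater than zero. */
double slowdown_ratio(double startup_s, double work_s) {
    return (startup_s + work_s) / work_s;
}
```

With 0.001 s of startup overhead, a program that does 0.001 s of real work looks 2x slower, while one that does a full second of work is only about 1.001x slower. The absolute difference is the same in both cases.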
If you're good at writing optimal asm, you can usually do a better job than the compiler on a local scale, but compilers are really good at global optimizations. Moreover, they do it in seconds of CPU time (very cheap) instead of weeks of human effort (very precious).
It can make sense to hand-craft a critical loop, e.g. as part of a video encoder, but even video encoders (like x264, x265, and vpx) have most of the logic written in C or C++, and just call asm functions.
The extra push/mov/pop instructions are there because you compiled with optimization disabled, where -fno-omit-frame-pointer is the default, which makes a stack frame even for leaf functions. gcc defaults to -fomit-frame-pointer at -O1 and higher on x86 and x86-64 (since modern debug metadata formats mean a frame pointer isn't needed for debugging or exception-handling stack unwinding).
If you'd told your C compiler to make fast code (-O3), instead of to compile quickly and make dumb code that works well in a debugger (-O0), you would have gotten code like this for main (from the Godbolt compiler explorer):
// this is valid C++ and C99, but C89 doesn't have an implicit return 0 in main.
int main(void) {}
xor eax, eax
ret
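The same effect shows up on any leaf function, not just an empty main. Here is a small sketch; the function name is made up for illustration, and the asm in the comment is only roughly what gcc tends to emit for x86-64, so check your own compiler's output (e.g. on Godbolt) rather than trusting it verbatim.

```c
/* A leaf function: it calls nothing else, so it doesn't need a stack frame. */
int add_ints(int a, int b) {
    return a + b;
}

/*
 * Roughly what to expect from gcc targeting x86-64 (Intel syntax).
 * Exact output varies with compiler version and options.
 *
 *   -O0:  push rbp              ; set up a frame pointer
 *         mov  rbp, rsp
 *         ...                   ; spill args to the stack, reload, add
 *         pop  rbp
 *         ret
 *
 *   -O3:  lea  eax, [rdi+rsi]   ; no frame, args stay in registers
 *         ret
 */
```

Comparing the two outputs for a function like this is usually more informative than timing a program that exits immediately.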
To learn more about assembly and how everything works, have a look at some of the links in the x86 tag wiki. Perhaps Programming From the Ground Up would be a good start; it probably explains compilers and dynamic linking.
A much shorter article is A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux, which starts with what you did, and then gets down to having _start overlap with some other ELF headers so the file can be even smaller.