5

I am on a quest to understand low-level computing. I have noticed that my compiled binaries are a lot bigger than I think they should be. So I tried to build the smallest possible C program without any stdlib code, as follows:

void _start(void)
{
    /* Spin forever: with no libc there is nothing to return to. */
    while (1) { }
}

gcc -nostdlib -o minimal minimal.c

When I disassemble the binary, it shows me exactly what I expect, namely this exact code in three lines of assembly.

$ objdump -d minimal

minimal:     file format elf64-x86-64


Disassembly of section .text:

0000000000001000 <_start>:
    1000:   55                      push   %rbp
    1001:   48 89 e5                mov    %rsp,%rbp
    1004:   eb fe                   jmp    1004 <_start+0x4>

But my actual executable is still 13856 bytes in size. What makes it so large? What else is in that file? Does the OS need more than these 6 bytes of machine code?

Edit #1: The output of size is:

$ size -A minimal
minimal  :
section              size    addr
.interp                28     680
.note.gnu.build-id     36     708
.gnu.hash              28     744
.dynsym                24     776
.dynstr                 1     800
.text                   6    4096
.eh_frame_hdr          20    8192
.eh_frame              52    8216
.dynamic              208   16176
.comment               18       0
Total                 421
Peter Cordes
sekthor
  • 2
    An executable file contains much more info than just the code itself. It can change between different operating systems and executable file types. – Roy Avidan Aug 31 '20 at 10:35
  • 4
    If you link it as an elf, that is not low-level. – Martin James Aug 31 '20 at 10:36
  • 3
    There's some *meta-data* about the executable file, mandated by the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) standard. The compiler might add other meta-data or sections, try to [`strip`](https://man7.org/linux/man-pages/man1/strip.1.html) the file as well. – Some programmer dude Aug 31 '20 at 10:36
  • 4
    Also see e.g. [this tutorial](https://www.muppetlabs.com/~breadbox/software/tiny/teensy.html) on how to create minimal executables. – Some programmer dude Aug 31 '20 at 10:37
  • And perhaps [this question](https://stackoverflow.com/questions/53382589/smallest-executable-program-x86-64) is a duplicate. – Some programmer dude Aug 31 '20 at 10:38
  • 2
    You can look up the ELF file format and you can also open the file in a hex editor to see what's in it, and use objdump with different arguments. – user253751 Aug 31 '20 at 10:50
  • 1
    Can you post the output of `size -A minimal` ? – Mark Plotnick Aug 31 '20 at 11:43
  • 4
    If you compile a 16 bit MSDOS .COM program, with no frame pointers, you'll end up with just the code. As already commented above, for most current operating systems, there is information in addition to the compiled code. – rcgldr Aug 31 '20 at 14:44
  • @rcgldr Can you compile to a .com file? I think most C compilers can only generate .exe files, but I may be wrong. – W. Chang Aug 31 '20 at 14:53
  • @W.Chang - Note I mentioned 16-bit MSDOS .COM file. I'm not aware of any 16 bit tool set for MSDOS that doesn't include the ability to create .COM files. – rcgldr Aug 31 '20 at 14:55
  • @rcgldr Thank you! I got the wrong impression because Turbo C gave error messages like "Fatal: Cannot generate COM file: invalid entry point address". But with some tricks even Turbo C can generate .com files: https://stackoverflow.com/questions/55938131/compile-and-link-to-com-file-with-turbo-c – W. Chang Aug 31 '20 at 15:20
  • You can significantly reduce the size by stripping out exception handling (if you are not using EH), the .comment section, the .note.gnu.build-id section and more. Try `gcc -Wl,--build-id=none -fno-asynchronous-unwind-tables -Qn -nostdlib -o minimal minimal.c` – fpmurphy Aug 31 '20 at 16:22
  • @fpmurphy Building it statically with no dynamic relations would also help. – Michael Petch Sep 01 '20 at 00:24
  • @W.Chang : since I am the one who answered the post you are referring to, the OP of that question was trying to build it from the command line so you need to know exactly how to form the compile and link options. If you were to use the IDE it is very easy as you just choose the TINY model and it will generate a COM program. – Michael Petch Sep 01 '20 at 01:42
  • @MichaelPetch. My command does build it statically. `file minimal: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped` – fpmurphy Sep 01 '20 at 04:05
  • @fpmurphy Not necessarily. It will depend on the default GCC options for the distribution of Linux or other OS you are using. For example, if you build with Ubuntu 17.x+ you will likely end up with all the dynamic entries and an ELF file that is dynamically linked unless you specifically state `-static`. What distro are you using? – Michael Petch Sep 01 '20 at 04:52
  • @MichaelPetch I am on Fedora 32. The `gcc` default is dynamic; just tested it. My command builds a static. – fpmurphy Sep 01 '20 at 05:23
  • @fpmurphy It depends on the full range of options. What is the output of `gcc -v` which will contain the build options. I can tell you for an absolute fact that on Ubuntu 19.04 it will not build it statically and you will end up with extra sections. I finally tried it here. – Michael Petch Sep 01 '20 at 05:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220770/discussion-between-fpmurphy-and-michael-petch). – fpmurphy Sep 01 '20 at 05:45

3 Answers

6

Modern compilers and linkers aren't really optimized for producing ultra-small code on full-scale platforms. Not because the job is difficult, but because there's usually no need. It isn't necessarily that the compiler or linker adds additional code (although it might), but rather that it won't try hard to pack your data and code into the smallest possible space.
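To put a number on the "won't try hard to pack" point, here is a minimal sketch of alignment padding. It assumes a 4096-byte page size and that file offsets follow the same alignment as the addresses in the question's `size -A` output (.text at 4096, .eh_frame_hdr at 8192); `readelf -S` on the real file would confirm or refute that. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `align`.
 * `align` must be a non-zero power of two. */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Padding bytes needed after `len` bytes placed at `offset` so that
 * whatever comes next starts on an `align` boundary. */
size_t padding_after(size_t offset, size_t len, size_t align)
{
    size_t end = offset + len;
    return align_up(end, align) - end;
}
```

Under those assumptions, `padding_after(4096, 6, 4096)` gives 4090: the 6 bytes of code would be followed by 4090 bytes of filler before the next page-aligned section begins.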

In your case, I note that you're using dynamic linking, even though nothing is actually linked. Using "-static" will shave off about 8kB. "-s" (strip) will get rid of a bit more.

I don't know if it's even possible with gcc to make a truly minimal ELF executable. In your case, that ought to be about 400 bytes, nearly all of which will be the various ELF headers, section table, etc.
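As a rough check on how much of that is strictly structural, the sketch below adds up the fixed-size ELF64 structures. It assumes Linux's `<elf.h>` and a hypothetical layout with one program header and no section header table (on Linux the kernel loads from program headers, so the section table is not needed to run); `minimal_elf64_size` is an illustrative name, not part of any toolchain.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Ehdr, Elf64_Phdr (Linux/glibc header) */
#include <stddef.h>

/* The sizes are fixed by the ELF64 format; fail the build if they differ. */
static_assert(sizeof(Elf64_Ehdr) == 64, "ELF64 file header is 64 bytes");
static_assert(sizeof(Elf64_Phdr) == 56, "ELF64 program header is 56 bytes");

/* Size of a file laid out as: ELF header, one program header, code.
 * Assumption: no section header table and no other sections. */
size_t minimal_elf64_size(size_t code_bytes)
{
    return sizeof(Elf64_Ehdr) + sizeof(Elf64_Phdr) + code_bytes;
}
```

For the 6 bytes of code in the question, `minimal_elf64_size(6)` comes to 126 bytes under these assumptions. The section header table, string tables, and notes that gcc and ld emit account for the rest.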

I don't know if I'm allowed to link my own website (I'm sure somebody will put me right if not), but I have an article on producing a tiny ELF executable by building it from scratch in binary:

http://kevinboone.me/elfdemo.html

Kevin Boone
  • 4
    Oops -- I forgot to say: some of the information written by gcc/ld will be for the benefit of tools like debuggers, and not actually necessary for normal execution. It will still add to the size of the file. – Kevin Boone Aug 31 '20 at 17:39
  • For 32 bit Windows programs, Visual C / C++ 4.0 will generate smaller .EXE files than current versions of Visual Studio, but I haven't investigated why. – rcgldr Aug 31 '20 at 21:24
  • @rcgldr the EXE file size can be easily changed by changing the DOS header because that header contains a real DOS program to display some message if running in DOS – phuclv Sep 01 '20 at 00:49
  • @phuclv - my last comment was about Windows 32 bit programs, not DOS programs. Visual C / C++ 4.0 produces smaller .EXE files than Visual Studio. – rcgldr Sep 01 '20 at 01:38
  • 1
    @rcgldr I'm talking about 32-bit exe files. They always contain a DOS stub to prevent them from being executed in DOS, or in some cases to produce "fat" binary that can run in either environments – phuclv Sep 01 '20 at 02:05
  • @phuclv - OK, but the difference in size of a .EXE between Visual C / C++ 4.0 and Visual Studio is much greater than that. – rcgldr Sep 01 '20 at 03:44
  • @rcgldr that depends on lots of things. The standard library in VC4 is probably very simple so the linked file would be smaller. Later versions may do more inlining so the size would be bigger – phuclv Sep 01 '20 at 03:47
5

There are many different executable file formats: .com, .exe, .elf, .coff, a.out, etc. Besides the machine code, they typically contain other sections (.text (code), .data, .bss, .rodata and possibly others; the names depend on the toolchain), and they often carry symbol and debugging information as well. Notice how your disassembly showed the label _start? That is a string stored in the file, among others, together with the information needed to connect that string to an address for debugging. The output of objdump also showed that you are using an ELF file. You can easily look up the file format and write your own program to parse the file, or use readelf and other tools to see what is in there (at a high level, not raw).
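A minimal sketch of such a parser, assuming Linux's `<elf.h>`, a 64-bit ELF image already read into memory, and a host whose byte order matches the file's (little-endian on x86-64). The `elf_summary` struct and the `elf64_summarize` name are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Ehdr, ELFMAG, EI_CLASS (Linux/glibc header) */
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of the fields this sketch extracts. */
struct elf_summary {
    unsigned long long entry;  /* e_entry: virtual address of the entry point */
    unsigned phnum;            /* e_phnum: number of program headers */
    unsigned shnum;            /* e_shnum: number of section headers */
};

/* Fill *out from the ELF header at the start of buf.
 * Returns 0 on success, -1 if buf is too short or not a 64-bit ELF image. */
int elf64_summarize(const unsigned char *buf, size_t len, struct elf_summary *out)
{
    Elf64_Ehdr eh;

    assert(buf != NULL && out != NULL);
    if (len < sizeof eh)
        return -1;
    memcpy(&eh, buf, sizeof eh);                 /* copy to avoid unaligned reads */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;                               /* missing "\177ELF" magic */
    if (eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;                               /* not a 64-bit file */
    out->entry = eh.e_entry;
    out->phnum = eh.e_phnum;
    out->shnum = eh.e_shnum;
    return 0;
}
```

To try it on a real file, read the whole file into a buffer (for example with `fread`) and pass the buffer and its length; the numbers should agree with what `readelf -h` prints.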

On an operating system where programs are generally loaded into RAM and then run (not always, but think PC), you first and foremost need a file format that the operating system supports. There is no reason for an OS to support more than one format, but it might. It depends on the OS and system design, but the OS may be designed not only to load the code but also to load or initialize the data (.data, .bss). When booting, say, an MCU, you need to embed the data in the binary blob, and the application itself copies the data from flash to RAM. Under an OS that isn't necessarily required, but for the OS to do that work it needs a file format that can distinguish the sections, their target locations, and their sizes. That means extra bytes in the file to describe all this.

A binary normally includes bootstrap code that runs before the C-generated code is entered. That bootstrap depends on the system and on the C library (several C libraries can be used on one computer, and the bootstrap is generally specific to the C library, not to the target, the operating system, or the compiler). So some percentage of the file is bootstrap code too, and when your main program is very tiny, a lot of the file size is overhead.

You can, for example, use strip to make the file smaller by getting rid of symbols and other non-essential items. The file size should go down, but the objdump disassembly will then have no labels. x86 is a variable-length instruction set, which is difficult at best to disassemble, and it gets much harder without labels. The output may not reflect the actual instructions with or without labels, but without them the GNU disassembler cannot resynchronize at the labels, which can make the output worse.
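One way to see what strip took away is to check whether a symbol table section is still present. The sketch below walks the section headers of an in-memory ELF64 image looking for `SHT_SYMTAB`. It assumes Linux's `<elf.h>` and a host byte order that matches the file; `elf64_has_symtab` is a hypothetical helper name.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Ehdr, Elf64_Shdr, SHT_SYMTAB (Linux/glibc header) */
#include <stddef.h>
#include <string.h>

/* Returns 1 if the image has a SHT_SYMTAB section, 0 if it has none,
 * and -1 if the image is not a well-formed 64-bit ELF file. */
int elf64_has_symtab(const unsigned char *buf, size_t len)
{
    Elf64_Ehdr eh;
    Elf64_Shdr sh;

    assert(buf != NULL);
    if (len < sizeof eh)
        return -1;
    memcpy(&eh, buf, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;
    if (eh.e_shnum != 0 && eh.e_shentsize != sizeof sh)
        return -1;                        /* unexpected section header size */

    for (size_t i = 0; i < eh.e_shnum; i++) {
        size_t off = eh.e_shoff + i * sizeof sh;
        if (off > len || len - off < sizeof sh)
            return -1;                    /* header would run past the buffer */
        memcpy(&sh, buf + off, sizeof sh);
        if (sh.sh_type == SHT_SYMTAB)
            return 1;                     /* symbols (labels) are still there */
    }
    return 0;
}
```

A result of 0 on a stripped binary would be consistent with the missing labels in the disassembly, since `.symtab` is where names like `_start` are recorded.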

halfer
old_timer
  • In this particular example there is no "bootstrap code" because `-nostdlib` is being used. On the flip side, this means that `_start` was not called using standard C calling conventions, which could lead to trouble if the program were more complex. – Nate Eldredge Aug 31 '20 at 21:25
2

If you use clang 10.0 and lld 10.0 and strip out unnecessary sections you can get the size of a 64-bit statically linked executable to under 800 bytes.

$ cat minimal.c
void _start(void)
{
    int i = 0;

    while (i < 11) {
       i++;
    }

    asm( "int $0x80" :: "a"(1), "b"(i) );
}

$ clang -static -nostdlib -flto -fuse-ld=lld -o minimal minimal.c
$ ls -l minimal
-rwxrwxr-x 1 fpm fpm 1376 Sep  4 17:38 minimal

$ readelf --string-dump .comment minimal
String dump of section '.comment':
  [     0]  Linker: LLD 10.0.0
  [    13]  clang version 10.0.0 (Fedora 10.0.0-2.fc32)

$ readelf -W --section-headers minimal
There are 9 section headers, starting at offset 0x320:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .note.gnu.build-id NOTE            0000000000200190 000190 000018 00   A  0   0  4
  [ 2] .eh_frame_hdr     PROGBITS        00000000002001a8 0001a8 000014 00   A  0   0  4
  [ 3] .eh_frame         PROGBITS        00000000002001c0 0001c0 00003c 00   A  0   0  8
  [ 4] .text             PROGBITS        0000000000201200 000200 00002a 00  AX  0   0 16
  [ 5] .comment          PROGBITS        0000000000000000 00022a 000040 01  MS  0   0  1
  [ 6] .symtab           SYMTAB          0000000000000000 000270 000048 18      8   2  8
  [ 7] .shstrtab         STRTAB          0000000000000000 0002b8 000055 00      0   0  1
  [ 8] .strtab           STRTAB          0000000000000000 00030d 000012 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

$ strip -R .eh_frame_hdr -R .eh_frame minimal
$ strip -R .comment -R .note.gnu.build-id minimal
strip: minimal: warning: empty loadable segment detected at vaddr=0x200000, is this intentional?

$ readelf -W --section-headers minimal
There are 3 section headers, starting at offset 0x240:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        0000000000201200 000200 00002a 00  AX  0   0 16
  [ 2] .shstrtab         STRTAB          0000000000000000 00022a 000011 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

$ ll minimal
-rwxrwxr-x 1 fpm fpm 768 Sep  4 17:45 minimal
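The two file sizes above can be cross-checked against the readelf output. This is a small sketch that assumes the section header table is the last thing in the file, which the numbers in the transcript are consistent with; `elf64_size_from_shdrs` is a made-up helper name, and `<elf.h>` is the Linux header.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Shdr (Linux/glibc header) */
#include <stddef.h>

/* File size when the section header table sits at the very end:
 * offset of the table plus one 64-byte entry per section. */
size_t elf64_size_from_shdrs(size_t shoff, size_t shnum)
{
    static_assert(sizeof(Elf64_Shdr) == 64, "ELF64 section header is 64 bytes");
    return shoff + shnum * sizeof(Elf64_Shdr);
}
```

Before stripping, 9 headers starting at offset 0x320 give 800 + 9 × 64 = 1376 bytes. After stripping, 3 headers starting at 0x240 give 576 + 3 × 64 = 768 bytes. Both match the listed file sizes, so in the final file 192 of the 768 bytes are the section header table itself.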
fpmurphy