I was wondering what exactly is stored in a .o or a .so file that results from compiling a C++ program. This post gives quite a good overview of the compilation process and the role of a .o file in it. As far as I understand from it, .a and .so files are just multiple .o files merged into a single file that is linked in a static (.a) or dynamic (.so) way.

But I wanted to check if I understand correctly what is stored in such a file. After compiling the following code

void f();
void f2(int);

const int X = 25;

void g() {
  f();
  f2(X);
}

void h() {
  g();
}

I would expect to find the following items in the .o file:

  • Machine code for g(), containing some placeholder addresses where f() and f2(int) are called.
  • Machine code for h(), with no placeholders
  • Machine code for X, which would be just the number 25
  • Some kind of table that specifies at which addresses in the file the symbols g(), h() and X can be found
  • Another table that specifies which placeholders were used to refer to the undefined symbols f() and f2(int), which have to be resolved during linking.

Then a program like nm would list all the symbol names from both tables.
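
To make the two tables concrete, here is a rough sketch of that mental model in C++. It is only a toy: the struct names, the function names and the offsets are made up for illustration, and real object formats such as ELF store considerably more. The symbol names are written the way GCC or Clang would mangle them under the Itanium C++ ABI, which is an assumption about the toolchain. X is left out of the sketch, since whether it shows up depends on its linkage and on optimization.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of an object file; real ELF sections and records are more involved.
struct Symbol {
    std::string name;    // mangled name, as nm would print it
    bool defined;        // false: must be supplied by another object at link time
    std::size_t offset;  // position within the code (unused if not defined)
};

struct Relocation {
    std::size_t offset;  // where in the machine code the placeholder sits
    std::string target;  // symbol whose address has to be patched in
};

struct ObjectFile {
    std::vector<Symbol> symbols;
    std::vector<Relocation> relocations;
};

// Hypothetical contents for the example above; all offsets are invented.
ObjectFile example_object() {
    return ObjectFile{
        {{"_Z1gv", true, 0},     // g(), defined here
         {"_Z1hv", true, 32},    // h(), defined here
         {"_Z1fv", false, 0},    // f(), undefined
         {"_Z2f2i", false, 0}},  // f2(int), undefined
        {{5, "_Z1fv"}, {15, "_Z2f2i"}}};
}

// Names the linker still has to find in some other object or library.
std::vector<std::string> undefined_symbols(const ObjectFile& obj) {
    std::vector<std::string> out;
    for (const auto& s : obj.symbols)
        if (!s.defined) out.push_back(s.name);
    return out;
}
```

Calling `undefined_symbols(example_object())` yields the two names that would have to be resolved during linking.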

I suppose that the compiler could optimize the call f2(X) by calling f2(25) instead, but it would still need to keep the symbol X in the .o file since there is no way to know if it will be used from a different .o file.

Would that be about correct? Is it the same for .a and .so files?

Thanks for your help!

PieterNuyts
  • There are bits (0,1) in those files... ;) [I couldn't resist, I'm 100% sure there is a "good" answer somewhere and someone will mark this as a dupe] – Mats Petersson Jul 02 '15 at 08:39
  • In C++, as opposed to C, all namespace-level objects declared `const` have internal linkage unless explicitly 'published' with `extern`. Thus, in your case, a C++ compiler has the right to optimize `X` away. As to your question, I recommend experimenting with the `objdump` and `readelf` utilities, which let you explore the contents of `.o`, `.a` and `.so` files. – ach Jul 02 '15 at 09:38
  • @Mats: Show me a file that doesn't have bits in it :-) [I know, an empty file...] – PieterNuyts Jul 02 '15 at 09:53
  • @Andrey: Thanks! I didn't know that. Does this also hold for inline functions? – PieterNuyts Jul 02 '15 at 09:59
  • Inline functions work somewhat differently. The C++ standard says every compilation unit (CU) using an inline function *must* have it defined. Therefore, if an inline function is not used in a CU, the compiler is entitled (but not obliged) to throw it away. However, GCC has [certain pragmas](https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Interface.html) to mitigate code bloat. – ach Jul 02 '15 at 10:16
  • What about static class variables? I suppose they can't be thrown away by default? – PieterNuyts Jul 03 '15 at 07:31
  • They certainly cannot. There is no way to declare that a static data member of a class cannot be accessed from another compilation unit. – ach Jul 03 '15 at 08:00
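
The linkage cases discussed in these comments can be put side by side in one small translation unit. This is only a sketch: `Y`, `twice`, `Config` and `use_all` are hypothetical names added for illustration, and what actually ends up in the object file still depends on the compiler and the optimization level, so it is worth checking with `nm` or `objdump`.

```cpp
// Namespace-scope const: internal linkage in C++, so the compiler may fold
// the value into its uses and emit no symbol for it at all.
const int X = 25;

// Explicit extern gives external linkage: other translation units may refer
// to Y, so the compiler generally has to keep a definition around.
extern const int Y = 25;

// Inline function: must be defined in every translation unit that uses it;
// if it is never used here, it may be dropped from the object file.
inline int twice(int v) { return 2 * v; }

struct Config {
    // Static data member: declared in the class, defined exactly once below,
    // and reachable from any translation unit that sees the class.
    static int level;
};
int Config::level = 3;

// Uses everything above so that nothing is trivially unused.
int use_all() { return twice(X) + Y + Config::level; }
```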

2 Answers

You're pretty much correct in the general idea for object files. In the "table that specifies at which addresses in the file" I would replace "addresses" with "offsets", but that's just wording.

.a files are just archives (an old format that predates tar, but does the same thing). You could replace .a files with tar files as long as you taught the linker to unpack them and link with the .o files they contain. (More or less: there's a little more logic so that object files in the archive that aren't needed don't get linked, but that's just an optimization.)
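
That extra logic can be sketched roughly as follows. This is a toy model, not how any particular linker is implemented: `Member` and `select_members` are hypothetical names, and a real linker works from the archive's symbol index and follows ordering rules that are ignored here. The idea is just that a member is pulled in only if it defines something that is currently undefined, and pulling it in may add new undefined symbols.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One .o file inside a hypothetical archive.
struct Member {
    std::string name;
    std::set<std::string> defines;  // symbols this object provides
    std::set<std::string> needs;    // symbols this object leaves undefined
};

// Returns the names of the members that get linked, in selection order.
std::vector<std::string> select_members(std::set<std::string> undefined,
                                        const std::vector<Member>& archive) {
    std::vector<std::string> picked;
    std::set<std::string> defined;
    std::vector<bool> used(archive.size(), false);
    bool progress = true;
    while (progress) {  // repeat until a full pass adds nothing new
        progress = false;
        for (std::size_t i = 0; i < archive.size(); ++i) {
            if (used[i]) continue;
            bool wanted = false;
            for (const auto& s : archive[i].defines)
                if (undefined.count(s)) { wanted = true; break; }
            if (!wanted) continue;  // defines nothing we are missing: skip it
            used[i] = true;
            progress = true;
            picked.push_back(archive[i].name);
            for (const auto& s : archive[i].defines) {
                defined.insert(s);
                undefined.erase(s);
            }
            for (const auto& s : archive[i].needs)
                if (!defined.count(s)) undefined.insert(s);
        }
    }
    return picked;
}
```

With an archive holding `a.o` (defines `f`, needs `helper`), `b.o` (defines `helper`) and `c.o` (defines `unused`), asking for `f` and `f2` selects `a.o` and `b.o` and leaves `c.o` out; `f2` simply stays undefined.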

.so files are different. They are closer to a final binary than to an object file. An .so file with all symbols resolved can, at least theoretically, be run as a program. In fact, with PIE (position-independent executables) the difference between a shared library and a program is (at least in theory) just a few bits in the header. They contain instructions for the dynamic linker on how to load the library (more or less the same instructions as in a normal program) and a relocation table that tells the dynamic linker how to resolve the external symbols (again, the same as in a program). All unresolved symbols in a dynamic library (and in a program) are accessed through indirection tables, which get populated at dynamic linking time (program start or dlopen).

If we simplify this a lot, the difference between objects and shared libraries is that much more work has been done in the shared library to avoid text relocation (this is not strictly necessary or enforced, but it's the general idea). In object files the assembler has only generated placeholders for addresses, which the linker then fills in. In a shared library the addresses are filled in with addresses of jump table entries, so that the text of the library doesn't need to be changed, only a limited jump table.
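
The jump table idea can be imitated in plain C++ with a table of function pointers. This is only an analogy under made-up names (`real_f2`, `jump_table`, `toy_dynamic_link`, `call_f2`); in ELF the real mechanism is the PLT and GOT, filled in by the dynamic linker. The point it shows is that the calling code never changes, only the table entry does.

```cpp
using Fn = int (*)(int);

// Stands in for a function that lives in some other shared library.
int real_f2(int x) { return x + 1; }

// Toy jump table: one slot per external function, empty until "load time".
Fn jump_table[1] = {nullptr};

// Plays the role of the dynamic linker: patches the table, not the code.
void toy_dynamic_link() { jump_table[0] = &real_f2; }

// "Library text": always calls through the slot, so it never needs patching.
// It must not be called before toy_dynamic_link() has filled the slot.
int call_f2(int x) { return jump_table[0](x); }
```

After `toy_dynamic_link()` has run, `call_f2(25)` reaches `real_f2` through the table.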

By the way, I'm talking about ELF here. Older formats had more differences between programs and libraries.

Art
  • A `.so` can't contain a `main()` function, can it? Maybe this would be worth pointing out too, given that the `main()` is the entry point for an executable, or am I mistaken on that? – Mäx Müller Jul 02 '15 at 09:39
  • So would it make sense to say that if a .so file contains unresolved symbols, something probably went wrong? – PieterNuyts Jul 02 '15 at 09:55
  • @MäxMüller You are mistaken. I've seen plenty of shared libraries with `main`. `main` is not the entry point of the executable. The entry point is just an address (or offset) in the header. There's plenty of code that is run in a normal program before main, setting up stdio, setting up environment variables, resolving shared libraries, etc. Even the simplest program should have an entry point that at least does something like: `exit(main(argc, argv));` to handle exiting by returning from `main`. – Art Jul 02 '15 at 10:11
  • @PieterNuyts Shared libraries contain plenty of unresolved symbols, usually to be resolved from other shared libraries, usually libc. Nothing prevents you from creating shared libraries that resolve symbols from the main program either. – Art Jul 02 '15 at 10:13
  • Interesting. That's why I love this site, so much to learn and many great people to answer your questions. Thanks for the insight! – Mäx Müller Jul 02 '15 at 10:57
  • @MäxMüller I made a thing to verify that I'm not remembering things wrong (aka. lying): https://github.com/art4711/shared_main – Art Jul 02 '15 at 11:23
  • Thanks. In the meantime I figured out what my mistake was. I thought my source code had to include a `main`, but the truth is that only the resulting executable has to contain a `main` (no matter where it comes from). – Mäx Müller Jul 02 '15 at 12:14

What you described in your question (machine code for functions, initialization data and relocation tables) is pretty much exactly what is inside .o (object) and .so (shared object) files.

.a (archives) are basically multiple .o (object) files bunched together for easier reference during linking. ("Link libraries")

.so (shared object) files include some additional metadata, like which other .so's would need to be linked in. (xyz.so might reference some functions that reside in abc.so, and the information that abc.so would need to be linked in, plus optionally the path where to find abc.so (the RPATH), need to be encoded in xyz.so.)

Windows .dll (dynamic link library) files play basically the same role as shared objects (.so), under a different name and in a different file format.

Disclaimer: This is simplifying things significantly, but is close enough to "The Truth (tm)" to serve for everyday developer needs.

DevSolar