(I briefly look at the building blocks of C programs and then examine the build steps normally hidden behind gcc calls.)
Traditional compiled languages, for example C and C++, are organized in source files which normally are, one by one, "compiled" into one "object file" each. Each source file, after all its include directives have been processed, is one "translation unit". (Therefore, a translation unit typically consists of more than one file, and the same include file typically occurs in more than one translation unit; files and translation units have, strictly speaking, an n:m relation. But pragmatically one can say that a "translation unit" is a C file.)
To compile a single source file into an object file, one passes the -c flag to the compiler:
gcc -c myfile.c
This creates myfile.o, or perhaps myfile.obj, in the same directory.
Object files contain machine code and data (and potentially debug information, but we ignore that here). The machine code contains functions, and the data comes in the shape of variables. Both functions and variables in the object files have names which are called "symbols". Depending on the platform, the compiler may transform the variable and function names in the program, for example by prepending an underscore, and in C++ the generated ("mangled") name contains information about the type and, for functions, the parameters.
Some symbols, for example the names of global variables and normal functions, are usable from other object files; they are "exported".
A symbol can, with only slight simplification, be thought of as an address alias: For a function, the name is an alias for the target address of a jump; for variables, the name is an alias for the address of a memory location from which the program can read and to which it can write.
Your file help.c contains the code for the function herp. Functions in C have "external linkage" by default, that is, they can be used from other translation units. Their name, the "symbol", is exported.
In modern C, a source file using a name defined in a different translation unit must declare the name. This tells the compiler what to do with it, and in which ways it can syntactically be used in the source code (e.g., call a function, assign to a variable, index an array). The compiler produces code that reads from this "symbolic address" or jumps to that "symbolic address"; it is the linker's job to replace all those symbolic addresses with "real" memory locations that point to existing data and code in the final executable, so that the jumps and memory accesses are landing at the desired locations.
The declaration of a name (function, variable) in the file that's using it can be "manual": a line like void herp(); appearing directly in your file before the first use. More typically though, the names defined in a translation unit that other translation units can use are declared in a header file, your help.h. The using translation unit uses the "canned" declarations in the header file by #include-ing it. There is no magic here; an include directive simply inserts the include file text as if it were written in the file directly. There is exactly zero difference. In particular, including a header file does not tell the linker to link with the corresponding source file. The reason is simple: The linker never knows about the included file because that piece of knowledge is erased during the compilation into an object file.
This means in your case that help.c must be compiled, and that the linker must be told to combine ("link") it with the rest of the program, in your case the code from the compilation of main.c.
The discussion of how that is done is a bit more difficult because this procedure is so common that the typical C compiler integrates the compilation and link stages: gcc -o myprog help.c main.c simply does everything necessary to create an executable myprog.
When we say "compiler", e.g. referring to gcc, we normally actually mean the "compiler driver", which takes the commands and files from the command line and performs the necessary steps to achieve the desired result, like producing an executable program from our sources. The actual C compiler behind gcc is cc1, which produces an assembly file that must be "assembled" with as into an object file. After the source files are compiled, gcc calls the linker with the appropriate options, which produces the executable.
Here is a sample session detailing the stages:
$ ls
Makefile help.c help.h main.c
$ /lib/gcc/x86_64-pc-cygwin/7.4.0/cc1 main.c
main
Analyzing compilation unit
Performing interprocedural optimizations
<*free_lang_data> <visibility> <build_ssa_passes> <opt_local_passes> <targetclone> <free-inline-summary> <emutls> <whole-program> <inline>Assembling functions:
<materialize-all-clones> <simdclone> main
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 (22%) wall 1184 kB (86%) ggc
TOTAL : 0.00 0.00 0.01 1374 kB
$ ls
Makefile help.c help.h main.c main.s
$ /lib/gcc/x86_64-pc-cygwin/7.4.0/cc1 help.c
herp
Analyzing compilation unit
Performing interprocedural optimizations
<*free_lang_data> <visibility> <build_ssa_passes> <opt_local_passes> <targetclone> <free-inline-summary> <emutls> <whole-program> <inline>Assembling functions:
<materialize-all-clones> <simdclone> herp
Execution times (seconds)
phase setup : 0.01 (100%) usr 0.00 ( 0%) sys 0.00 (33%) wall 1184 kB (86%) ggc
TOTAL : 0.01 0.00 0.01 1370 kB
$ ls
Makefile help.c help.h help.s main.c main.s
We now have two assembly files, main.s and help.s, which can be assembled into object files with the assembler as. But let's have a quick look at help.s:
$ cat help.s
.file "help.c"
.text
.globl some_variable
.data
.align 4
some_variable:
.long 1
.text
.globl herp
.def herp; .scl 2; .type 32; .endef
.seh_proc herp
herp:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
.seh_endprologue
nop
popq %rbp
ret
.seh_endproc
.ident "GCC: (GNU) 7.4.0"
Even if we know nothing about assembly language we can clearly identify the symbols some_variable and herp, which are assembly labels.
Ah yes, I forgot that I added a variable definition to help.c:
$ cat help.c
#include "help.h"
int some_variable = 1;
void herp() {}
We can assemble the assembly files with the assembler as:
$ as main.s -o main.o
$ ls
Makefile help.c help.h help.s main.c main.o main.s
$ as help.s -o help.o
$ ls
Makefile help.c help.h help.o help.s main.c main.o main.s
Now we have two object files. We can see which symbols are exported ("extern") or needed ("undefined") with the utility nm, which lists the symbols of an object file:
$ nm --extern-only help.o
0000000000000000 T herp
0000000000000000 D some_variable
$ nm --extern-only main.o
U __main
U herp
"T" indicates that a symbol is in the "text" section, which contains code; "D" is the data section, and "U" stands for "undefined". (The undefined __main
is a gcc and/or cygwin quirk.)
Here you have the source of your problem: Unless you pair your main.o with an object file that defines that undefined symbol, the linker cannot "resolve" the name and cannot produce the jump. There is no jump destination.
Now we can link the two object files into an executable. Cygwin requires us to link against cygwin1.dll; sorry for the complication.
$ ld main.o help.o /bin/cygwin1.dll -o main
$ ls
Makefile help.c help.h help.o help.s main* main.c main.o main.s
That's about it. I should add that the program doesn't run properly: it doesn't end and doesn't react to Ctrl-C. I may be missing some GNU or Windows build intricacies that gcc takes care of for us, such as linking in the startup code.
Ah, Makefiles. Makefiles consist of target definitions and dependencies of these targets: A line
main: help.o main.o
specifies a target "main" depending on the two .o files.
Makefiles normally also contain rules specifying how to produce a target. But Make has built-in rules; it knows that you call the compiler to produce a .o file from a .c file (and it automatically considers this dependency), and it knows that you link the .o files together to produce the target depending on them, provided the target has the same name as one of the .o files.
Therefore, we don't need any rules: We simply define the non-implicit dependencies. The entire Makefile for your project boils down to:
$ cat Makefile
CC=gcc
main: help.o main.o
help.o: help.h
main.o: help.h
CC=gcc specifies the C compiler to use. CC is a built-in make variable specifying the C compiler (CXX would specify the C++ compiler, e.g. g++).
Let's see:
$ make
gcc -c -o main.o main.c
gcc -c -o help.o help.c
gcc main.o help.o -o main
$ ls
Makefile help.c help.h help.o main.c main.exe* main.o
Do the dependencies work?
$ make
make: 'main' is up to date.
$ touch main.c
$ make
gcc -c -o main.o main.c
gcc main.o help.o -o main
$ touch help.h
$ make
gcc -c -o main.o main.c
gcc -c -o help.o help.c
gcc main.o help.o -o main
That looks good: after touching a single source file, make recompiles only that file, but touching the header on which both files depend makes make recompile both. The linking needs to be done in any case.