75

I come from a scripting background and the preprocessor in C has always seemed ugly to me. Nonetheless, I have embraced it as I learn to write small C programs. I am only really using the preprocessor for including the standard libraries and header files I have written for my own functions.

My question is: why don't C programmers just skip all the includes, simply concatenate their C source files, and then compile the result? If you put all of your includes in one place, you would only have to define what you need once, rather than in all your source files.

Here's an example of what I'm describing, using three files:

// includes.c
#include <stdio.h>
// main.c
int main() {
    foo();
    printf("world\n");
    return 0;
}
// foo.c
void foo() {
    printf("Hello ");
}

By doing something like cat *.c > to_compile.c && gcc -o myprogram to_compile.c in my Makefile, I can reduce the amount of code I write.

This means that I don't have to write a header file for each function I create (because they're already in the main source file) and it also means I don't have to include the standard libraries in each file I create. This seems like a great idea to me!

However, I realise that C is a very mature programming language, and I imagine that someone a lot smarter than me has already had this idea and decided against it. Why not?

Russia Must Remove Putin
  • 20
    Your example is wrong: a prototype is required for `foo`. – LPs Feb 09 '17 at 11:29
  • 11
    Read about forward declarations and you will understand what is wrong with having only one .c file. And also that would be a crazy mess. – Alexander Lapenkov Feb 09 '17 at 11:31
  • 1
    C is just not one of the modern languages where symbols can be defined just *somewhere* in a project context. It has very strict rules about name scopes and resolution. Use a different language if you want a project scope for named identifiers. – grek40 Feb 09 '17 at 11:32
  • @AlexandrLapenkov I'm not suggesting writing my code in one giant file, I agree that would be a mess. I'm talking about writing my code in lots of different files but using a Makefile to concatenate them at the last step? –  Feb 09 '17 at 11:46
  • 61
    The **purpose** of `make` is *to determine which source files need to be recompiled, and which don't*, to reduce recompile times during development. You're asking about replacing that with a script that just does "compile everything" every time. The very existence of `make` should give a hint at the answer. ;-) – DevSolar Feb 09 '17 at 11:53
  • 4
    Always pass `-Wall -g` to `gcc` – Basile Starynkevitch Feb 09 '17 at 12:03
  • 9
    If you concatenate all your source files, what happens if multiple source files each define a file-scope variable with the same name? For example, two different libraries might use `static size_t count` internally, to count completely different things; what happens if the linker thinks they're the same variable? – Justin Time - Reinstate Monica Feb 10 '17 at 01:23
  • 2
    Also, what if you change one line in one source file, in a project with one million source files? Do you want to effectively compile all 1,000,000 files again when only one changed (by concatenating them all), or just compile the one that changed and link it with the object files from the last time you compiled the other 999,999 files? – Justin Time - Reinstate Monica Feb 10 '17 at 01:27
  • 2
    Also, try this method on, say, a BeagleBoard with 512MB of memory. The OOM kills will probably make you wish you never converted your [large] project. I can't avoid OOM kills on some C++ projects that *don't* use the unity or amalgamation build. – jww Feb 10 '17 at 02:03
  • 1
    @OhFiddyYouSoWiddy makefiles are challenging, especially when you are new to them. For small projects, the time to compile via a script vs. a proper makefile might be similar. **The benefits of makefiles outweigh the costs when there are thousands/millions of files: if you edit one line in one file, recompiling everything can take hours, but with a makefile the rebuild takes less than a minute.** (Personally I find makefiles to be too low level, so I use something like `cmake`, which is easier and still gets me the benefits (and more) of makefiles.) – Trevor Boyd Smith Feb 10 '17 at 15:04
  • 2
    @OhFiddyYouSoWiddy: Remember that C and its development infrastructure (Makefiles, libraries, linkers, &c) were developed when computer resources were orders of magnitude less than today's. If you concatenated a sizeable project's source files and tried to compile the result on a PDP-11, it wouldn't fit in memory, and would take ages to compile. Doing it in small chunks was the only practical way. That organization still works well: if it ain't broke, why fix it? – jamesqf Feb 10 '17 at 21:21
  • 1
    @JustinTime: The multiple-definitions issue can be avoided with “poor man's namespaces” where each definition is prefixed with the name of its module. So, `a.c` has an internal variable `A_count`, and `b.c` has an internal variable `B_count`, so no conflict. – dan04 Feb 11 '17 at 00:21
  • 1
    Why are you writing one header per function? Who told you to do that? – Lightness Races in Orbit Feb 11 '17 at 14:48
  • Just some food for thought: It's interesting to me that all comments and answers draw this implicit and unspoken line between your project, and third-party static libraries (which are essentially, sometimes literally on certain platforms, just archives of compiled translation units). All the great points below can be further reinforced by considering concatenating the source code of all required libraries to your own source and compiling as well. Proprietary source notwithstanding, there isn't really a distinction between source compiled by you vs source compiled by somebody else. – Jason C Feb 12 '17 at 16:02
  • It would be a maintenance nightmare. – Jabberwocky Feb 13 '17 at 12:48

10 Answers

106

Some software is built that way.

A typical example is SQLite. It is sometimes compiled as an amalgamation (done at build time from many source files).

But that approach has pros and cons.

Obviously, the compile time will increase by quite a lot. So it is practical only if you compile that stuff rarely.

Perhaps the compiler might optimize a bit more. But with link-time optimization (e.g. if using a recent GCC, compile and link with gcc -flto -O2) you can get the same effect (of course, at the expense of increased build time).

I don't have to write a header file for each function

Having one header file per function is the wrong approach. For a single-person project of less than a hundred thousand lines of code (100 KLOC, where KLOC = thousand lines of code), it is quite reasonable, at least for small projects, to have a single common header file (which you could pre-compile if using GCC), which will contain declarations of all public functions and types, and perhaps definitions of static inline functions (those small enough and called frequently enough to profit from inlining). For example, the sash shell is organized that way (and so is the lout formatter, with 52 KLOC).
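
As a minimal sketch of that layout (all names below are hypothetical, not taken from sash or lout), such a single common header might look like this:

// myproject.h - hypothetical single common header for a small project
#ifndef MYPROJECT_H
#define MYPROJECT_H

#include <stdio.h>

// public types
struct point { double x, y; };

// declarations of public functions, defined in the various .c files
void draw_point(struct point p);
double distance(struct point a, struct point b);

// a small, frequently called function, defined as static inline in the header
static inline struct point point_add(struct point a, struct point b)
{
    return (struct point){ a.x + b.x, a.y + b.y };
}

#endif // MYPROJECT_H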

You might also have a few header files, and perhaps some single "grouping" header which #include-s all of them (and which you could pre-compile). See for example jansson (which actually has a single public header file) and GTK (which has lots of internal headers, but most applications using it have just one #include <gtk/gtk.h>, which in turn includes all the internal headers). At the opposite extreme, POSIX has a great many header files, and it documents which ones should be included and in which order.

Some people prefer to have a lot of header files (and some even favor putting a single function declaration in its own header). I don't (for personal projects, or small projects on which only two or three people would commit code), but it is a matter of taste. BTW, when a project grows a lot, it happens quite often that the set of header files (and of translation units) changes significantly. Look also into REDIS (it has 139 .h header files and 214 .c files, i.e. translation units, totalling 126 KLOC).

Having one or several translation units is also a matter of taste (and of convenience, habits, and conventions). My preference is to have source files (that is, translation units) which are not too small, typically several thousand lines each, and often (for a small project of less than 60 KLOC) a single common header file. Don't forget to use some build automation tool like GNU make (often with a parallel build through make -j; then you'll have several compilation processes running concurrently). The advantage of having such a source file organization is that compilation is reasonably quick. BTW, in some cases a metaprogramming approach is worthwhile: some of your C "source" files (internal headers or translation units) could be generated by something else (e.g. some script in AWK, a specialized C program like bison, or your own tool).

Remember that C was designed in the 1970s, for computers much smaller and slower than your favorite laptop today (typically, memory was at that time a megabyte at most, or even a few hundred kilobytes, and the computer was at least a thousand times slower than your mobile phone today).

I strongly suggest studying the source code of, and building, some existing free software projects (e.g. those on GitHub or SourceForge or in your favorite Linux distribution). You'll learn that there are different approaches. Remember that in C, conventions and habits matter a lot in practice, so there are different ways to organize your project into .c and .h files. Read about the C preprocessor.

It also means I don't have to include the standard libraries in each file I create

You include header files, not libraries (but you do link libraries). You could include them in each .c file (and many projects do that), or you could include them in one single header and pre-compile that header, or you could have a dozen headers and include them after the system headers in each compilation unit. YMMV. Notice that preprocessing is quick on today's computers (at least when you ask the compiler to optimize, since optimization takes more time than parsing and preprocessing).

Notice that what goes into an #include-d file is conventional (and is not defined by the C specification). Some programs have some of their code in such a file (which should then not be called a "header", just an "included file", and which should then not have a .h suffix, but something else like .inc). Look for example at XPM files. At the other extreme, you might in principle not have any of your own header files (you still need header files from the implementation, like <stdio.h> or <dlfcn.h> from your POSIX system) and copy and paste duplicated code in your .c files, e.g. have the line int foo(void); in every .c file, but that is very bad practice and is frowned upon. However, some programs generate C files sharing some common content.
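
As an illustration of such an included file that is not a header (the names here are hypothetical, not taken from XPM), a common pattern is to keep a list in a .inc file and include it several times with different macro definitions:

// colors.inc - hypothetical included file: a list, not a header
COLOR(RED,   0xff0000)
COLOR(GREEN, 0x00ff00)
COLOR(BLUE,  0x0000ff)

// colors.c - includes the same .inc file twice, with different definitions of COLOR
#include <stdio.h>

enum color_id {
#define COLOR(name, rgb) COLOR_##name,
#include "colors.inc"
#undef COLOR
    COLOR_COUNT
};

static const unsigned color_rgb[COLOR_COUNT] = {
#define COLOR(name, rgb) [COLOR_##name] = rgb,
#include "colors.inc"
#undef COLOR
};

int main(void)
{
    printf("%d colors, RED = #%06x\n", (int)COLOR_COUNT, color_rgb[COLOR_RED]);
    return 0;
}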

BTW, neither C nor C++14 has modules (as OCaml does). In other words, in C a module is mostly a convention.

(Notice that having many thousands of very small .h and .c files of only a few dozen lines each may slow down your build time dramatically; having hundreds of files of a few hundred lines each is more reasonable in terms of build time.)

If you begin to work on a single-person project in C, I would suggest first having one header file (and pre-compiling it) and several .c translation units. In practice, you'll change .c files much more often than .h ones. Once you have more than 10 KLOC you might refactor that into several header files. Such a refactoring is tricky to design, but easy to do (just a lot of copying and pasting chunks of code). Other people would have different suggestions and hints (and that is OK!). But don't forget to enable all warnings and debug information when compiling (so compile with gcc -Wall -g, perhaps setting CFLAGS = -Wall -g in your Makefile). Use the gdb debugger (and valgrind...). Ask for optimizations (-O2) when you benchmark an already-debugged program. Also use a version control system like Git.

In contrast, if you are designing a larger project on which several people would work, it could be better to have several files, even several header files (intuitively, each file has a single person mainly responsible for it, with others making minor contributions to that file).

In a comment, you add:

I'm talking about writing my code in lots of different files but using a Makefile to concatenate them

I don't see why that would be useful (except in very weird cases). It is much better (and very common practice) to compile each translation unit (e.g. each .c file) into its own object file (a .o ELF file on Linux) and link them later. This is easy with make (in practice, when you change only one .c file, e.g. to fix a bug, only that file gets recompiled and the incremental build is really quick), and you can ask it to compile object files in parallel using make -j (and then your build goes really fast on your multi-core processor).

Basile Starynkevitch
  • 1
    Another example is the so-called [Unity Build](http://stackoverflow.com/questions/847974/the-benefits-disadvantages-of-unity-builds), used in C++ but I think applicable to C too. The main reason is to speed up building by reducing I/O overhead. – Andriy Tylychko Feb 09 '17 at 11:38
  • 13
    Having a single header file on any project is a terrible idea. – Jack Aidley Feb 09 '17 at 12:42
  • 7
    It depends upon the project and the header. It is common practice, and it is necessary as soon as you consider header pre-compilation. – Basile Starynkevitch Feb 09 '17 at 12:43
  • 1
    Hardware sure became more powerful since the '70s, but projects grew in size as well. Building something as large as Firefox as a single translation unit will crash on many modern systems. – Dmitry Grigoryev Feb 09 '17 at 12:52
  • 1
    @JackAidley: GTK is quite a large project in C but has one *public* header `<gtk/gtk.h>`. I won't call GTK a "terrible" project. – Basile Starynkevitch Feb 09 '17 at 12:54
  • 24
    @BasileStarynkevitch: having a single _public_ header is not the same as having a single header. And the gtk.h header does nothing but include somewhat over 200 other headers. – Jack Aidley Feb 09 '17 at 12:56
  • 1
    Again, it depends upon the project and the habits. For a small project (the one you work alone on) having a single header (with only a few thousand lines) is not that bad (and permits its pre-compilation). And some bigger projects have much less headers than translation units (look into [GCC](http://gcc.gnu.org/) as a concrete example). There is no single universal good way, it is a matter of conventions and habits. – Basile Starynkevitch Feb 09 '17 at 12:57
  • Unified building is usually faster than separate files, not slower. Provided you have enough memory. – pjc50 Feb 09 '17 at 13:44
  • 1
    You can do what `lout` does: one single `externs.h` header file, more than 50 `.c` files. I don't feel that bad for a single-person project (and `lout` compiles in 9 seconds with `make -j` on my desktop) – Basile Starynkevitch Feb 09 '17 at 13:45
  • 12
    @OhFiddyYouSoWiddy SQLite doesn't "do that". It's a way to distribute sqlite to make it easier to include in other projects: instead of a bunch of files that are themselves implementation details nobody cares for, you have one `.c` file and one `.h` file. But the amalgamation is not how the project is developed. The amalgamation is the product of a distribution build process - it's machine-generated. – Kuba hasn't forgotten Monica Feb 09 '17 at 22:03
  • 4
    @JackAidley So your "hello world" has at least two header files? – user253751 Feb 09 '17 at 23:22
  • 2
    you might want to add that the preprocessor itself allows for some kinds of metaprogramming (e.g. the contents of a header change based on `#define`s). Including a header multiple times (maybe with different `#define`s each time) does a bit more than just "copy & paste" the headers contents. Boost sometimes abuses this. – hoffmale Feb 10 '17 at 18:14
  • 1
    @BasileStarynkevitch: "Header precompilation" is a legacy feature of some C++ compilers (although it might also be supported by them for C) and makes no sense for C, so I'm not sure why you even brought it up for a question about C. – R.. GitHub STOP HELPING ICE Feb 10 '17 at 22:02
  • 2
    @hoffmale: _"Including a header multiple times (maybe with different #defines each time) does a bit more than just "copy & paste" the headers content"_ No, it doesn't. The effect you're seeing is the result of the behaviour of those `#defines` in the "copy/pasted" result. `#include` is a "copy & paste", period. – Lightness Races in Orbit Feb 11 '17 at 14:50
  • *Obviously, the compile time will increase by quite a lot*: depending on the project, compiling amalgamated source files can actually be faster. With all source files concatenated, linking is significantly faster, and header files that are included in multiple source files will only be parsed once. – David Brown Feb 11 '17 at 20:27
26

You could do that, but we like to separate C programs into separate translation units, chiefly because:

  1. It speeds up builds. You only need to rebuild the files that have changed, and those can be linked with other compiled files to form the final program (see the sketch after this list).

  2. The C standard library consists of pre-compiled components. Would you really want to have to recompile all that?

  3. It's easier to collaborate with other programmers if the code base is split up into different files.
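
As a minimal sketch of point 1, here is the program from the question split into translation units (the gcc commands are shown as comments and assume a typical GCC/Unix setup):

// foo.h
#ifndef FOO_H
#define FOO_H
void foo(void);            // the interface other translation units need
#endif

// foo.c
#include <stdio.h>
#include "foo.h"
void foo(void) {
    printf("Hello ");
}

// main.c
#include <stdio.h>
#include "foo.h"
int main(void) {
    foo();
    printf("world\n");
    return 0;
}

// Build each translation unit separately, then link the object files:
//   gcc -Wall -g -c foo.c      (produces foo.o)
//   gcc -Wall -g -c main.c     (produces main.o)
//   gcc -o myprogram foo.o main.o
// After editing only main.c, only main.o has to be rebuilt before relinking.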

Toby Speight
Bathsheba
  • 2
    I have never heard of translation units before. Thank you, I will go and learn about them. Any good tutorials off the top of your head? –  Feb 09 '17 at 11:43
  • 5
    1) is not always true, especially for C++; I've seen major build time reductions from concatenation. "translation unit" isn't something that requires a tutorial, it's just a way of saying "a C file + all its include files". It's a phrase that gets used in the C standard a lot - definitions last until the end of the translation unit. – pjc50 Feb 09 '17 at 13:47
  • 3
    @pjc50: C++, with its templates, compile time evaluation capabilities, and function overloading is an entirely different beast. (For C++ I use a distributed build environment, but still spend an inordinate amount of time on this site during compilations.) – Bathsheba Feb 09 '17 at 13:49
  • 3) doesn't hold. With cat under your belt, as the OP even suggested, you can still have the "code base split up into different files" and still actually compile a single big file. – mihai Jan 31 '18 at 12:37
18

Your approach of concatenating .c files is completely broken:

  • Even though the command cat *.c > to_compile.c will put all functions into a single file, order matters: You must have each function declared before its first use.

    That is, you have dependencies between your .c files which force a certain order. If your concatenation command fails to honor this order, you won't be able to compile the result.

    Also, if you have two functions that recursively use each other, there is absolutely no way around writing a forward declaration for at least one of the two (see the sketch after this list). You may as well put those forward declarations into a header file, where people expect to find them.

  • When you concatenate everything into a single file, you force a full rebuild whenever a single line in your project changes.

    With the classic .c/.h split compilation approach, a change in the implementation of a function necessitates recompilation of exactly one file, while a change in a header necessitates recompilation of the files that actually include this header. This can easily speed up the rebuild after a small change by a factor of 100 or more (depending on the count of .c files).

  • You lose all ability to compile in parallel when you concatenate everything into a single file.

    Have a big fat 12-core processor with hyper-threading enabled? Pity, your concatenated source file is compiled by a single thread. You just lost a speedup of a factor greater than 20... Ok, this is an extreme example, but I have built software with make -j16 already, and I tell you, it can make a huge difference.

  • Compilation times are generally not linear.

    Usually compilers contain at least some algorithms that have quadratic runtime behavior. Consequently, there is usually some threshold beyond which aggregated compilation is actually slower than compilation of the independent parts.

    Obviously, the precise location of this threshold depends on the compiler and the optimization flags you pass to it, but I have seen a compiler take over half an hour on a single huge source file. You don't want to have such an obstacle in your change-compile-test loop.
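
For the mutual-recursion point above, here is a minimal sketch (with hypothetical is_even/is_odd functions) of why at least one forward declaration is unavoidable:

#include <stdbool.h>
#include <stdio.h>

static bool is_odd(unsigned n);   // forward declaration: is_odd is called before it is defined

static bool is_even(unsigned n) { return n == 0 ? true  : is_odd(n - 1);  }
static bool is_odd (unsigned n) { return n == 0 ? false : is_even(n - 1); }

int main(void) {
    printf("%d\n", is_even(10));  // prints 1
    return 0;
}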

Make no mistake: Even though it comes with all these problems, there are people who use .c file concatenation in practice, and some C++ programmers get pretty much to the same point by moving everything into templates (so that the implementation is found in the .hpp file and there is no associated .cpp file), letting the preprocessor do the concatenation. I fail to see how they can ignore these problems, but they do.

Also note that many of these problems only become apparent with larger project sizes. If your project is less than 5000 lines of code, it hardly matters how you compile it. But when you have more than 50000 lines of code, you definitely want a build system that supports incremental and parallel builds. Otherwise, you are wasting your working time.

cmaster - reinstate monica
  • 1
    Btw, an anecdote. One of my larger, long-term projects, C++, compiled with Borland's compiler, has a significant number of generated source files containing thousands of lines of function calls. Some weird quirk of their compiler caused compilation times to increase exponentially with respect to the number of calls, to the point where a 5000-line file would take on the order of minutes. I still can't explain it, but rebuilding the project was a nightmare. Ultimately I switched to generating arrays of data and looping over them, which weirdly decreased compilation times from minutes to milliseconds, but... – Jason C Feb 12 '17 at 16:18
  • 1
    ... Some of the older files that we didn't regenerate remain in the project to this day, constantly traumatizing us with long build times. I had to explicitly structure the project to ensure those files were in their own compilation units specifically to avoid rebuilding them. Gcc and msvc didn't choke on them, it was purely a borland quirk. – Jason C Feb 12 '17 at 16:20
16
  • With modularity, you can share your library without sharing the code.
  • For large projects, if you change a single file, you would end up compiling the complete project.
  • You may run out of memory more easily when you attempt to compile large projects.
  • You may have circular dependencies between modules; modularity helps in managing those (see the sketch below).

There may be some gains in your approach, but for languages like C, compiling each module separately makes more sense.
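
As a small sketch of the circular-dependency point (hypothetical types, not from any particular project): with separate headers, each module can forward-declare the other's type instead of pulling in its whole definition:

// parent.h
#ifndef PARENT_H
#define PARENT_H
struct child;                  // forward declaration breaks the include cycle
struct parent {
    struct child *first_child;
};
#endif

// child.h
#ifndef CHILD_H
#define CHILD_H
struct parent;                 // forward declaration breaks the include cycle
struct child {
    struct parent *owner;
};
#endif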

Basile Starynkevitch
Mohit Jain
15

Because splitting things up is good program design. Good program design is all about modularity, autonomous code modules, and code re-usability. As it turns out, common sense will get you very far when doing program design: Things that don't belong together shouldn't be placed together.

Placing unrelated code in different translation units means that you can localize the scope of variables and functions as much as possible.

Merging things together creates tight coupling, meaning awkward dependencies between code files that really shouldn't even have to know about each other's existence. This is why a "global.h" which contains all the includes in a project is a bad thing, because it creates tight coupling between every unrelated file in your whole project.

Suppose you are writing firmware to control a car. One module in the program controls the car's FM radio. Then you re-use the radio code in another project, to control the FM radio in a smartphone. And then your radio code won't compile because it can't find brakes, wheels, gears, etc.: things that don't make the slightest sense for the FM radio, let alone the smartphone, to know about.
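
A minimal sketch, with hypothetical names, of what such a self-contained radio interface could look like; nothing in it mentions brakes, wheels or gears, so the module can be reused unchanged:

// fm_radio.h - hypothetical interface of the radio module
#ifndef FM_RADIO_H
#define FM_RADIO_H
#include <stdint.h>

void     fm_radio_init(void);
void     fm_radio_tune(uint32_t frequency_hz);
uint32_t fm_radio_current_frequency(void);

#endif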

What's even worse is that if you have tight coupling, bugs escalate throughout the whole program, instead of staying local to the module where the bug is located. This makes the consequences of a bug far more severe. You write a bug in your FM radio code and suddenly the brakes of the car stop working, even though you haven't touched the brake code in the update that contained the bug.

If a bug in one module breaks completely unrelated things, it is almost certainly because of poor program design. And a sure way to achieve poor program design is to merge everything in your project together into one big blob.

Peter Mortensen
Lundin
11

Header files should define interfaces - that's a desirable convention to follow. They aren't meant to declare everything that's in a corresponding .c file, or a group of .c files. Instead, they declare all functionality in the .c file(s) that is available to their users. A well-designed .h file serves as basic documentation of the interface exposed by the code in the .c file, even if there isn't a single comment in it. One way to approach the design of a C module is to write the header file first, and then implement it in one or more .c files.

Corollary: functions and data structures internal to the implementation of a .c file don't normally belong in the header file. You might need forward declarations, but those should be local and all variables and functions thus declared and defined should be static: if they are not a part of the interface, the linker shouldn't see them.
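
A minimal sketch of that approach, with hypothetical names: the header, written first, declares only the interface, and the internals of the .c file stay static:

// counter.h - the interface, written first
#ifndef COUNTER_H
#define COUNTER_H
void counter_increment(void);
int  counter_value(void);
#endif

// counter.c - the implementation
#include "counter.h"

static int count;              // internal state: not declared in the header

static int clamp(int v)        // internal helper: static, invisible to the linker
{
    return v > 1000 ? 1000 : v;
}

void counter_increment(void) { count = clamp(count + 1); }
int  counter_value(void)     { return count; }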

Kuba hasn't forgotten Monica
8

The main reason is compilation time. Compiling one small file when you change it takes only a short time. If, however, you compiled the whole project whenever you changed a single line, then you would compile, for example, 10,000 files each time, which could take a lot longer.

If you have, as in the example above, 10,000 source files and compiling one takes 10 ms, then after changing a single file the whole project builds incrementally either in (10 ms + linking time) if you compile just the changed file, or in (10 ms * 10,000 + short linking time, i.e. about 100 seconds plus linking) if you compile everything as a single concatenated blob.

Peter Mortensen
Freddie Chopin
8

While you can still write your program in a modular way and build it as a single translation unit, you will miss out on all the mechanisms C provides to enforce that modularity. With multiple translation units you have fine control over your modules' interfaces by using e.g. the extern and static keywords.

By merging your code into a single translation unit, you will miss any modularity issues you might have because the compiler won't warn you about them. In a big project this will eventually result in unintended dependencies spreading around. In the end, you will have trouble changing any module without creating global side-effects in other modules.
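
A minimal sketch, with hypothetical names, of the kind of issue that stays hidden: built as separate translation units, network.c below does not build (its static write_raw is declared and called but never defined in that unit), so the unwanted dependency is caught; concatenated into one file after log.c, it compiles silently and quietly reaches into the log module's internals:

// log.c
#include <stdio.h>

static void write_raw(const char *s)   // internal detail of the log module
{
    fputs(s, stderr);
}

void log_message(const char *s)        // the intended public entry point
{
    write_raw(s);
}

// network.c
void log_message(const char *s);       // fine: uses the public interface
static void write_raw(const char *s);  // sneaky: tries to reach the log module's internals

void network_report(void)
{
    log_message("sending\n");
    write_raw("this call only works if everything ends up in one translation unit\n");
}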

Dmitry Grigoryev
4

If you put all of your includes in one place you would only have to define what you need once, rather than in all your source files.

That's the purpose of .h files, so you can define what you need once and include it everywhere. Some projects even have an everything.h header that includes every individual .h file. So, your pro can be achieved with separate .c files as well.
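
A minimal sketch of such an umbrella header, with hypothetical header names:

// everything.h - hypothetical umbrella header
#ifndef EVERYTHING_H
#define EVERYTHING_H

#include <stdio.h>
#include <stdlib.h>

#include "foo.h"        // each module still has its own header...
#include "bar.h"
#include "baz.h"        // ...but users only ever include everything.h

#endif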

This means that I don't have to write a header file for each function I create [...]

You're not supposed to write one header file for every function anyway. You're supposed to have one header file for a set of related functions. So your con is not valid either.

Peter Mortensen
DepressedDaniel
2

This means that I don't have to write a header file for each function I create (because they're already in the main source file) and it also means I don't have to include the standard libraries in each file I create. This seems like a great idea to me!

The pros you noticed are actually a reason why this is sometimes done on a smaller scale.

For large programs, it's impractical. As other good answers mentioned, this can increase build times substantially.

However, it can be used to break up a translation unit into smaller bits, which share access to functions in a way reminiscent of Java's package accessibility.

The way the above is achieved involves some discipline and help from the preprocessor.

For example, you can break your translation unit into two files:

// a.c

static void utility() {
}

static void a_func() {
  utility();
}

// b.c

static void b_func() {
  utility();
}

Now you add a file for your translation unit:

// ab.c

static void utility();

#include "a.c"
#include "b.c"

And your build system doesn't build either a.c or b.c, but instead builds only ab.o out of ab.c.

What does ab.c accomplish?

It includes both files to generate a single translation unit, and provides a prototype for utility, so that the code in both a.c and b.c can see it, regardless of the order in which they are included, and without requiring the function to be extern.

StoryTeller - Unslander Monica