3

I can understand that:

  • One of the reasons for UB is performance (e.g. it allows removing never-executed code, such as if (i+1 < i) { /* never_executed_code */ }; UPD: if i is a signed integer).
  • UB can be triggered at compile time because C does not clearly distinguish between compile time and run time. The "whole language is based on the (rather unhelpful) concept of an 'abstract machine'" (link).
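
As a minimal sketch of the first bullet (the function names are hypothetical, chosen only for illustration): for a signed `int`, `i + 1 < i` can only be true after an overflow, which is undefined, so an optimizer may assume the test is always false. For `unsigned int`, wraparound is defined and the same test is meaningful.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: a naive overflow test. For signed int, an
   overflowing i + 1 is undefined behavior, so a compiler is allowed
   to fold this whole expression to 0. */
int naive_signed_check(int i)
{
    return i + 1 < i;
}

/* Well-defined alternative: compare against the limit before adding. */
int safe_signed_check(int i)
{
    return i == INT_MAX;
}

/* Unsigned arithmetic wraps modulo 2^N, so this test is well defined. */
int unsigned_wraps(unsigned int u)
{
    return u + 1u < u;
}

/* Usage sketch: the naive check is only called with a value that
   cannot overflow, so nothing here relies on undefined behavior. */
void check_examples(void)
{
    assert(naive_signed_check(0) == 0);
    assert(safe_signed_check(INT_MAX) == 1);
    assert(unsigned_wraps(UINT_MAX) == 1);
}
```

Calling `naive_signed_check(INT_MAX)` is deliberately avoided: its result is not something the standard lets you rely on.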

However, I cannot yet understand why the C preprocessor is subject to undefined behavior. It is known that preprocessing directives are executed at compile time.

Consider C11, 6.10.3.3 The ## operator, 3:

If the result is not a valid preprocessing token, the behavior is undefined.
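
To illustrate the quoted sentence, here is a small sketch. The macro name `PASTE` and the helper functions are assumptions made for this example. The first three pastes each form a single valid preprocessing token; the last one is left commented out because `..` is not a preprocessing token.

```c
#include <assert.h>

/* Hypothetical helper macro; the name PASTE is an assumption. */
#define PASTE(a, b) a ## b

/* 12 ## 34 forms the single pp-number 1234: valid. */
static_assert(PASTE(12, 34) == 1234, "pasting two pp-numbers");

/* foo ## bar forms the identifier foobar: valid. */
static int foobar = 7;

int read_pasted_identifier(void)
{
    return PASTE(foo, bar);
}

/* + ## = forms the punctuator +=: valid. */
int add_assign(int x, int y)
{
    x PASTE(+, =) y;
    return x;
}

/* PASTE(., .) would form "..", which is not a preprocessing token,
   so C11 6.10.3.3p3 leaves the behavior undefined. */
/* int bad = PASTE(., .); */
```

Comments further down this page report that implementations differ on the commented-out case: some reject it, some warn, and some accept it silently.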

Why not make it a constraint? For example:

The result shall be a valid preprocessing token.

The same question goes for all the other "the behavior is undefined" in 6.10 Preprocessing directives.

pmor
  • Speculation: certain problems are simply too difficult to solve (or impossible). If some constraint reduces to such a problem, it is better to leave the behavior undefined than to force the compiler to solve it. – Eugene Sh. Jan 12 '22 at 15:35
  • You have predicated your question on a bad answer. The use of an abstract machine is extremely helpful and is a common mathematical tool in working with the semantics of programming languages. You should base your question on the C standard, historical practice, and so on, not bad opinions. – Eric Postpischil Jan 12 '22 at 15:36
  • Everything outside the scope of the C standard is UB. It would be hard to write a fully deterministic language with no unknowns. – Lundin Jan 12 '22 at 15:46
  • Those instances act as extension points. Because the standard doesn’t define the behaviour in those cases, implementations are free to define it instead, and still claim compliance with the standard. GCC [took advantage of this](https://gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html). – user3840170 Jan 12 '22 at 15:53
  • `if (i+1 < i)` depends on the signedness of `i` (and signed int overflow is undefined) – wildplasser Jan 12 '22 at 16:10
  • I think I agree with the original asker - e.g. C11, 6.10.3.3 could've been "the output of the preprocessor is fully defined, in all cases, but whether the preprocessor's output makes sense for the compiler is another matter entirely". – Brendan Jan 12 '22 at 16:18
  • @Brendan This breaks the existing paradigm where the preprocessor runs first and the compiler then takes its output. This new constraint would force another pass and complicate the implementation. – Eugene Sh. Jan 12 '22 at 18:18
  • @EugeneSh.: I don't see how defining the preprocessor's behavior could prevent a pipelined implementation (e.g. single-pass, where the preprocessor feeds its output one logical line at a time to the compiler to compile immediately and/or in parallel). – Brendan Jan 13 '22 at 01:11
  • @Lundin Can you please tell more on your opinion that the concept of an abstract machine and the formal definition of evaluation is "rather unhelpful"? – pmor Jan 13 '22 at 18:17
  • @pmor Take [this discussion](https://stackoverflow.com/a/58697222/584518) about whether volatile must act as a memory barrier or not. When reading the rules of the abstract machine, then it definitely must. Yet most compilers have implemented memory barriers and instruction re-ordering differently. Therefore the text about the abstract machine is unhelpful, because the part about what optimizations a compiler is allowed to do is all too vague and fuzzy. Whenever the standard says something and then all compilers implement it differently, there's usually a problem in the standard. – Lundin Jan 14 '22 at 08:05
  • @Lundin: The Standard waives jurisdiction over many matters as "quality of implementation" issues. If many compilers handled them differently, but each did so in a manner that would optimally serve its users, such waiver would be a good thing. The real problem is that some compilers interpret such waivers of jurisdiction as invitations to pretend that useful but non-portable programs are "broken" rather than recognizing that a quality general-purpose implementation should be expected to meaningfully process a much wider range of programs than mandated by the Standard. – supercat Jan 17 '22 at 18:25
  • @Lundin A simple question: if the hosted implementation supports threads, then does it mean that the abstract machine is multithreaded? – pmor Jan 17 '22 at 20:32

3 Answers

5

Why is the C preprocessor a subject of undefined behavior?

When the C standard was created, there were some existing C preprocessors and there was some imaginary ideal C preprocessor in the minds of standardization committee members.

So there were these gray areas, where committee members weren't completely sure what they would want to do and/or existing C preprocessor implementations differed from each other in behavior.

So the behavior in these cases is left undefined: the committee members were not completely sure what the behavior should be, so the standard places no requirement on it.

One of the origins of the UB

Yes, one of.

UB may exist to make the language easier to implement. For example, in the case of the preprocessor, the preprocessor writers don't have to care about what happens when the result of ## is an invalid preprocessing token.

Or UB may exist to reconcile existing implementations with different behaviors, or to serve as a point for extensions. So a preprocessor that segfaults in case of UB, a preprocessor that accepts and works in case of UB, and a preprocessor that formats your hard drive in case of UB can all be standard conformant (but I wouldn't want to work on the one that formats your drive).

KamilCuk
  • Why not formalize the C preprocessor via, for example, [Dave Prosser's C Preprocessing Algorithm](https://www.spinellis.gr/blog/20060626/) (or similar)? – pmor Jan 13 '22 at 18:26
  • @pmor Why?!? What **problem** would that solve? There's no need to standardize things that "only come into play when the expansion process is abused" when the answer to "Doctor, it hurts when I do this" is "Don't do that". – Andrew Henle Jan 14 '22 at 18:48
  • @pmor `Why not` there are better things to do with your life - spending time with family, exercise, sleep. Out of my personal curiosity, is there any specific reason why you are asking so many vague yet specific questions about specific verses in some document? Is there a specific reason or point in all this? Why, after so much time, answers and research spent on this forum, do you still seem unable to answer at least some of your questions? Are you solving any actual real-life problems with those questions? – KamilCuk Jan 14 '22 at 20:06
  • In brief: I fix bugs / test C compilers. Often I need to understand the details, rationales, etc. If these questions are inappropriate / undesirable for stackoverflow.com, then what website / forum can you recommend? – pmor Jan 14 '22 at 21:59
  • Re: "What problem would that solve?" Eliminating ambiguities. Formal description is aimed to eliminate the ambiguities. FYI: The algorithm was used by X3J11 (ANSI C standard) committee as a basis for the standard's wording. However, I don't know why X3J11 didn't use it directly in the standard. – pmor Jan 14 '22 at 22:12
  • With C preprocessor you may (unintentionally) trigger UB, which may format your HDD. With formally defined C preprocessor this is impossible, because there is no UB. – pmor Jan 15 '22 at 00:33
  • @pmor: What should the standard say about a compiler where ``#include `./woozle` `` would execute a program called `woozle` and behave as though its output were inserted into the source file? It's not hard to imagine such a feature being useful, but there would be no way the Standard could say anything about the possible consequences of compiling arbitrary source texts on such a compiler. – supercat Jan 17 '22 at 16:09
  • @pmor: Compilers are allowed to specify how they will behave in cases where the Standard does not. If the designers of a compiler expect it to process a certain construct a certain way, a test suite for that compiler should confirm that behavior. If the compiler doesn't behave as expected, that would suggest that there is probably a bug somewhere, regardless of whether the Standard would say anything about such behavior. – supercat Jan 17 '22 at 16:12
  • @supercat: exactly. Furthermore, "undefined behaviour" whose manifestation is to format the hard drive is worth reporting as a bug, even though the C standard doesn't prohibit it. (Having said that, the hypothetical ``#include `./run-a-random-program` `` might format your hard-drive, depending on the contents of `run-a-random-program`, so the bug report might not be accepted in all cases.) As a much more mundane example, would one want to ban an implementation which chose to allow `.##.##.` as constructing the token `...`? – rici Jan 17 '22 at 16:31
  • @rici: For what purposes would an implementation that treats `##` as though the left and right operands appeared consecutively without whitespace be superior to one that imposes additional requirements? If some pre-standard implementations were unable to recognize 0x1e-3 as equivalent to 30-3, I could see the Standard cutting them some slack, but is any purpose other than validation of compatibility with inferior implementations served by making compiler writers that handled that construct without difficulty change their compilers to reject it? – supercat Jan 17 '22 at 16:59
  • @supercat: it's not about whitespace. The problem with `. ## . ## .` is that `..` is not a valid token, so neither evaluation order for `##` can produce `...`. However, it's possible to imagine an implementation which special-cases `. ## .` as a kind of pseudo-token, so that the second `##` will succeed. (Why not? If that's what the compiler writer wants to do.) As a vaguely related case, `3e ## - ## 2` works on most implementations (https://coliru.stacked-crooked.com/a/2eb34b972798d1bf) but it's UB because it relies on the order of performing `##`. – rici Jan 17 '22 at 17:06
  • @rici: It's even easier to imagine implementations that wouldn't try to evaluate whether something preceded or followed by ## was a token, and thus wouldn't care whether it was or wasn't, since that's how compilers used to work before the Standard required them to add additional needless complexity. – supercat Jan 17 '22 at 17:08
  • @rici: Sorry--I misunderstood which side of the argument you were taking. A major problem with the C Standard is that the Committee is populated by at least three opposing factions, of which it's only possible for a usable standard to satisfy two. Many things in the Standard are left undefined not because there was any consensus that they should be considered erroneous, but rather because there wasn't a consensus that they shouldn't be. – supercat Jan 17 '22 at 17:17
  • @rici: I think the confusion stemmed from your line about "not about whitespace". For many pre-standard compilers, white space is what distinguished `a+ ++b` from `a++ +b`, and if the preprocessor formed `a+##++b`, they would have no reason to know nor care that the two plus signs immediately preceding `b` formed a token. This behavior is inconsistent with the Standard, but would in many cases be a more useful way to handle `##`. – supercat Jan 17 '22 at 17:27
  • @supercat: maybe so. That was basically my point about `. ## . ## .`; it's arguably more useful to allow that to construct `...` than to flag it as an error, but anyway the UB in §6.10.3.3p3 is not a constraint, so the compiler-writer is free to do either. That's an instance of "Compilers are allowed to specify how they will behave in cases where the standard does not" (in other words, unspecified sometimes just means unspecified), a statement with which I am in complete agreement. Not that my opinion matters much. – rici Jan 17 '22 at 17:47
  • @rici If the `preprocessing-token` is formally defined (C11, 6.4 Lexical elements, Syntax, 1), then the "valid preprocessing token" can be determined. Are you aware of (apparently existing) examples where, in gray areas, existing C preprocessor implementations differ from each other in behavior w.r.t. determining a "valid preprocessing token"? – pmor Jan 18 '22 at 14:42
  • @rici Re: "would one want to ban an implementation which chose to allow `.##.##.` as constructing the token `...`": FYI: Both Intel and Microsoft C compilers allow. – pmor Jan 18 '22 at 14:46
  • @pmor: I knew that was the historic behaviour of MSVC, but I was under the impression that it had been changed with the newer preprocessor. Anyway, my argument was that accepting that construct is not problematic. Your proposal would make it a violation of the standard. (There are other problems with the old MS preprocessor, and I believe that MS accepts that it violates the standards, which is why there is a new one.) – rici Jan 18 '22 at 15:16
  • Anyway, it seems to me that this question, like many [tag:language-lawyer] questions of the form "Why isn't the standard the way I think it should be?", is clearly opinion-based and therefore well outside of the guidelines for SO. – rici Jan 18 '22 at 15:20
  • @pmor: by the way, just to expand a bit on your observation. I tried the construct (using http://gcc.godbolt.com); with the latest MSVC (using C++, because the C version is outdated) and supplying `/Zc:preprocessor` to activate the new preprocessor, which produced the warning: `warning C5103: pasting '.' and '.' does not result in a valid preprocessing token`. But it still compiled, unlike gcc and clang. ICC, as you say, issues no complaints, but ICX (which I guess is clang-based) produces the expected error message. – rici Jan 18 '22 at 15:58
  • @supercat Re: "``#include `./woozle` ``": C11: "The directive resulting after all replacements shall match one of the two previous forms". Is there a potential difficulty in determining whether `` ` ./woozle ` `` is a "valid preprocessing token"? – pmor Jan 18 '22 at 16:37
  • @rici: Many "why is X a certain way" questions can be treated as a combination of two concrete questions: Did the creator of X document why it is the way it is, and would alternatives to X have problems which X avoids. The most useful purpose for knowing why something was done is often to know which alternatives that might seem worth exploring, aren't. – supercat Jan 18 '22 at 16:55
  • @pmor: Change that to ``#include "`./woozle`"`` if you like. My point is that the C Standard makes no attempt to contemplate all of the ways that implementations might process various corner cases that might be unusual but would help their users accomplish what needs to be done. – supercat Jan 18 '22 at 18:31
  • @rici Correction: MSVC under `/std:c11` produces `warning C5103: pasting '.' and '.' does not result in a valid preprocessing token`. – pmor Jan 19 '22 at 11:35
  • @rici Re: "opinion-based": the reason for making signed integer overflow UB is clear: a performance increase from removing never-executed code, such as `if (i+1 < i) { /* never_executed_code */ };` (if `i` is a signed integer). It is expected that the reasons for making certain C preprocessor behaviors undefined are clear as well. – pmor Jan 26 '22 at 20:45
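
The `3e ## - ## 2` case raised in the comments above can be sketched as follows. The macro names `CAT`, `XCAT`, and `CAT3` and the helper functions are assumptions made for illustration. C11 6.10.3.3p3 leaves the order of evaluation of `##` operators unspecified, so a single macro with two `##` operators cannot portably rely on left-to-right pasting. One way to avoid depending on that order, shown here as a sketch rather than a recommendation, is to force it with nested macro calls so that every intermediate result is a valid pp-number.

```c
#include <assert.h>

#define CAT(a, b)  a ## b
#define XCAT(a, b) CAT(a, b)   /* expands its arguments before pasting */

/* Order-dependent form, defined but deliberately never invoked:
   CAT3(3e, -, 2) is fine if the left ## runs first (3e- then 3e-2),
   but pasting - and 2 first does not yield a single valid token. */
#define CAT3(a, b, c) a ## b ## c

/* Forced order: the inner XCAT is fully expanded before the outer paste.
   3e  ## -  ->  3e-   (a valid pp-number)
   3e- ## 2  ->  3e-2  (a valid pp-number) */
static const double pasted = XCAT(XCAT(3e, -), 2);

double pasted_value(void)
{
    return pasted;
}

/* Usage sketch: the pasted constant should be close to 0.03. */
void check_pasted(void)
{
    assert(pasted_value() > 0.0299 && pasted_value() < 0.0301);
}
```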
2

Suppose a file which is read in via an #include directive ends with the partial line:

#define foo bar

Depending upon the design of the preprocessor, it's possible that the partial token bar might be concatenated to whatever appears at the start of the line following the #include directive, or that whatever appears on that line will behave as though it were placed on the line with the #define directive, but with a whitespace separating it from the token bar, and it would hardly be inconceivable that a build script might rely upon such behaviors. It's also possible that implementations might behave as though a newline were inserted at the end of the included file, or might ignore the last partial line of such a file.
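
C11 5.1.1.2p2 requires a non-empty source file to end in a new-line character that is not immediately preceded by a backslash, which is why the partial-line case above falls outside what the Standard defines. A build script that wants to avoid depending on any of these behaviors could check for it. The function below is a hypothetical lint helper written for this illustration, not part of any real tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lint helper: reports whether a source buffer ends in a
   complete line, i.e. is empty or ends in '\n' that is not immediately
   preceded by a backslash. */
bool ends_in_complete_line(const char *text, size_t len)
{
    if (len == 0)
        return true;    /* an empty file is fine */
    if (text[len - 1] != '\n')
        return false;   /* partial last line, e.g. "#define foo bar" */
    if (len >= 2 && text[len - 2] == '\\')
        return false;   /* the final new-line would be spliced away */
    return true;
}

/* Usage sketch with the partial line from the example above. */
void check_partial_line(void)
{
    assert(!ends_in_complete_line("#define foo bar", 15));
    assert(ends_in_complete_line("#define foo bar\n", 16));
}
```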

Any code which relied upon one of the former behaviors would clearly have been non-portable, but if code exploited such behavior to do something that would otherwise not be practical, such code would hardly be "erroneous", and the authors of the Standard would not have wanted to forbid an implementation that would process it usefully from continuing to do so.

When the Standard uses the phrase "non-portable or erroneous", that does not mean "non-portable, therefore erroneous". Prior to the publication of C89, C implementations defined many useful constructs, but none of them were defined by "the C Standard" since there wasn't one. If some implementations defined the behavior of some construct, others didn't, and the Standard left the construct as "Undefined", that would simply preserve the status quo where implementations that chose to define a useful behavior would do so, those that chose not to wouldn't, and programs that relied upon such behaviors would be "non-portable", working correctly on implementations that supported the behaviors, but not on those that didn't.

supercat
-1

Without getting into specifics, my guess is, there exist several preprocessor implementations which have bugs, but the Standard doesn't want to declare them non-conforming, for compatibility reasons.

In human language: if you write a program which has X in it, the preprocessor does weird stuff.

In standardese: the behavior of a program with X is undefined.

If the standard says something like "The result shall be a valid preprocessing token", it might be unclear what "shall" means in this context.

  • The programmer shall write the program so this condition holds? If so, the wording with "undefined behavior" is clearer and more uniform (it appears in other places too)
  • The preprocessor shall make sure this condition holds? If so, this requires dedicated logic which checks the condition; may be impractical to implement.
anatolyg
  • What do you base this guess on? Getting into specifics would help. – John Kugelman Jan 12 '22 at 18:33
  • Rather than saying "that have bugs", I would say "that handle certain constructs differently from other implementations". If e.g. a header file doesn't end with a newline, the Committee could have tried to guess at all of the ways that implementations might behave that might be useful in some situations, but if an implementation did something useful that the Committee hadn't anticipated, the Committee wouldn't want to be seen as forbidding that. – supercat Jan 18 '22 at 17:02