2

Here's what I mean: i++ + i++ is undefined, and so are out-of-bounds writes to an array.

The undefinedness of out-of-bounds array writes is understandable: it can potentially be exploited to run arbitrary code, which is as undefined as you can get. Let's call this runtime undefined behavior.

For i++ + i++, however, the story seems to be different. Let's say a compiler generates something. It's undefined what exactly. Very undefined. In fact, it's so undefined that we used to hear it could get cats pregnant (although more recently — as of CppCon 2016, I think — people started to realize that undefined behavior can't get a cat pregnant after all).

However, once we open the box and see what the compiler has generated, and it's not exploitable code (injection, data race, etc. — for example, the compiler chose to throw i++ + i++ away altogether), isn't that exactly what will be executed? Isn't it perfectly defined from that point on?

In other words, this last case is what we can call compile-time undefined behavior. In cat terms, it's similar to Schrödinger's cat, whose state is unknown until you open the box (see the generated assembly), at which point you see the actual reality to be executed. (I wonder if undefined behavior can make a poisoned dead cat pregnant.)

Of course, undefined behavior is a legal term meant for the standard. The question is about the "behavior" that happens in reality.

Leo Heinsaar
  • From the perspective of the C++ standard none of this matters. – jotik Feb 22 '17 at 10:42
  • "isn't it perfectly defined from that point on" - NO , you cannot put the cat back in the box – M.M Feb 22 '17 at 10:44
  • 1
    Undefined is... well, "undefined" for the C++ language standard. Of course compilers and library implementations can implement a well-defined behavior in some or most cases, but it's still formally undefined behavior for the standard. – roalz Feb 22 '17 at 10:51
  • Suppose you are using your computer to keep track of when it is time to give the pills to your cats. Of course a miscalculation here can increase the risk of getting kittens. – Bo Persson Feb 22 '17 at 11:11

5 Answers

6

"Undefined behavior" is a term that applies to standards. If the standard says the behavior is undefined (or doesn't define it, directly or indirectly) then the implementation of the standard may take any action its authors wish - or let things go their own course, and result in a litter of kittens if so the fate decided. From standard's point of view this is a binary, defined/undefined, no "more" or "less" defined.

Now, besides standards, you have the reality:

So: if a behavior is undefined by the standard, the smart, friendly compiler will generate an error or a warning (or a runtime error and a compile-time warning). It's allowed to do so; after all, nothing forbids such a reaction!

A slightly less friendly and smart, but still rather friendly and smart, compiler will try to generate code that causes the least harm. Say, if you didn't initialize your pointer, it will still initialize it to whatever makes the most sense, be it nullptr, or the only existing instance of the pointed-to class, or such. This isn't so smart, as it obscures bugs, which may bite you if you, say, switch to a different compiler, or the behavior gets standardized to something other than what that compiler produced, or it goes against your intentions. Still, it's an approach, and not an illegal one. Some higher-level languages like JavaScript tend to go that way, trying really hard to make sense of faulty code.

Also, if the reaction definitely makes good sense and "saves the day", doing what the author wanted in 99.999% of cases, other authors will likely start implementing it too (even if just as a hint for a fix in the warning message) and it may eventually make its way into the standard.

And last but not least, the compiler may simply apply the syntactic rules appropriate to the separate parts and produce something, likely something that doesn't make much sense. It's unlikely to result in kittens, and you're absolutely not guaranteed the result will be repeatable, but usually it will be. Say, calling `delete` on a pointer to something that wasn't created with `new` will almost always result in a segmentation fault. But that's "only" practice, and not guaranteed by any means.

You're unlikely to actually encounter software that would go out of its way to cause definitely undesired or absolutely unrelated results in response to undefined behavior, so don't spay your cat just yet, but the standard doesn't forbid this. If the author of your software had a really twisted sense of humor, you might face something more creative. Like the PAM system on Unix which, upon finding the passwd file (which holds all the user account definitions) missing, tells you at login: "You don't exist. Go away."

SF.
  • For many kinds of behavior, a compiler which claims to be suitable for some purpose (e.g. systems programming) will endeavor to behave in a fashion appropriate to such purpose, whether the Standard requires it or not, especially if the Standard doesn't specify any means of achieving the behavior (since some application fields would not require it, and not all platforms could provide it) but it could easily be supported using a corner case of another behavior which is defined. On most platforms, for example, if a program would have defined behavior without a `volatile` qualifier on some... – supercat Feb 22 '17 at 20:23
  • ...variable, the presence of a `volatile` qualifier on the variable would not affect behavior in cases where the storage was not in fact accessed by outside means (it might, of course, affect optimizations). If a library declares a variable `volatile` purely because some clients may cause it to be accessed outside the compiler's control, a client which won't be accessing it in such fashion should be able to access it using code that uses non-qualified pointers, provided that the qualifier is cast away when needed. Such clients would fail if the variable *was* accessed by outside means, ... – supercat Feb 22 '17 at 20:36
  • ...but a library's use of `volatile` to accommodate the clients that need it shouldn't have to impose a major burden on clients that don't need that qualifier. – supercat Feb 22 '17 at 20:37
3

The meaning of "undefined" in the C++ standard is behaviour "such as might arise upon use of an erroneous program construct or erroneous data, for which this International Standard imposes no requirements". (This quote is taken directly from ISO/IEC 14882, the 1998 C++ standard.)

There is no provision for concepts of "more" or "less" undefined in this definition. If the standard imposes one or more requirements on the behaviour, the behaviour is not undefined.

Imposing no requirements and imposing one or more requirements are mutually exclusive, not a matter of degree.

Of course, an implementation may do anything it wants, including behaving consistently when presented with code that has some form of undefined behaviour. But what a compiler does has no effect on the standard at all. The standard is the basis for assessing correctness of an implementation (aka compiler, library, etc), not the reverse.

Peter
  • Does it say "erroneous", or does it say "erroneous *or non-portable*? A religion has evolved around the "erroneous", but I don't think anything in the Standard justifies it. – supercat Jun 19 '20 at 14:52
  • As I said, the first para in the answer was a direct quote (I neither paraphrased nor reworded) from the 1998 C++ standard (first sentence of definition 1.3.12) . It used the word "erroneous". [The relevant paragraph includes a couple more sentences and a note, The word "erroneous" is used three times in that paragraph, and "portable" or "non-portable" (or variations) not at all]. Later updates of the standard have their wordings tweaked, but all still use the word "erroneous" and do not use any variation of "portable". – Peter Jun 19 '20 at 15:17
  • Interesting, since the C Standard expressly uses the phrase "erroneous or non-portable", and the C++ Standard explicitly states that it merely defines what it means for an implementation to be conforming, and does not define any category of conformance for programs. – supercat Jun 19 '20 at 15:34
  • Further, the C++ Standard retains the language from the C Standard which states that one of the possible effects is "behaving during translation or program execution in a documented manner characteristic of the environment"--a concept which would make sense for correct but non-portable programs, but not so much for erroneous ones. – supercat Jun 19 '20 at 15:36
  • That's as may be. There are plenty of cases where the C and C++ standards each use distinct wording to describe the same concept. And cases where both standards (after their first ratified versions) have evolved in a way that has subsequently *introduced* different wordings or explicit incompatibilities with the other language. There are also now plenty of cases where something with undefined behaviour in one of C or C++ has become - due to an update on one of the standards - unspecified or implementation-defined behaviour in the other. – Peter Jun 20 '20 at 04:49
  • What fraction of non-trivial C++ programs for freestanding implementations refrain from doing anything the C++ Standard would characterize as UB? For C programs on most platforms the answer would be none, since all I/O on such platforms is done through constructs that behave "in a documented manner characteristic of the environment" without regard for whether the Standard requires them to do so, but perhaps C++ offers something better? – supercat Nov 30 '21 at 16:13
1

If some action invokes Implementation-Defined behavior, that means

  1. The standard defers to an implementer's judgment as to what behavior should result from that action; it makes no effort to forbid implementations from behaving in capricious fashion, but expects that quality implementations intended for a particular target platform and application field will behave in a fashion that would be reasonable for that platform and field.

  2. An implementation must specify and document a behavior that will result from the action, whether or not it would be practical to implement and document a behavior from which any code would likely benefit.

By contrast, if an action invokes Undefined Behavior, that means

  1. The standard defers to an implementer's judgment as to what behavior should result from that action; it makes no effort to forbid implementations from behaving in capricious fashion, but expects that quality implementations intended for a particular target platform and application field will behave in a fashion that would be reasonable for that platform and field.

  2. An implementation need not specify and document a behavior that will result from the action in cases where, in the implementer's judgment, there would not be any value in defining the behavior.

Some compilers don't interpret "Undefined Behavior" as an invitation to exercise judgment but instead as an indication that no judgment is required. When C89 was published, however, the term was widely understood as having a meaning which differed from Implementation-Defined Behavior only in cases where defining a particular behavior would have been impractical; I've seen no indication that later standards were intended to change that.

Many implementations make no effort to be suitable for all possible purposes. Code cannot be expected to run correctly on implementations which are not designed to be suitable for its purposes, and the fact that code fails when run on unsuitable implementations does not in any way imply that the code is defective unless it failed to specify the requirements an implementation must meet. While there are no standard-recognized categories of implementations and behaviors they should be expected to implement, common sense will go pretty far in cases where compiler writers attempt to exercise judgment.

supercat
1

I'm going to focus on practical differences in kinds of UB, not just how the ISO standard rates them.

Related: What Every C Programmer Should Know About Undefined Behavior (http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html) has some good stuff about how data-dependent UB allows compilers to assume things won't happen, optimizing the asm. (The ISO C standard has nothing to say about what should happen for paths of execution that do encounter UB, so whatever ends up actually happening in such cases is fine.) e.g. that's why for(int i = 0 ; i<=n ; i++) can be assumed to be non-infinite because signed int overflow isn't defined the way unsigned is. So n = INT_MAX would lead to UB. But the equivalent with unsigned is potentially infinite for n=UINT_MAX.

You're talking about compile-time-visible UB, though, which you could say is "more undefined" because compilers can notice it and do things on purpose. (For example, emit an illegal instruction to make the program fault if it gets to this point.)

Contrast this with cases where the compiler just optimizes according to any guarantees / assumptions it's allowed to make, for code where it doesn't notice the UB at compile time, or the UB only happens with some possible function args so the compiler still has to make asm that works for inputs that don't lead to UB in the abstract machine.

Some interesting examples of non-runtime-visible UB:

For example, in C++ (unlike C; see footnote 1) it's illegal for execution to fall off the end of a non-void function. Modern GCC and clang optimize accordingly, assuming such paths of execution are never reached and emitting no instructions at all for them, not even a ret.

Let's make some simple examples on Godbolt compiling for x86-64:

int x, y;   // global vars; compiler has to assume assigning to these is a visible side-effect that code in other compilation units could see.

int bad_int_func() {
    x = 0;   // gcc still stores but no ret
    y = 0;   // clang backtracks and emits no instructions for the whole block
    // return 0;
}

compiles like this with GCC11.2 -O2:

bad_int_func():
        mov     DWORD PTR x[rip], 0
        mov     DWORD PTR y[rip], 0
  # missing ret, there'd be one if the function was void

Clang is even more aggressive: we just get the label and no instructions. It didn't emit code for any of the earlier C++ statements in this basic block (sequence of code with no branches or branch targets) that ends by falling off the end.

And yes, both compilers warn about that, e.g. gcc warns twice (even without -Wall or -Wextra): warning: no return statement in function returning non-void [-Wreturn-type] and warning: control reaches end of non-void function [-Wreturn-type]

Where this gets more interesting is in a function with some branches so it's possible for it to be called safely, as part of a well-behaved program. (It would be poor style to write a function exactly like this, but with more complex things, maybe a switch or if(foo) ... return / if (bar) ... return / ..., there can be some paths that the compiler can't prove are never taken in the first place.)


int foo(int a){
    y = 0;

    if (a == 0) {
        y = 1;
        return 0;
    }
    // x = 2;  // if uncommented, GCC does branch.  clang doesn't care

    // return a;  // entirely changes the function vs. no return
}

The only legal path of execution through this function is with a==0, so GCC and clang simply assume that's the case, optimizing away the branch:

foo(int):                                # @foo(int)
        mov     dword ptr [rip + y], 1
        xor     eax, eax
        ret

Of course, this compiles very differently if you compile as C so it's legal to fall off the end without returning a value:

# clang13 -O2 -xc
foo:
        xor     eax, eax
        test    edi, edi
        sete    al                 # tmp = (a == 0)
        mov     dword ptr [rip + y], eax
        xor     eax, eax           # unconditionally return 0
        ret

Since the fall-through path doesn't have a return statement, it doesn't matter what's in the return-value register at that point. Setting it to 0 is what we need for the if() body's return statement, and just unconditionally doing that is much cheaper than compare-and-branching to see if we should zero it or not. (GCC doesn't spot that, and does branch over a y=1 store and an xor eax,eax. It unconditionally did the y=0 store first, even though it did tail duplication with two separate ret instructions :/)

Of course if inlining into a caller that did use the return value, then the same optimizations as with C++ would apply. Like if you put a __builtin_unreachable() there instead of a return.

Footnote 1: In C, it's only UB for the caller to use the return value of such a function, for historical reasons: before void was introduced to the language, every function had a return value, often implicit int. But people didn't bother to put a return 0; at the bottom if they didn't want a return value. (Not just "didn't bother"; saving those bytes of compiler-generated machine code was probably valuable on tiny old machines.)

Peter Cordes
-2

"see the generated Assembly, at which point you see the actual reality to be executed".

No, that's just another aspect of the Halting problem. Anything could be generated, given UB. You may have a bit of code on hand where you can't even tell if it will halt, let alone what the result would then be.

(Of course, the halting problem also exists within well-defined C++, but you don't have self-modifying C++.)

Now, why is this relevant? The Halting Problem doesn't say that all programs are unpredictable; in fact, many programs are perfectly predictable. It merely says that there are three classes of programs: those that provably halt, those that provably don't, and a third class for which it can't be determined. So the assumption that you can always determine the behavior of a program from its assembly is untrue. That undermines the logic which asserts that UB will still produce binaries with predictable behavior.

MSalters
  • Oh, great, downvoters who don't understand the answer but also don't ask for explanation. – MSalters Feb 22 '17 at 13:37
  • The fault in your answer lies in treating behavior of the computer after compilation as a black box. A deterministic computer will fully repeatably execute the same machine code, even if the code is nonsensical - and in a real (not fully deterministic) computer, at machine code execution level, far fewer behaviors are undefined. If the generated machine code doesn't cause any of UB as defined by the CPU standard, its behavior will be fully deterministic. And you can determine whether it does or not by analyzing the assembly code. – SF. Feb 24 '17 at 14:04
  • (of course you're not guaranteed the same assembly code on subsequent compilations of the same C++ program containing UB - but within a single compilation, its halting behavior can be determined for program not creating CPU level UB, and always determined if the CPU standard covers all possible behaviors, no UB present.) – SF. Feb 24 '17 at 14:06
  • @SF: No, you can't always determine for a random (assembly) program whether it halts. That is the surprising result of the Halting Problem. – MSalters Feb 24 '17 at 16:27
  • Turing's solution to Halting problem is only valid for "infinite memory". If memory is limited, then the complete number of states of the system is finite. If given complete state repeats, that means the program loops and won't stop without additional external input. If program stops, you know it stops. If state doesn't repeat - this will continue until all possible states are exhausted and the next state is a copy of a prior one! – SF. Feb 24 '17 at 16:41
  • @SF.: C++ programs can do I/O, so the program's internal state isn't all that matters. It can repeat but some future I/O might EOF or error, or read different data, leading to different behaviour. Still, looking at the generated asm is helpful/relevant on a very *small* / local scale for seeing how most compile-time-visible UB compiled, while overall halting is potentially a whole-program thing, so IDK how helpful this answer is. Esp. with this whole question seeming to ignore data-dependent or timing-dependent UB that isn't compile-time visible, like signed-int overflow or data-race UB. – Peter Cordes Nov 27 '21 at 03:08
  • @PeterCordes If we allow adding external data to the program at runtime, with the data not known at the time of analyzing the program for the halting condition, then "halting problem is unsolvable" becomes trivial. Have a program that loads and runs another, unknown one. Without knowing that other program you can't determine if it's halting or not. Or take user input, "terminate? Y/N" and halting depends on conditions external to the system, so it's indeterminate. The halting problem is only interesting if it only depends on initial conditions. – SF. Nov 27 '21 at 11:09
  • 1
    And while I agree the approach of "test all possible states" is only practically viable for the most simple of systems, it just challenges the notion that "anything" can happen in UB. A theoretical Computer Science specialist may claim that. A good hacker will know exactly what *will* happen at given UB on given system with given compiler. – SF. Nov 27 '21 at 11:13
  • @SF.: Oh right. That's probably another reason the Halting Problem isn't a great analogy, then. (Or being generous, a very loose analogy.) And yeah, agreed. It fully depends on the compiler, like in [Does the C++ standard allow for an uninitialized bool to crash a program?](https://stackoverflow.com/a/54125820) These days, GCC and clang sometimes choose to emit an illegal instruction on paths that have some types of compile-time-visible UB, or emit nothing (not even a `ret`) for a path that falls off the end of a non-`void` function in C++ (unlike C where it's only UB to use the result.) – Peter Cordes Nov 27 '21 at 11:20