When C was originally designed, if `arr` was an array of some type `T` occupying `N` bytes, an expression like `arr[i]` meant "take the base address of `arr`, add `i*N` to it, fetch `N` bytes at the resulting address, and interpret them as a `T`". If every possible combination of `N` bytes would have a meaning when interpreted as a type `T`, fetching an uninitialized array element may yield an arbitrary value, but the behavior would otherwise be predictable. If `T` is a 32-bit type, an attempt to read an uninitialized array element of type `T` would yield one of at most 4294967296 possible behaviors; such an action would be safe if and only if every one of those 4294967296 behaviors would meet the program's requirements. As you note, there are situations where such a guarantee is useful.
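As a concrete illustration of that address-arithmetic reading of `arr[i]`, here is a minimal sketch (the array, the index, and the `uint32_t` element type are my own choices, and the array is initialized so that the program itself stays well defined):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    uint32_t arr[4] = {10, 20, 30, 40};
    size_t i = 2;

    /* The original model of arr[i]: take the base address of arr, add
       i * sizeof(uint32_t) bytes to it, fetch that many bytes at the
       resulting address, and interpret them as a uint32_t. */
    uint32_t fetched;
    memcpy(&fetched,
           (const unsigned char *)arr + i * sizeof arr[0],
           sizeof fetched);

    printf("arr[%zu] = %u, byte-arithmetic fetch = %u\n",
           i, (unsigned)arr[i], (unsigned)fetched);
    return 0;
}
```

Under the original semantics the two values printed always match; the rest of this answer is about why the Standard no longer promises as much once uninitialized or recycled storage is involved.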
The C Standard, however, describes a semantically weaker language which does not guarantee that an attempt to read an uninitialized array element will behave in a fashion consistent with any bit pattern the storage might have contained. Compiler writers want to process this weaker language, rather than the one Dennis Ritchie invented, because it allows them to apply a number of optimizations without regard for how they interact. For example, if code performs `a=x;` and later performs `b=a;` and `c=a;`, and if a compiler can't "see" anything between the original assignment and the later ones that could change `a` or `x`, it could omit the first assignment and change the latter two assignments to `b=x;` and `c=x;`. If, however, something happens between the latter two assignments that would change `x`, that could result in `b` and `c` getting different values, something that should be impossible if nothing changes `a`.
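Written out by hand, that substitution looks something like the sketch below (the names `x`, `mystery`, `source_form`, and `optimized_form` are all hypothetical; a real compiler would only perform the substitution when its analysis concluded that the intervening code cannot touch `x` or `a`):

```c
#include <stdio.h>

static int x = 1;

/* Hypothetical intervening call; imagine the compiler has (wrongly)
   concluded that it cannot modify x. */
static void mystery(void) { x = 2; }

/* The code as written: b and c are both copies of a, so they must match. */
static int source_form(void) {
    int a, b, c;
    a = x;
    b = a;
    mystery();
    c = a;
    return b == c;   /* always 1 in this form */
}

/* After the substitution: a is gone and both later reads go straight to x,
   so a change to x between them makes b and c differ. */
static int optimized_form(void) {
    int b, c;
    b = x;
    mystery();
    c = x;
    return b == c;
}

int main(void) {
    x = 1;
    printf("as written:  b==c is %d\n", source_form());
    x = 1;
    printf("substituted: b==c is %d\n", optimized_form());
    return 0;
}
```

The substitution is sound only as long as the compiler's assumption about the intervening code holds; the storage-recycling example below shows how another transformation can quietly break that assumption.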
Applying that optimization by itself wouldn't be a problem if nothing changed `x` that shouldn't. On the other hand, consider code which uses some allocated storage as type `float`, frees it, re-allocates it, and uses it as type `int`. If the compiler knows that the original and replacement requests are the same size, it could recycle the storage without actually freeing and reallocating it. That could, however, cause the code sequence:
float *fp = malloc(4);
...
*fp = slowCalculation();
somethingElse = *fp;
free(fp);
int *ip = malloc(4);
...
a=*ip;
b=a;
...
c=a;
to get rewritten as:
float *fp = malloc(4);
...
startSlowCalculation(); // Use some pipelined computation unit
int *ip = (int*)fp;
...
b=*ip;
*fp = resultOfSlowCalculation(); // ** Moved from up above
somethingElse = *fp;
...
c=*ip;
It would be rare for performance to benefit much from processing the result of the slow calculation between the writes to `b` and `c`. Unfortunately, compilers aren't designed in a way that would make it convenient to guarantee that a deferred calculation wouldn't, by chance, land in exactly the spot where it would cause trouble.
Personally, I regard compiler writers' philosophy as severely misguided: if a programmer in a certain situation knows that a guarantee would be useful, requiring the programmer to work around its absence imposes a significant cost with 100% certainty. By contrast, a requirement that compilers refrain from optimizations predicated on the lack of that guarantee would rarely cost anything, since the code written to work around its absence would almost certainly block the "optimization" anyway. Unfortunately, some people seem more interested in optimizing the performance of source texts that don't need guarantees beyond what the Standard mandates than in optimizing the efficiency with which a compiler can generate code to accomplish useful tasks.
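To make that cost concrete, here is a sketch of the kind of workaround the last paragraph alludes to: since the Standard won't let a program rely on recycled storage still holding a `float`'s bit pattern, the bytes have to be copied explicitly (the function name is hypothetical, and the size assumption mirrors the `malloc(4)` calls in the example above):

```c
#include <string.h>

/* Hypothetical workaround: instead of hoping that storage recycled by
   free()/malloc() still holds the float's bits, copy them explicitly.
   The copy is an unconditional cost, and it is also exactly the kind of
   code that keeps the troublesome reordering above from mattering. */
int float_bits(float value) {
    _Static_assert(sizeof(int) == sizeof(float),
                   "assumes the 4-byte sizes used in the example above");
    int bits;
    memcpy(&bits, &value, sizeof bits);  /* well defined under the Standard */
    return bits;
}
```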