Is accessing arrays out-of-bounds legal if what lies beyond those bounds is known in C? Why not and how can this be worked around?

Question

Take the following which works in GCC:

union Int2 {
    int i[2];
};
union Int4 {
    union Int2 i2[2];
};
union Int4 i4;
i4.i2[1].i[-1] = 10;
printf("%d\n", i4.i2[0].i[1]); // 10
static_assert(sizeof(Int4)==sizeof(int[4]), "Unexpected padding");

In this case, ~~the union is not allowed to have any padding, so what lies beyond those internal arrays is definitively known.~~ However according a comment this is technically undefined behavior and not legal.

The rule that specifies pointer arithmetic, C 2018 6.5.6 8, defines it only for arithmetic within an array (including the end position one beyond the last element and treating a single object as an array of one element). This creates a "pointer provenance" property; if p[x] has behavior defined by the C standard, it can refer only to elements of an array p points to. Compilers may use this to reduce pointer arithmetic when optimizing, and this reduction may break code that attempts to use indexing outside the actual array.

If this comment is correct, this answers the first question: No. Which brings us to my actual questions:

If arrays use pointer arithmetic, then doing so would just generate a pointer to the out-of-bounds but known value. Could someone explain what could be going on behind the scenes that could prevent this from working, despite this working in practice on GCC?
Is there any way to 'circumvent' this undefined behavior? Some undefined behavior can be circumvented trivially, such as signed overflow, by casting to an unsigned type before performing. Similarly, assuming I only have access to a pointer to the second element of an i2 member, is there any way to access the first i2 member and/or the constituent int values without invoking undefined behavior?

Related questions: 1. [One-dimensional access to a multidimensional array: is it well-defined behaviour?](https://stackoverflow.com/q/6290956/12149471) 2. [Multidimensional array out of bound access](https://stackoverflow.com/q/48219108/12149471) — Andreas Wenzel, Jun 13 '23 at 19:16
I believe a `union` needs more than one member to make any sense. Otherwise, its just a struct with a single field. `union Int2` only has field `i`, and `union Int4` only has `i2`, so there are no fields that overlap with each other. — abelenky, Jun 13 '23 at 19:26
@user16217248: "padding" exists *between* fields. If there is only one field, there is no opportunity for padding. — abelenky, Jun 13 '23 at 19:29
@user16217248 that's a sweeping statement. A union may contain structs, which may contain padding, and so a union may contain padding. And if its components are a different size, it *must* contain padding. — Weather Vane, Jun 13 '23 at 19:29
You also have "a pointer to the out-of-bounds but known value". A valid pointer may point to the next object, but what it points to is unknown and dereferencing it is undefined behaviour. — Weather Vane, Jun 13 '23 at 19:36
Consider this code: `int foo(int x) { int a[1] = { 4 }; return a[x];`. [Clang compiles this to `mov eax,0x4` / `ret`.](https://godbolt.org/z/dWGTz6drb) That is because the only defined behavior of the routine is to return 4 when `x` is zero, and, as far as the C standard is concerned, it does not matter what happens when `x` is any other value. So the compiler does not implement this by using pointer arithmetic… — Eric Postpischil, Jun 13 '23 at 19:48
… That explains why this reasoning, “If arrays use pointer arithmetic, then doing so would just generate a pointer to the out-of-bounds but known value,” is irrelevant. The compilers does not necessarily use pointer arithmetic to implement arrays. C code only tells the compiler what the program is supposed to do. The compiler chooses how to implement it. The C code might say to look up an element in an array, but that is only for the purposes of specifying what observable behavior the program must exhibit. The compiler may choose to implement it by other means. — Eric Postpischil, Jun 13 '23 at 19:49
You should generally not attempt to "play around" with undefined behavior. If a certain code path invokes undefined behavior, then, if the compiler detects this, the compiler is allowed to "optimize away" that entire code path, by treating it as unreachable. See this article for further information: [Undefined behavior can result in time travel (among other things, but time travel is the funkiest)](https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=633) — Andreas Wenzel, Jun 13 '23 at 19:50
That is a near-degenerate example, with an array of one element. But suppose, in your example, you had `i4.i2[1].i[x]` instead of `i4.i2[1].i[-1]`, and you ran the program with `x` equal to −1 at some point. While compiling, the compiler can see that the only values of `x` for which the program behavior are defined are nonnegative values. So, even if it compiles `i4.i2[1].i[x]`, it could also assume `x` is nonnegative, and it may use this as a premise in optimizing later code. For example, the compiler might assume `x` is nonnegative when optimizing some comparison expression involving it. — Eric Postpischil, Jun 13 '23 at 19:59
*so what lies beyond those internal arrays is definitively known* Oh? And just how is it "definitvely known"? — Andrew Henle, Jun 13 '23 at 20:00
There are some exceptions to using pointer arithmetic outside of an array—notably if we are inside another array. Specifically, if we have a character pointer into a structure or other object, we can move that pointer around (by pointer arithmetic) to any of the bytes in the object, even if that goes beyond an embedded array the pointer initially pointed into. This also means that a routine compiled in isolation with no link-time optimization must implement character pointer arithmetic via actual arithmetic, not optimizations, that work with any size object it could have been passed. — Eric Postpischil, Jun 13 '23 at 20:08
@EricPostpischil: Annex J2 states that given `unsinged char arr[5][3]`;, an attempt to access `arr[0][6]` would invoke UB even though the Standard mandates that `arr[2][0]` would be placed six bytes above `arr[0][0]`, and clang and/or gcc will sometimes process a read of `arr[0][i]` in a way that may arbitrarily corrupt the contents of unrelated storage if `i` would be outside the range 0..2. — supercat, Jun 20 '23 at 17:00
@supercat: No, it does not say that (that may be your interpretation, but it is not what the standard says), and, even if it did, J.2 is a non-normative summary. The normative text is controlling and provides for using character pointers to access the entirety of an object, including an array of arrays. — Eric Postpischil, Jun 20 '23 at 19:41
@EricPostpischil: Quoting N1570 Annex J2: "An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression `a[1][7]` given the declaration `int a[4][5]`)". Whether or not anything else in the Standard justifies such an interpretation, gcc interprets that text as an invitation to behave nonsensically in cases where code would use an inner-array subscript to access outside the inner array. — supercat, Jun 20 '23 at 20:24
@EricPostpischil: See https://godbolt.org/z/eEWoY6q87 for an example of gcc's treatment. If the source code were processed straightforwardly as written, there would be no mechanism via which `test(4) ` could affect the contents of `arr2[rows]`, but in the code as generated `test(4)` will set `arr[4]` to 1. This is "allowed" because gcc interprets the Standard as waiving jurisdiction over the behavior of attempts to read `arr[0][12]` through `arr[0][14]`. — supercat, Jun 20 '23 at 20:30
@supercat: As noted, the words of the standard are different from what you set previously. Notably, they give an example with an `int` array, whereas I specifically called out **character** pointers as special in the standard, so an `int` example is irrelevant. And still, J.2 is non-normative. As for what GCC does, I do not care. GCC has had pointer provenance bugs before. If `char *p = arr[0];` is inserted in `test2 and `arr[0][i]` is changed to `p[i]`, the program prints “0”, suggesting that GCC is treating a plain character pointer as the standard specifies,… — Eric Postpischil, Jun 20 '23 at 20:58
… even if it is not doing so with `arr[0][i]`. But that is GCC’s problem. — Eric Postpischil, Jun 20 '23 at 20:59
@EricPostpischil: While character pointers are special in some ways, I see nothing within the text of the Standard that would suggest that they are not subject to the same array-subscripting rules as everything else, nor that the equivalence between `*(arrayLValue+n)` and `arrayLValue[n]` doesn't hold for character pointers just as it does for every other type. — supercat, Jun 20 '23 at 21:14
@supercat: Character pointers are subject to the same array subscripting rules as everything else. The difference is that a character pointer can point into the array of bytes that represents an object, per C 2018 6.3.2.3 7, and so that would be the applicable array for applying the rules. — Eric Postpischil, Jun 20 '23 at 21:34
@EricPostpischil: As I noted, recognizing that the application of the "subscripting" operator to arrays as having distinct semantics from pointer arithmetic would make clear that although `arr[0]` could *decay* into a character pointer, such decay would not occur when it is an operand of the `[]` operator. That would be both logical and consistent with how clang and gcc actually *work*, while resolving the contradiction posed by the Standard defining an operation performed using one syntax and characterizing as UB the "same" operation done with a different syntax. — supercat, Jun 20 '23 at 21:40

score 6 · Accepted Answer · answered Jun 13 '23 at 19:58

Is accessing arrays out-of-bounds legal if what lies beyond those bounds is known in C?

I take you to mean evaluating an expression of the form array[i], where array is an expression having array value (prior to decay), and i is either negative or greater than or equal to the number of elements in the array. No, that is not "legal", by which I mean that the C language specification does not define the behavior. It does not matter what lies outside the bounds of the array, or whether that is known in some sense.

Why not

Because the language spec says so. Specifically, it says that array[i] is equivalent to *((array) + (i)), where array is, as usual, subject to decay to a pointer. And it says explicitly that the pointer addition (array) + (i) is defined only if i is between 0 and the number of elements in the array, but that dereferencing the result has undefined behavior if i is equal to the number of elements in the array.

But perhaps you are asking about rationale. The committee has not published an official rationale for those semantics, but it seems reasonable that they favored simpler rules with fewer exceptions. Additionally, they definitely avoid assuming an addressing model that supports "what lies beyond?" even being a sensible question.

and how can this be worked around?

Generally, don't rely on out-of-bounds array accesses.

If the knowledge you claim about what lies beyond an array is based on the array being part of another object, however, then you may be able to use information about the container type. In your particular example, I would just change i4.i2[1].i[-1] to i4.i2[0].i[1]. Under other circumstances, you might be able to use pointer conversions to express the access you want relative to the container. In the worst case, you might need to perform a deeper refactoring to avoid out of bounds accesses.

If arrays use pointer arithmetic,

They do.

then doing so would just generate a pointer to the out-of-bounds but known value.

Actually, yes, but that does not allow you to dereference the pointer. Perhaps that's inconsistent, but it's what the spec says.

Could someone explain what could be going on behind the scenes that could prevent this from working, despite this working in practice on GCC?

A compiler is permitted to assume that your code has well-defined behavior. It is permitted to do anything at all in the event of undefined behavior, whether it recognizes the UB or not. Compilers can and do leverage that to implement optimizations that are correct as long as all program behavior is defined, but that produce results different from what you might naively expect when the behavior is undefined. That you do not observe that in your particular example is not generalizable to other code or other compilers.

Is there any way to 'circumvent' this undefined behavior?

There are lots of ways. Some better, and some worse. Specifics depend on the situation.

assuming I only have access to a pointer to the second element of an i2 member, is there any way to access the first i2 member and/or the constituent int values without invoking undefined behavior?

The only way you could have access to the second element of one of your i2s without also having access to the first is if the access were through a pointer:

process_second_i2(union Int2 *x2) {
    // ...
}

But given that this is a pointer to the second element of an array, it is valid use it to access the first element of the same array:

int x1_0 =  (x2 - 1)[0];

// or

int x1_1 =  x2[-1][1];

But that has UB if x2 does not after all point to the second or subsequent element of an array.

Side note: you claim,

In this case, the union is not allowed to have any padding,

but the C language specification does not support that claim. The ABI of the machine you are compiling for will specify, and it might or might not support that claim. For example, it might specify that int is 4 bytes wide and all structure and union sizes are multiples of 16 bytes. In that case, your union Int2 would indeed contain padding, but your union Int4 would not contain any padding of its own.

Under other circumstances, union Int4 might contain padding.

//It is permitted to do anything at all in the event of undefined behavior, whether it recognizes the UB or not.// Certain ways of treating certain constructs may render implementations unsuitable for some tasks, but the Standard deliberately avoids requiring that all implementations be suitable for all tasks. An implementation intended to be suitable for low-level programming on a platform whose ABI indicates how things are laid out would need to behave predictably in many more situations than mandated by the Standard, but one intended to be suitable for a narrower range of tasks wouldn't. — supercat, Jun 21 '23 at 18:05

score 4 · Answer 2 · answered Jun 13 '23 at 19:38

The behavior on accessing an array out-of-bounds is undefined, period - it doesn't matter if you know the array is contiguous with another object.

Undefined != illegal. All "undefined" means is that neither the compiler nor runtime environment are required to handle the situation in any particular way. The compiler isn't required to issue a diagnostic, the runtime environment is not required to throw a segfault, etc.

supercat · Answer 3 · 2023-06-20T20:51:28.780

There are dialects of C in which the behavior of pointer arithmetic is defined "in a documented manner characteristic of the environment" in all cases where the environment would specify the behavior. Because there are some tasks for which other ways of processing the construct might be more useful, and where optimization might affect program behavior in some scenarios involving out-of-bounds array accesses, the Standard appears to be intended to waive jurisdiction over some scenarios involving indexing beyond the boundaries of "inner" arrays, though I see no evidence of consensus as to what cases programmers should be allowed to rely upon. Annex J2 of N1570 includes the following in its list of "Undefined Behaviors":

An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).

Nothing within the normative text of 6.5.6 would indicate a consensus judgment that a[0]+5 and a[1]+0 should not be viewed as being part of the same "array object" a, but Annex J2 clearly indicates that intention.

Given e.g.

unsigned char arr[5][3];
int test1(int x) { return arr[0][x]; }
int test2(int x) { return *(arr[0]+x); }
int test3(int x) { return *(((char*)arr)+x); }

Annex J2 makes it clear that the Standard is intended to waive jurisdiction of test1 when x is outside the range 0 to 2, but the general principle that objects can be decomposed into a sequence of unsigned char values would imply that test3 should be expected to behave predictably for all values of x from 0 to 14. I see nothing in the normative parts of the Standard, however, that would imply that test1 and test3 are not equivalent.

It's important to note that in many scenarios where pre-standard implementations would almost unanimously process a construct a certain way, but some implementations might have a good reason for processing code in a different fashion that might expose the effects of optimization, the Standard waived jurisdiction to allow implementations that had a good reason for deviating from the commonplace behavior to do so. Such waiver of jurisdiction was never intended to suggest that implementations shouldn't be expected to follow the commonplace behavior in the absence of a documented or obvious reason for doing otherwise, and thus the authors of the Standard saw no need to get bogged down in the details of what precise cases were and were not "defined". If an implementation would have no reason not to process test3 in a manner consistent with the mandated underlying storage layout of the array, nobody should care whether the Standard mandated such behavior.

Incidentally, clang and gcc seem to process the [] operand in a manner that the Standard should have specified but didn't, i.e. specifying that an array operand to [] doesn't decay to a pointer, but is instead processed in a manner analogous to an indexed version of the . operator, which would yield an addressable lvalue when the left operand is an addressable lvalue. Under such a rule, test1 would has defined behavior only for x==0..2, but test2 would has defined behavior for x==0..14. I know of nothing in the documentation for clang or gcc that would specify this behavior, however.

Is accessing arrays out-of-bounds legal if what lies beyond those bounds is known in C? Why not and how can this be worked around?

3 Answers3

Linked

Related