57

As an example, consider the following structure:

struct S {
  int a[4];
  int b[4];
} s;

Would it be legal to write s.a[6] and expect it to be equal to s.b[2]? Personally, I feel that it must be UB in C++, whereas I'm not sure about C. However, I failed to find anything relevant in the standards of C and C++ languages.


Update

There are several answers suggesting ways to make sure there is no padding between fields in order to make the code work reliably. I'd like to emphasize that if such code is UB, then absence of padding is not enough. If it is UB, then the compiler is free to assume that accesses to S.a[i] and S.b[j] do not overlap and the compiler is free to reorder such memory accesses. For example,

    int x = s.b[2];
    s.a[6] = 2;
    return x;

can be transformed to

    s.a[6] = 2;
    int x = s.b[2];
    return x;

which always returns 2.

Nikolai
  • 45
    Reading out of bounds invokes UB. – Ron Nov 03 '17 at 11:00
  • 4
    This is undefined behavior in any high level language. – David Hoelzer Nov 03 '17 at 11:06
  • 8
    If you really HAVE to do such a thing, unionize a[8] with filler[4] and b[4] so that b overlays a[4..7]. – Martin James Nov 03 '17 at 11:08
  • 4
    I'm not so sure that it is definitely UB in any language. Firstly, C/C++ guarantee certain data layout. Secondly, there is a 'flexible array members' idiom in C which makes at least some out-of-bounds accesses legal. – Nikolai Nov 03 '17 at 11:09
  • Even if possible, I would consider that to be a really nasty and unnecessary hack making a serious impact on readability – Suppen Nov 03 '17 at 12:58
  • 2
    @Nikolai: There is no C/C++ language, and especially C++ has quite poor layout guarantees. – MSalters Nov 03 '17 at 14:19
  • 12
    @Nikolai The only data layout you're guaranteed is that b follows a in memory. The compiler could legitimately put 200GB of padding between those two elements and still be compliant with the C standard. – Graham Nov 03 '17 at 14:34
  • 2
    @Nikolai Pre C99, the "flexible array" idiom with a single-element array at the end of the structure was always UB. It was a hack, but a clever one which always worked. Post C99, behaviour is defined in the standard so it is no longer UB if you format it as defined in the standard. The old single-element array approach remains UB. – Graham Nov 03 '17 at 14:38
  • @Nikolai [re](https://stackoverflow.com/questions/47094166/in-a-structure-is-it-legal-to-use-one-array-field-to-access-another-one#comment81137994_47094166) "there is a 'flexible array members' idiom in C which makes at least some out-of-bounds accesses legal" Out-of-bounds access is not legal there either. The details are in what is "out-of-bounds". Yet much discussion on VLA belongs in another post. – chux - Reinstate Monica Nov 03 '17 at 14:50
  • 3
    I think the real question here is not _Can I do it?_ but _Why would I do it?_ – Agnishom Chattopadhyay Nov 04 '17 at 04:44
  • Some compilers have an extension that would allow this to work, or offer additional guarantees about layout. For example, Visual Studio supports `#pragma pack(1)` to align all elements in the structure to one-byte boundaries. This is of course not standard or portable. – Davislor Nov 04 '17 at 08:10
  • @Nikolai: a has four elements. Reading or writing a[4] or a[6] is undefined behaviour. Undefined behaviour means for example the compiler is allowed to assume that the code reading s.a[6] will never be executed. – gnasher729 Nov 04 '17 at 23:35
  • 2
    @DavidHoelzer I don't know about that... There are a number of high-level languages that specifically define an out-of-bounds array access to throw an exception. – David Young Nov 04 '17 at 23:43
  • 1
    @DavidYoung: Swift specifically defines an out-of-bounds array access to cause a crash. More precisely, there is a method in the standard library implementation of Array that will cause the crash. – gnasher729 Nov 05 '17 at 15:03
  • I can't imagine a situation where this would give you any advantage. Why not define a[8] and *b=a+4 and be secure? – Aganju Nov 05 '17 at 17:59

9 Answers

64

Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?

No, because accessing an array out of bounds invokes undefined behaviour in C and C++.

C11 J.2 Undefined behavior

  • Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated (6.5.6).

  • An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).

C++ standard draft section 5.7 Additive operators paragraph 5 says:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. [...] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

msc
  • 3
    But how it relates to 'flexible array members' which is about well-defined out-of-bounds access? – Nikolai Nov 03 '17 at 11:04
  • 6
    @Nikolai: That is an **extension**: a compiler can provide more guarantees than the standard. – Jarod42 Nov 03 '17 at 11:08
  • @Nikolai Read this good reference doc for flexible array: https://www.securecoding.cert.org/confluence/display/c/DCL38-C.+Use+the+correct+syntax+when+declaring+a+flexible+array+member – msc Nov 03 '17 at 11:09
  • `a[1][7]`, i.e. a 2D-array, is actually a different case than two consecutive data members; Anyway, it's UB. – Stephan Lechner Nov 03 '17 at 11:11
  • @Nikolai The standard defines exactly how many elements of a flexible array member can be accessed. The access is not out-of-bounds. – interjay Nov 03 '17 at 11:13
  • 7
    Annex J is also irrelevant to cite because it is informative, it is not normative text. Citing an ISO standard is not just about doing some text search in a pdf and copy/paste random text. The relevant part here is 6.5.6 - which Annex J points at. Annex J is merely a convenient summary of all forms of poorly defined behavior located elsewhere in the standard. – Lundin Nov 03 '17 at 12:27
  • 6
    @Nikolai flexible array members must be the last member of the struct, so you can't have a flexible `a` that would allow accessing elements of `b`. – Pete Kirkham Nov 03 '17 at 13:18
  • 16
    @Lundin: We have a [language-lawyer] tag for when we care about normative text. Annex J is sufficient for non-lawyering. – MSalters Nov 03 '17 at 14:21
33

Apart from the answer of @rsp (undefined behavior for an array subscript that is out of range), I can add that it is not legal to access b via a because the C language does not specify how much padding there can be between the end of the area allocated for a and the start of b, so even if it runs on a particular implementation, it is not portable.

instance of struct:
+-----------+----------------+-----------+---------------+
|  array a  |  maybe padding |  array b  | maybe padding |
+-----------+----------------+-----------+---------------+

The second padding may also be absent, since the alignment of the struct object is the alignment of a, which is the same as the alignment of b. However, the C language does not require the second padding to be absent either.
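
The amount of padding on a given implementation can be measured with offsetof and sizeof. The following is a minimal sketch; the helper names gap_before_b and tail_padding are made up for illustration, and the values they return are implementation-specific.

```c
#include <stddef.h>  /* offsetof, size_t */

struct S {
  int a[4];
  int b[4];
};

/* Bytes of padding between the end of a and the start of b. */
size_t gap_before_b(void) {
  return offsetof(struct S, b) - (offsetof(struct S, a) + sizeof(int[4]));
}

/* Bytes of padding after the end of b. */
size_t tail_padding(void) {
  return sizeof(struct S) - (offsetof(struct S, b) + sizeof(int[4]));
}
```

On most mainstream compilers both functions are likely to return 0 for this struct, but a result of 0 only describes the layout. It does not make s.a[6] a defined access.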

alinsoar
  • The arrays are of the same type. There's no padding between them. – Nikolai Nov 03 '17 at 12:08
  • 19
    @Nikolai the Standard does not forbid the padding to exist even if the alignments are the same. – alinsoar Nov 03 '17 at 12:19
  • I'm guessing not, but is there any difference if it's a union and not a struct? – Panzercrisis Nov 03 '17 at 13:43
  • 3
    In @Nikolai's example, there's *probably* no padding. But if `a` was `char a[3]`, the compiler may or may not decide to put a byte of padding before `b`. – dwilliss Nov 03 '17 at 14:51
  • 2
    @Panzercrisis: ISO C++ leaves union type-punning as UB so your question is only relevant in C99/C11, or in GNU C++ or other dialects / implementations where writing one member and then reading another is defined. But then sure, in `union { int a[10], b[5]; }` it's well-defined that elements of `b[]` line up with the early elements of `a[]`. Unions overlap the members so padding doesn't come into it. – Peter Cordes Nov 03 '17 at 17:30
  • 1
    @Nikolai, consecutively declared variables, even of the same type, don't have to be consecutive in memory. For example, `float x; double y; float z; double w;` may be placed in memory in order `x,z,y,w` so that access to `y` would be 64-bit aligned. Ditto arrays. Thus the "padding", which may even contain other variables declared elsewhere. – Michael Nov 03 '17 at 17:30
  • 1
    @Michael: struct members are guaranteed to have increasing addresses, so compilers are *not* free to reorder for optimal packing (unless there's a mix of `private`, `public`, or `protected`). *Nonstatic data members of a (non-union) class with the same access control (Clause 11) are allocated so that later members have higher addresses within a class object.* N4140 9.2.13. ISO C has the same guarantee. Thus, you should generally put the largest members first, or if you know the sizes of types on a specific platform there are other ways to avoid wasting space. – Peter Cordes Nov 03 '17 at 21:57
11

a and b are two different arrays, and a is defined as containing 4 elements. Hence, a[6] accesses the array out of bounds and is therefore undefined behaviour. Note that the array subscript a[6] is defined as *(a+6), so the proof of UB is actually given by the section "Additive operators" in conjunction with pointers. See the following section of the C11 standard (e.g. this online draft version) describing this aspect:

6.5.6 Additive operators

When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i-n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

The same argument applies to C++ (though not quoted here).

Further, though it is clearly undefined behaviour because it exceeds the array bounds of a, note that the compiler might introduce padding between members a and b, such that, even if such pointer arithmetic were allowed, a+6 would not necessarily yield the same address as b+2.
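
For contrast, here is a minimal sketch of what the quoted paragraph does permit: forming a pointer one past the last element of a and comparing against it, without ever dereferencing it. The function name sum_a is only illustrative.

```c
struct S {
  int a[4];
  int b[4];
};

/* Sums the four elements of s->a. The pointer end is one past the last
   element: it may be formed and compared, but it is never dereferenced. */
int sum_a(const struct S *s) {
  const int *end = s->a + 4;
  int total = 0;
  for (const int *p = s->a; p != end; ++p) {
    total += *p;
  }
  return total;
}
```

Writing s->a + 5 or beyond would already be undefined by the same paragraph, before any dereference takes place.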

Stephan Lechner
6

Is it legal? No. As others mentioned, it invokes Undefined Behavior.

Will it work? That depends on your compiler. That's the thing about undefined behavior: it's undefined.

On many C and C++ compilers, the struct will be laid out such that b will immediately follow a in memory and there will be no bounds checking. So accessing a[6] will effectively be the same as b[2] and will not cause any sort of exception.

Given

struct S {
  int a[4];
  int b[4];
} s;

and assuming no extra padding, the structure is really just a way of looking at a block of memory containing 8 integers. You could cast its address to (int*), and ((int*)&s)[6] would point to the same memory as s.b[2].

Should you rely on this sort of behavior? Absolutely not. Undefined means that the compiler doesn't have to support this. The compiler is free to pad the structure which could render the assumption that &(s.b[2]) == &(s.a[6]) incorrect. The compiler could also add bounds checking on the array access (although enabling compiler optimizations would probably disable such a check).

I have experienced the effects of this in the past. It's quite common to have a struct like this

struct Bob {
    char name[16];
    char whatever[64];
} bob;
strcpy(bob.name, "some name longer than 16 characters");

Now bob.whatever will be " than 16 characters". (which is why you should always use strncpy, BTW)
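
One caveat about strncpy: it does not write a terminating null character when the source is at least as long as the limit. A minimal sketch of a bounded copy that always terminates, using snprintf instead, could look like this (set_name is a hypothetical helper, not part of any library):

```c
#include <stdio.h>   /* snprintf */

struct Bob {
  char name[16];
  char whatever[64];
};

/* Copies src into bob->name, truncating if needed. snprintf writes at most
   sizeof bob->name bytes, including the terminating null character. */
void set_name(struct Bob *bob, const char *src) {
  snprintf(bob->name, sizeof bob->name, "%s", src);
}
```

With the long string from above, name ends up holding the first 15 characters plus the terminator, and whatever is left untouched.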

dwilliss
  • But the way this struct is defined, the memory allocation will be the same as if it was a struct of 8 integers (assuming padding doesn't get added). At least in the case of gcc with optimizations on, there is no bounds checking on array indexes. You'll get an invalid memory access if you access past the allocated memory, but it was all allocated as one block. – dwilliss Nov 03 '17 at 13:38
  • Correct. It will access *something*. No guarantee *what* or if it will do so without throwing an invalid access exception. – dwilliss Nov 03 '17 at 14:17
  • 4
    It won't even necessarily access something, the compiler is allowed to assume that undefined behaviour doesn't happen so if it can prove an expression would produce undefined behaviour the compiler is allowed to assume that that code is never reached and optimise accordingly. – SirGuy Nov 03 '17 at 14:26
  • "That depends on your compiler. That's the thing about undefined behavior" afaik that would be implementation defined behaviour while undefined behaviour implies that in principle one cannot rely on the behaviour even for one specific compiler (in practice the situation might be different) – 463035818_is_not_an_ai Nov 03 '17 at 14:56
  • 1
    @tobi303 Whether it depends on your compiler depends on your compiler, so to speak. (And the run time environment.) A conforming implementation is surely free to define standard-undefined behavior. For example it could introduce index-checking code for all arrays of known size and raise well-defined run-time exceptions for out-of-bounds access. It is just not required to (as opposed to standard-*implementation-defined* behavior which *must* be defined by, well, the implementation). – Peter - Reinstate Monica Nov 03 '17 at 16:33
5

As @MartinJames mentioned in a comment, if you need to guarantee that a and b are in contiguous memory (or at least able to be treated as such, (edit) unless your architecture/compiler uses an unusual memory block size/offset and forced alignment that would require padding to be added), you need to use a union.

union overlap {
    char all[8]; /* all the bytes in sequence */
    struct { /* (anonymous struct so its members can be accessed directly) */
        char a[4]; /* padding may be added after this if the alignment is not a sub-factor of 4 */
        char b[4];
    };
};

You can't directly access b from a (e.g. a[6], like you asked), but you can access the elements of both a and b by using all (e.g. all[6] refers to the same memory location as b[2]).

(Edit: You could replace 8 and 4 in the code above with 2*sizeof(int) and sizeof(int), respectively, to be more likely to match the architecture's alignment, especially if the code needs to be more portable, but then you have to be careful to avoid making any assumptions about how many bytes are in a, b, or all. However, this will work on what are probably the most common (1-, 2-, and 4-byte) memory alignments.)

Here is a simple example:

#include <stdio.h>

union overlap {
    char all[2*sizeof(int)]; /* all the bytes in sequence */
    struct { /* anonymous struct so its members can be accessed directly */
        char a[sizeof(int)]; /* low word */
        char b[sizeof(int)]; /* high word */
    };
};

int main()
{
    union overlap testing;
    testing.a[0] = 'a';
    testing.a[1] = 'b';
    testing.a[2] = 'c';
    testing.a[3] = '\0'; /* null terminator */
    testing.b[0] = 'e';
    testing.b[1] = 'f';
    testing.b[2] = 'g';
    testing.b[3] = '\0'; /* null terminator */
    printf("a=%s\n",testing.a); /* output: a=abc */
    printf("b=%s\n",testing.b); /* output: b=efg */
    printf("all=%s\n",testing.all); /* output: all=abc */

    testing.a[3] = 'd'; /* makes printf keep reading past the end of a */
    printf("a=%s\n",testing.a); /* output: a=abcdefg */
    printf("b=%s\n",testing.b); /* output: b=efg */
    printf("all=%s\n",testing.all); /* output: all=abcdefg */

    return 0;
}
Jed Schaaf
  • "if you need to guarantee that a and b are in contiguous memory (or at least able to be treated as such), you need to use a union" - can you quote standard on that? – Maciej Piechotka Nov 04 '17 at 02:56
3

No, since accessing an array out of bounds invokes Undefined Behavior, both in C and C++.

gsamaras
2

Short Answer: No. You're in the land of undefined behavior.

Long Answer: No. But that doesn't mean that you can't access the data in other sketchier ways... if you're using GCC you can do something like the following (elaboration of dwilliss's answer):

struct __attribute__((packed,aligned(4))) Bad_Access {
    int arr1[3];
    int arr2[3];
};

and then you could access via (Godbolt source+asm):

int x = ((int*)ba_pointer)[4];

But that cast violates strict aliasing so is only safe with g++ -fno-strict-aliasing. You can cast a struct pointer to a pointer to the first member, but then you're back in the UB boat because you're accessing outside the first member.

Alternatively, just don't do that. Save a future programmer (probably yourself) the heartache of that mess.

Also, while we're at it, why not use std::vector? It's not fool-proof, but on the back-end it has guards to prevent such bad behavior.

Addendum:

If you're really concerned about performance:

Let's say you have two same-typed pointers that you're accessing. The compiler will more than likely assume that both pointers have the chance to interfere, and will instantiate additional logic to protect you from doing something dumb.

If you solemnly swear to the compiler that you're not trying to alias, the compiler will reward you handsomely: Does the restrict keyword provide significant benefits in gcc / g++
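
As a rough sketch of what that promise looks like in C99 and later (the function name add_into is made up here; restrict is not part of standard C++, where g++ accepts the __restrict__ spelling as an extension):

```c
#include <stddef.h>  /* size_t */

/* restrict is a promise from the caller that dst and src do not overlap,
   so the compiler need not reload src[i] after each store to dst[i]. */
void add_into(int *restrict dst, const int *restrict src, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] += src[i];
  }
}
```

Calling this with overlapping ranges breaks the promise and is itself undefined behavior, so it only helps when the two arrays really are distinct.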

Conclusion: Don't be evil; your future self, and the compiler will thank you.

Alex Shirley
  • 1
    This is only safe if you compile with `g++ -fno-strict-aliasing`. Otherwise casting a `struct Bad_Access *` to an `int *` is UB. – Peter Cordes Nov 03 '17 at 22:03
1

Jed Schaaf's answer is on the right track, but not quite correct. If the compiler inserts padding between a and b, his solution will still fail. If, however, you declare:

typedef struct {
  int a[4];
  int b[4];
} s_t;

typedef union {
  char bytes[sizeof(s_t)];
  s_t s;
} u_t;

You may now access (int*)(bytes + offsetof(s_t, b)) to get the address of s.b, no matter how the compiler lays out the structure. The offsetof() macro is declared in <stddef.h>.
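
A minimal sketch of reading s.b through the byte view follows. It uses memcpy rather than dereferencing the cast pointer, which is a deliberate substitution to stay clear of aliasing questions; read_b_via_bytes is a hypothetical helper name.

```c
#include <stddef.h>  /* offsetof, size_t */
#include <string.h>  /* memcpy */

typedef struct {
  int a[4];
  int b[4];
} s_t;

typedef union {
  char bytes[sizeof(s_t)];
  s_t s;
} u_t;

/* Reads element i of s.b through the byte view. memcpy is used instead of
   dereferencing a cast pointer; the caller must keep i in the range 0..3. */
int read_b_via_bytes(const u_t *u, size_t i) {
  int value;
  memcpy(&value, u->bytes + offsetof(s_t, b) + i * sizeof(int), sizeof value);
  return value;
}
```

Because the offset comes from offsetof, the helper finds b wherever the compiler placed it, padding or not.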

The expression sizeof(s_t) is a constant expression, legal in an array declaration in both C and C++. It will not give a variable-length array. (Apologies for misreading the C standard before. I thought that sounded wrong.)

In the real world, though, two consecutive arrays of int in a structure are going to be laid out the way you expect. (You might be able to engineer a very contrived counterexample by setting the bound of a to 3 or 5 instead of 4 and then getting the compiler to align both a and b on a 16-byte boundary.) Rather than convoluted methods to try to get a program that makes no assumptions whatsoever beyond the strict wording of the standard, you want some kind of defensive coding, such as static_assert(&both_arrays[4] == &s.b[0], "");. These add no run-time overhead and will fail if your compiler is doing something that would break your program, so long as you don’t trigger UB in the assertion itself.
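
Here is one way such a check might be written in C11. This is a sketch under assumptions: the union both_t and its member both_arrays are hypothetical names, and the assertions are phrased with offsetof and sizeof because those are guaranteed to be constant expressions.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */

typedef struct {
  int a[4];
  int b[4];
} s_t;

typedef union {
  int both_arrays[8];
  s_t s;
} both_t;

/* Compile-time layout checks: they fail the build if padding is present. */
static_assert(offsetof(s_t, b) == 4 * sizeof(int), "padding between a and b");
static_assert(sizeof(s_t) == 8 * sizeof(int), "padding after b");

/* Reads s.b[i] through the flat view; the caller must keep i in 0..3. */
int read_b_flat(const both_t *u, size_t i) {
  return u->both_arrays[4 + i];
}
```

Reading a union member other than the one last written is defined in C99 and C11 but not in ISO C++, so this sketch applies to C only.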

If you want a portable way to guarantee that both sub-arrays are packed into a contiguous memory range, or split a block of memory the other way, you can copy them with memcpy().
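
A minimal sketch of that memcpy approach, with flatten and unflatten as illustrative names:

```c
#include <string.h>  /* memcpy */

typedef struct {
  int a[4];
  int b[4];
} s_t;

/* Copies a then b into out, which must have room for 8 ints. */
void flatten(const s_t *s, int out[8]) {
  memcpy(out, s->a, sizeof s->a);
  memcpy(out + 4, s->b, sizeof s->b);
}

/* Splits a flat block of 8 ints back into the two members. */
void unflatten(s_t *s, const int in[8]) {
  memcpy(s->a, in, sizeof s->a);
  memcpy(s->b, in + 4, sizeof s->b);
}
```

After flatten, out[6] holds the value of s.b[2] regardless of any padding in s_t, at the cost of a copy.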

Davislor
0

The Standard does not impose any restrictions upon what implementations must do when a program tries to use an out-of-bounds array subscript in one structure field to access a member of another. Out-of-bounds accesses are thus "illegal" in strictly conforming programs, and programs which make use of such accesses cannot simultaneously be 100% portable and free of errors. On the other hand, many implementations do define the behavior of such code, and programs which are targeted solely at such implementations may exploit such behavior.

There are three issues with such code:

  1. While many implementations lay out structures in predictable fashion, the Standard allows implementations to add arbitrary padding before any structure member other than the first. Code could use sizeof or offsetof to ensure that structure members are placed as expected, but the other two issues would remain.

  2. Given something like:

    if (structPtr->array1[x])
     structPtr->array2[y]++;
    return structPtr->array1[x];
    

    it would normally be useful for a compiler to assume that the use of structPtr->array1[x] will yield the same value as the preceding use in the "if" condition, even though it would change the behavior of code that relies upon aliasing between the two arrays.

  3. If array1[] has e.g. 4 elements, a compiler given something like:

    if (x < 4) foo(x);
    structPtr->array1[x]=1;
    

might conclude that since there would be no defined cases where x isn't less than 4, it could call foo(x) unconditionally.
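
One defensive option, sketched below under assumptions, is to reject a bad index before the access so that the out-of-bounds case never executes. The struct two_arrays and the helper store_array1 are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>  /* size_t */

struct two_arrays {
  int array1[4];
  int array2[4];
};

/* Stores value in array1[x] only when x is a valid index, so the access
   itself can never be out of bounds. Returns false when x is rejected. */
bool store_array1(struct two_arrays *p, size_t x, int value) {
  if (x >= sizeof p->array1 / sizeof p->array1[0]) {
    return false;
  }
  p->array1[x] = value;
  return true;
}
```

Here the check guards the store itself, so there is no undefined path from which a compiler could infer that the check is redundant.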

Unfortunately, while programs can use sizeof or offsetof to ensure that there aren't any surprises with struct layout, there's no way by which they can test whether compilers promise to refrain from the optimizations of types #2 or #3. Further, the Standard is a little vague about what would be meant in a case like:

struct foo {char array1[4],array2[4]; };

int test(struct foo *p, int i, int x, int y, int z)
{
  if (p->array2[x])
  {
    ((char*)p)[x]++;
    ((char*)(p->array1))[y]++;
    p->array1[z]++;
  }
  return p->array2[x];
}

The Standard is pretty clear that behavior would only be defined if z is in the range 0..3, but since the type of p->array1 in that expression is char* (due to decay) it's not clear the cast in the access using y would have any effect. On the other hand, since converting a pointer to the first element of a struct to char* should yield the same result as converting a struct pointer to char*, and the converted struct pointer should be usable to access all bytes therein, it would seem the access using x should be defined for (at minimum) x=0..7 [if the offset of array2 is greater than 4, it would affect the value of x needed to hit members of array2, but some value of x could do so with defined behavior].
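
Under the reading described above, namely that a char pointer derived from the struct pointer may access every byte of the struct, a member of array2 could be reached like this. It is a sketch resting on that interpretation, and bump_array2_via_bytes is a hypothetical name.

```c
#include <stddef.h>  /* offsetof, size_t */

struct foo { char array1[4], array2[4]; };

/* Increments array2[k] through a char pointer to the whole struct.
   The caller must keep k in the range 0..3. */
void bump_array2_via_bytes(struct foo *p, size_t k) {
  char *bytes = (char *)p;
  bytes[offsetof(struct foo, array2) + k]++;
}
```

Using offsetof rather than a hard-coded 4 accounts for the bracketed caveat about the offset of array2.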

IMHO, a good remedy would be to define the subscript operator on array types in a fashion that does not involve pointer decay. In that case, the expressions p->array1[x] and &(p->array1[x]) could invite a compiler to assume that x is 0..3, but p->array1+x and *(p->array1+x) would require a compiler to allow for the possibility of other values. I don't know if any compilers do that, but the Standard doesn't require it.

supercat