123

What I am asking about is the well known "last member of a struct has variable length" trick. It goes something like this:

struct T {
    int len;
    char s[1];
};

struct T *p = malloc(sizeof(struct T) + 100);
p->len = 100;
strcpy(p->s, "hello world");

Because of the way that the struct is laid out in memory, we are able to overlay the struct over a larger than necessary block and treat the last member as if it were larger than the 1 char specified.
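For reference, the snippet above can be wrapped into a self-contained helper. This is only a sketch: the name `t_new` is hypothetical, and whether the write past `s[0]` is strictly conforming is exactly what this question asks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct T {
    int len;
    char s[1];   /* declared with one element, used as if it had more */
};

/* Hypothetical helper: allocate a struct T with room for a copy of str. */
struct T *t_new(const char *str)
{
    assert(str != NULL);
    size_t n = strlen(str);
    /* sizeof(struct T) already counts s[0], so n extra bytes leave room
       for n characters plus the terminating '\0'. */
    struct T *p = malloc(sizeof(struct T) + n);
    if (p == NULL)
        return NULL;
    p->len = (int)n;
    strcpy(p->s, str);   /* the access whose status is being debated */
    return p;
}
```

The caller releases the whole block with a single `free(p)`.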

So the question is: is this technique technically undefined behavior? I would expect that it is, but I am curious what the standard says about this.

PS: I am aware of the C99 approach to this; I would like the answers to stick specifically to the version of the trick as listed above.

Evan Teran
  • As in, the code listed above that will fail during compilation? Look at your `malloc` line. Specifically, the type decl. – jer Sep 14 '10 at 17:14
  • 36
    This seems like a quite clear, reasonable, and above all *answerable* question. Not seeing the reason for the close vote. – cHao Sep 14 '10 at 17:17
  • @jer: I am asking about whether or not the trick itself is UB, not if it compiles. sorry i forgot the `struct` keyword there, fixed. – Evan Teran Sep 14 '10 at 17:19
  • It will work, but you can be writing over memory, since you are going past the end of the allocated area, so this would be as bad as just having a char[1] a; strcpy(a, "hello world"). It compiles, but both have the same problem. – James Black Sep 14 '10 at 17:26
  • 2
    If you introduced an "ANSI C" compiler that didn't support the struct hack, most C programmers I know would not accept that your compiler "worked right", notwithstanding that they would accept a strict reading of the standard. The committee simply missed one on that. – dmckee --- ex-moderator kitten Sep 14 '10 at 17:30
  • 4
    @james The hack works by mallocing an object big enough for the array you mean, despite having declared a minimal array. So you are accessing *allocated* memory outside the strict definition of the struct. Writing past your allocation is unarguably a mistake, but that is different from writing in your allocation but outside "the struct". – dmckee --- ex-moderator kitten Sep 14 '10 at 17:32
  • JUST KIDDING: one advantage of the "struct hack" in C89 as opposed to C99 is that we "gain" 1 byte for the terminating `'\0'` automagically. – pmg Sep 14 '10 at 18:49
  • @dmckee - I thought it was just changing a pointer, except that the example was doing a string copy into an array that is too small. If p->s had pointed to "hello world" that would have been fine, but the strcpy is the problem. – James Black Sep 14 '10 at 18:53
  • 2
    @James: The oversized malloc is critical here. It ensures that there is memory---memory with a legal address and 'owned' by the structure (i.e. it is illegal for any other entity to use it)---past the nominal end of the structure. Note that this means you can't use the struct hack on automatic variables: they must be dynamically allocated. – dmckee --- ex-moderator kitten Sep 14 '10 at 19:46
  • 1
    @James Black: No, `p->s` isn't a pointer. – jamesdlin Sep 14 '10 at 19:47
  • @dmckee: It's not owned by the structure. The structure is merely overlapping with a part of the allocated object, which is much larger and is the actual "object" in question. `p->s + 1` happens to be a valid pointer to a part of that object which can be used for storing type `char`. – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:44
  • @jamesdlin: Are you sure about that? An array in that context ought to decay to a pointer, following C semantics. – Vatine Sep 14 '10 at 23:45
  • @R.. Sure. Take that as "notionally owned". The right language isn't well defined. The important bit is that nothing *else* owns that memory and it is a continuously addressable block after the structure proper. – dmckee --- ex-moderator kitten Sep 14 '10 at 23:55
  • 1
    I know this is a long standing trick, but just out of curiosity — why would you bother? Instead of defining the struct to have a pointer in the first place? I mean, I know C pretty well, and this is just confusing to look at. – detly Sep 15 '10 at 00:09
  • 2
    @detly: using a pointer is slower (extra dereference) and wastes space (at least 4 or 8 bytes, depending on if you have a 32/64 bit machine, and a lot more if you `malloc` the string separately rather than storing it immediately after the struct in the same allocated block). If you have lots of small objects or will be accessing them often, it's stupid to use a pointer here. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:26
  • 6
    @detly: It's simpler to allocate/deallocate one thing than it is to allocate/deallocate two things, especially since the latter has two ways of failing that you need to deal with. This matters more to me than the marginal cost/speed savings. – jamesdlin Sep 15 '10 at 00:30
  • 1
    @detly: Also used for headers of variable-length data (like network frames, certain file formats, etc.). You can take a raw array of bytes and cast it as a pointer to this type to access the members of the header and still have a member that points to the variable-length data at the end. You can also create a union between the struct and a fixed-length buffer of bytes when creating a block of data to send on the network, or write to a file. – tomlogic Nov 04 '10 at 23:14
  • Maybe I'm getting old and missing something simple, but why is this confusing? Ignore the struct part for a minute and realize that p is a pointer to some number of bytes larger than 100, so you can clearly copy "hello world" to it. C is beautiful in its simplicity. Don't get bogged down in what you think the compiler might be thinking. You have a pointer, you're allocating memory, you're using that memory. I have a hard time even calling this a 'hack.' – stu Nov 08 '10 at 04:12
  • Symbian OS's descriptors use this technique for stack based objects with heap allocated strings. The object contains the length separately. – Dynite Nov 08 '10 at 09:42
  • What is its status in C++? The `struct` has to be POD in C++98/03, and at least trivial in C++11 and later. (Would trivially-copyable be OK?) – CTMacUser Oct 15 '13 at 06:03

8 Answers

55

As the C FAQ says:

It's not clear if it's legal or portable, but it is rather popular.

and:

... an official interpretation has deemed that it is not strictly conforming with the C Standard, although it does seem to work under all known implementations. (Compilers which check array bounds carefully might issue warnings.)

The rationale behind the 'strictly conforming' bit is in the spec, section J.2 Undefined behavior, which includes in the list of undefined behavior:

  • An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).

Paragraph 8 of Section 6.5.6 Additive operators has another mention that access beyond defined array bounds is undefined:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
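As an illustration of what that paragraph does permit, here is a minimal sketch (the function name `sum_ints` is made up for this example): a pointer one past the last element may be formed and compared against, but it is never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints, using a one-past-the-end pointer only as the loop bound. */
int sum_ints(const int *a, size_t n)
{
    assert(a != NULL);
    const int *end = a + n;   /* one past the last element: may be formed and compared */
    int total = 0;
    for (const int *q = a; q != end; ++q)
        total += *q;          /* q is dereferenced only while it points to an element */
    return total;
}
```

The struct hack debate is about whether `p->s + i` for `i > 1` falls under the "otherwise" branch of that paragraph, given that `s` is declared with a single element.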

Carl Norum
  • 2
    In the OP's code, `p->s` is never used as an array. It's passed to `strcpy`, in which case it decays to a plain `char *`, which happens to point to an object which can legally be interpreted as `char [100];` inside the allocated object. – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:34
  • 3
    Perhaps another way of looking at this is that the language could conceivably restrict how you access actual **array variables** as described in J.2, but there is no way it can make such restrictions for an object allocated by `malloc`, when you have merely converted the returned `void *` to a pointer to [a struct containing] an array. It's still valid to access any part of the allocated object using a pointer to `char` (or preferably `unsigned char`). – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:41
  • @R. - I can see how J2 might not cover this, but isn't it also covered by 6.5.6? – detly Sep 15 '10 at 00:03
  • 1
    Sure it could! Type and size information could be embedded in every pointer, and any erroneous pointer arithmetic could then be made to trap - see e.g. [CCured](http://hal.cs.berkeley.edu/ccured/). On a more philosophical level, it doesn't matter whether *no possible implementation* could catch you, it's still undefined behavior (there are, iirc, cases of undefined behavior that would require an oracle for the Halting Problem to nail down - which is precisely why they are undefined). – zwol Sep 15 '10 at 00:10
  • detly: J2 is just an informative list, 6.5.6 is definitive, R is wrong. – zwol Sep 15 '10 at 00:11
  • 4
    The object is not an array object so 6.5.6 is irrelevant. The object is the block of memory allocated by `malloc`. Lookup "object" in the standard before you spout off bs. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:23
  • By the way, the example in J2 depends on the type. If the array happened to be `unsigned char a[4][5];`, then `a[1][7]` is valid because any object can be accessed via a pointer to `unsigned char` as if it were overlaid by an array of type `unsigned char [sizeof object]`. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:34
  • Whether or not 6.5.6 disallows seems to hinge on whether you consider that the entire block of 100 chars allocated by `malloc()` in the example counts as an array object (of 100 chars) in its own right. (What if the `struct` was stored within a union that included a large char array? Are two pointers that have the same type and must be equal always equivalent?) – caf Sep 15 '10 at 02:06
  • @R.. I was not aware that the `unsigned char[]` behaviour you mentioned was in the std... do you recall if that's C89, C99 or both? – detly Sep 15 '10 at 10:38
  • I think it's only in C99, but I'm not sure. The purpose seems to be so that you can make functions which operate on abstract memory. My favorite hypothetical one is a `memswap` which swaps two potentially-giant objects without requiring temporary space like using `memcpy` would. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 12:43
  • By the way, the reason this only applies to `unsigned char *` and not other pointers is related to strict aliasing rules. Having to assume `int *` and `struct foo *` might point to the same or overlapping memory, like pre-C99 compilers did in practice, was deemed too costly (inhibits too many optimizations). Only allowing `unsigned char *` to alias other objects seems to have been the compromise. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 12:46
  • @R..: What do you make of the defect report at http://www.open-std.org/Jtc1/sc22/wg14/www/docs/dr_051.html which suggests that the struct hack is technically Undefined Behavior (though sufficiently widely used that compiler writers should regard it as having defined semantics)? – supercat Dec 05 '12 at 16:50
  • @supercat: Per the standard, `p->x == (char *)p`. Clearly `((char *)p)[5]` is well-defined, so how can `p->x[5]` be undefined? Are they claiming pointers could be represented such that two different pointers *compare equal* but still carry with them in their representations information on how they were derived? – R.. GitHub STOP HELPING ICE Dec 05 '12 at 17:05
  • @R..: Is there anything in the standard that would forbid a compiler from doing so? The standard certainly allows such behaviors with floating-point numbers (e.g. on many machines, if `x=1.0/(1.0/0.0)` and `y=1.0/(-1.0/0.0)`, the comparison `x==y` would report them as equal, but `1.0/x` != `1.0/y`). – supercat Dec 05 '12 at 17:07
  • @supercat: I'm aware that there's no explicit rule against doing that. My intuition is that it would however break other pointer cast and pointer arithmetic semantics that are supposed to work. For instance, the pointer resulting from the decay of `p->x` *points to* a `char` which happens to be the first `char` of a 101-`char` array (the representation array of the object returned by `malloc`). As such, adding 5 to it should be legal, since the result does not go outside this array. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 17:13
  • Put differently, I see nowhere that the rules about pointer arithmetic make reference to how the pointer was obtained or the identity of the pointer. Rather, they refer only to the pointed-to object and its potential status as an element of an array, which is invariant with respect to the derivation of the pointer used to access it. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 17:21
  • @R..: Although I don't know of any implementations whose pointer type contains a minimum and maximum limit along with the present value, I don't know of anything that would forbid it. If malloc returns p==(min=1234, max=1342, cur=1234), p->s would be (min=1238, max=1239, cur=1238) (pegging max to the lesser of (p_cur+5) or p_max). If a compiler could *legally* do that, then it should also legally be allowed to do optimizations like replacing `p->s[ind]` with `p->s[0]` (a useful optimization if the prohibition against zero-sized arrays hadn't led to widespread abuse of size-1 arrays). – supercat Dec 05 '12 at 17:59
  • @supercat: What about my second comment? The way pointer addition is specified *defines* the behavior regardless of how the pointer was obtained. Declaring it undefined elsewhere is contradictory. I agree that this contradiction could be "fixed" by amending the specification of pointer addition to allow it to depend on the pointer used (rather than just the pointed-to object), but under the current specification, the interpretation for the defect report contradicts the language elsewhere in the standard. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 18:03
  • @R..: BTW, given the above struct, should a compiler be required to assume that `p->s[someVariable]` could alias `p->len`? Such a requirement would compel the generation of code to perform extra work which would in 99.99999% of cases be totally useless [it would probably be useless even in a lot of cases where `someVariable` was in the range (-sizeof(int)) to -1, since most of those would likely be accidental]. – supercat Dec 05 '12 at 18:05
  • As written, yes. Because pointers to `char` type can alias anything. I agree this is suboptimal and it's probably desirable to "fix" the issue. The problem is it's not so easy to fix, because you might actually have code that's valid under the standard as written, e.g. using `offsetof(T,y)-offsetof(T,x)` to jump from a pointer to one element of a struct to a pointer to another using the representation array. And with members which have `char` or `char` array type, it's impossible for the compiler to distinguish between pointers to the representation array and pointers to the member. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 18:07
36

I believe that technically it's undefined behavior. The standard (arguably) doesn't address it directly, so it falls under the "or by the omission of any explicit definition of behavior." clause (§4/2 of C99, §3.16/2 of C89) that says it's undefined behavior.

The "arguably" above depends on the definition of the array subscripting operator. Specifically, it says: "A postfix expression followed by an expression in square brackets [] is a subscripted designation of an array object." (C89, §6.3.2.1/2).

You can argue that the "of an array object" is being violated here (since you're subscripting outside the defined range of the array object), in which case the behavior is (a tiny bit more) explicitly undefined, instead of just undefined courtesy of nothing quite defining it.
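The link between subscripting and pointer arithmetic can be made concrete. The standard defines `E1[E2]` as identical to `(*((E1)+(E2)))`, so the two spellings below always designate the same element. This is a small sketch; the function name `same_element` is hypothetical, and the caller is assumed to keep `i` within the bounds of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the subscript form and the pointer-arithmetic form
   yield the same address. The caller must keep i within bounds. */
int same_element(const char *a, size_t i)
{
    assert(a != NULL);
    return &a[i] == a + i;
}
```

Because of this equivalence, any restriction on the addition `a + i` applies equally to the subscript `a[i]`.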

In theory, I can imagine a compiler that does array bounds checking and (for example) would abort the program when/if you attempted to use an out of range subscript. In fact, I don't know of such a thing existing, and given the popularity of this style of code, even if a compiler tried to enforce subscripts under some circumstances, it's hard to imagine that anybody would put up with its doing so in this situation.

Jerry Coffin
  • 2
    I can also imagine a compiler which might decide that if an array happened to be of size 1, then `arr[x] = y;` might be rewritten as `arr[0] = y;`; for an array of size 2, `arr[i] = 4;` might be rewritten as `i ? arr[1] = 4 : arr[0] = 4;` While I've never seen a compiler perform such optimizations, on some embedded systems they could be very productive. On a PIC18x, using 8-bit data types, the code for the first statement would be sixteen bytes, the second, two or four, and the third, eight or twelve. Not a bad optimization if legal. – supercat Jan 13 '12 at 23:48
  • If the standard defines array access outside of array bounds as undefined behaviour, then the struct hack is too. If, however, the standard defines array access as syntactical sugar for pointer arithmetic (`a[2] == *(a + 2)`), it is not. If I'm correct, all C standards define array access as pointer arithmetic. – yyny Dec 05 '16 at 22:07
17

Yes, it is undefined behavior.

C Language Defect Report #051 gives a definitive answer to this question:

The idiom, while common, is not strictly conforming

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_051.html

In the C99 Rationale document the C Committee adds:

The validity of this construct has always been questionable. In the response to one Defect Report, the Committee decided that it was undefined behavior because the array p->items contains only one item, irrespective of whether the space exists.

ouah
  • 2
    +1 for finding this, but I still claim it's contradictory. Two pointers to the same object (in this case, the given byte) are equal, and one pointer to it (the pointer into the representation array of the entire object obtained by `malloc`) is valid in the addition, so how can the identical pointer, obtained via another route, be invalid in the addition? Even if they want to claim it's UB, that's pretty meaningless, because there is computationally no way for an implementation to distinguish between the well-defined usage and the supposedly-undefined usage. – R.. GitHub STOP HELPING ICE Sep 13 '12 at 19:12
  • It's too bad that C compilers started forbidding the declaration of zero-length arrays; were it not for that prohibition, many compilers wouldn't have had to do any special handling to make them work as they "should", but would still have been able to special-case code for single-element arrays (e.g. if `*foo` contains a single-element array `boz`, the expression `foo->boz[biz()*391]=9;` could be simplified as `biz(),foo->boz[0]=9;`). Unfortunately, compilers' rejection of zero-element arrays means a lot of code uses single-element arrays instead, and would be broken by that optimization. – supercat Oct 06 '12 at 09:16
13

That particular way of doing it is not explicitly defined in any C standard, but C99 does include the "struct hack" as part of the language. In C99, the last member of a struct may be a "flexible array member", declared as char foo[] (with whatever type you desire in place of char).
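A minimal sketch of the C99 form, for comparison with the one-element version in the question. The names `struct flex` and `flex_new` are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct flex {
    int len;
    char s[];    /* C99 flexible array member: must be the last member */
};

/* Hypothetical helper: allocate a struct flex holding a copy of str. */
struct flex *flex_new(const char *str)
{
    assert(str != NULL);
    size_t n = strlen(str);
    /* sizeof(struct flex) does not include s, so the terminator
       needs its own byte here. */
    struct flex *p = malloc(sizeof(struct flex) + n + 1);
    if (p == NULL)
        return NULL;
    p->len = (int)n;
    memcpy(p->s, str, n + 1);
    return p;
}
```

Note the `+ 1`: unlike the `s[1]` version, the struct's size contributes no array element of its own.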

Chuck
  • To be pedantic, that's not the struct hack. The struct hack uses an array with a fixed size, not a flexible array member. The struct hack is what was asked about and is UB. Flexible array members just seem like an attempt to appease the kind of folk seen in this thread complaining about that fact. – underscore_d Jul 01 '16 at 23:49
9

It is not undefined behavior, regardless of what anyone, official or otherwise, says, because it is defined by the standard. p->s, except when used as an lvalue, evaluates to a pointer identical to (char *)p + offsetof(struct T, s). In particular, this is a valid char pointer inside the malloc'd object, and there are 100 (or more, depending on alignment considerations) successive addresses immediately following it which are also valid as char objects inside the allocated object. The fact that the pointer was derived by using -> instead of explicitly adding the offset to the pointer returned by malloc, cast to char *, is irrelevant.
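The pointer identity claimed here can be checked directly. The following is a sketch only (the function name `s_matches_offset` is hypothetical); it shows that both routes yield equal pointers, which is the premise of the argument, not a proof that later arithmetic on them is defined.

```c
#include <assert.h>
#include <stddef.h>

struct T {
    int len;
    char s[1];
};

/* Returns 1 if the decayed p->s equals the address computed from
   the start of the object plus offsetof(struct T, s). */
int s_matches_offset(struct T *p)
{
    assert(p != NULL);
    char *via_member = p->s;
    char *via_offset = (char *)p + offsetof(struct T, s);
    return via_member == via_offset;
}
```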

Technically, p->s[0] is the single element of the char array inside the struct, the next few elements (e.g. p->s[1] through p->s[3]) are likely padding bytes inside the struct, which could be corrupted if you perform assignment to the struct as a whole but not if you merely access individual members, and the rest of the elements are additional space in the allocated object which you are free to use however you like, as long as you obey alignment requirements (and char has no alignment requirements).
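How many padding bytes follow `s[0]` depends on the implementation. A small sketch that computes it (the function name `t_trailing_padding` is made up for this example):

```c
#include <assert.h>
#include <stddef.h>

struct T {
    int len;
    char s[1];
};

/* Number of bytes inside struct T that lie after s[0],
   i.e. trailing padding, if the implementation adds any. */
size_t t_trailing_padding(void)
{
    size_t end_of_s = offsetof(struct T, s) + sizeof(char[1]);
    assert(end_of_s <= sizeof(struct T));
    return sizeof(struct T) - end_of_s;
}
```

On implementations where `int` is 4 bytes with 4-byte alignment the result is commonly 3, but nothing in the standard fixes that value.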

If you are worried that the possibility of overlapping with padding bytes in the struct might somehow invoke nasal demons, you could avoid this by replacing the 1 in [1] with a value which ensures that there is no padding at the end of the struct. A simple but wasteful way to do this would be to make a struct with identical members except no array at the end, and use s[sizeof(struct that_other_struct)]; for the array. Then, p->s[i] is clearly defined as an element of the array in the struct for i<sizeof(struct that_other_struct) and as a char object at an address following the end of the struct for i>=sizeof(struct that_other_struct).

Edit: Actually, in the above trick for getting the right size, you might also need to put a union containing every simple type before the array, to ensure that the array itself begins with maximal alignment rather than in the middle of some other element's padding. Again, I don't believe any of this is necessary, but I'm offering it up for the most paranoid of the language-lawyers out there.

Edit 2: The overlap with padding bytes is definitely not an issue, due to another part of the standard. C requires that if two structs agree in an initial subsequence of their elements, the common initial elements can be accessed via a pointer to either type. As a consequence, if a struct identical to struct T but with a larger final array were declared, the element s[0] would have to coincide with the element s[0] in struct T, and the presence of these additional elements could not affect or be affected by accessing common elements of the larger struct using a pointer to struct T.

R.. GitHub STOP HELPING ICE
  • Can you make arrays of 'hacked structs'? As in `struct hack *p = malloc(42 * (sizeof *p + 100));`? – pmg Sep 14 '10 at 22:42
  • You can, but you'll have to do the subscripting yourself, including handling alignment. `p[1]` will not point to the second struct, but rather will overlap with `p->s[4]` or so... – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:13
  • 4
    You're right that the nature of the pointer arithmetic is irrelevant, but you're *wrong* about access beyond the declared size of the array. See [N1494](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf) (latest public C1x draft) section 6.5.6 paragraph 8 - you're not even allowed to do the *addition* that takes a pointer more than one element past the declared size of the array, and you can't dereference it even if it's just one element past. – zwol Sep 15 '10 at 00:08
  • 2
    @Zack: that's true if the object is an array. It's not true if the object is an object allocated by `malloc` which is being accessed as an array or if it's a larger struct that's being accessed via a pointer to a smaller struct whose elements are an initial subset of the elements of the larger struct, among other cases. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:19
  • 7
    +1 If `malloc` doesn't allocate a range of memory that can be accessed with pointer arithmetic, what use would it be? And if `p->s[1]` is *defined* by the standard as syntactic sugar for pointer arithmetic, then this answer merely reasserts that `malloc` is useful. What is there left to discuss? :) – Daniel Earwicker Oct 16 '10 at 12:46
  • The C Committee says it *is* undefined behavior. See my answer on this question. – ouah Sep 13 '12 at 18:39
  • 3
    You can argue that it's well-defined as much as you like, but that does not change the fact that it is not. The standard is very clear about access beyond the bounds of an array, and the bound of this array is `1`. It's precisely as simple as that. – Lightness Races in Orbit Feb 17 '13 at 13:22
  • 2
    I've still yet to see a valid argument that `p->s` is not a pointer to an element of the `char` array that is the *representation array* of the allocated object. If there is such an argument, that changes things. Any ideas on constructing one? – R.. GitHub STOP HELPING ICE Feb 17 '13 at 16:04
  • Does the C++ concept of *strict pointer safety* not exist in C? Because in C++, your claim that the way in which a pointer value is derived is unimportant is no longer correct. Under *strict pointer safety*, certain operations are valid only for "safely derived" pointers. – Ben Voigt May 03 '13 at 18:34
  • Of course, in C++ the "safely derived" pointer rules don't apply to dynamic allocations... – Ben Voigt May 03 '13 at 18:39
  • 3
    @R.., I think, your assumption that two pointers comparing equal must behave the same is wrong. Consider `int m[1]; int n[1]; if(m+1 == n) m[1] = 0;` assuming the `if` branch is entered. This is UB (and not guaranteed to initialize `n`) as per 6.5.6 p8 (last sentence), as I read it. Related: 6.5.9 p6 with footnote 109. (References are to C11 n1570.) [...] – mafso Dec 08 '14 at 02:25
  • Though this reading of the standard would invalidate a lot of code (and beside what Supercat [mentioned](http://stackoverflow.com/questions/3711233/is-the-struct-hack-technically-undefined-behavior?rq=1#comment11067630_3711364), that's rather relevant to bounds-checking debugging implementations than to optimizations, so probably UB we needn't fear in reality). – mafso Dec 08 '14 at 02:25
  • 1
    @R.. by your logic here, say we have `struct S { char x[5]; }` ... `void f(struct S *s, int *t) { s->x[10000] = 'a'; }` the compiler must actually write `'a'` to the indicated location. In fact `s->x[-500] = 'b';` must also actually write `'b'` in case `s` points to a non-initial subobject of a larger allocated object. Further, it must assume `*t` may have been modified by this. I don't agree that this is the intent of the array bounds rules. – M.M Nov 06 '16 at 21:19
  • However, the [N2090](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2090.htm) proposal does seem to be in line with your view: the definition of pointer provenance is that it belongs to *the allocation*, not to any subobject . But codifying that view would (IMO) require changes in the wording to the section about array arithmetic. `struct T { int a, b, c; } t; ((int *)&t.a)[2] = 0;` is another such example, is that well-defined to write `c` (assuming no padding)? – M.M Nov 06 '16 at 21:24
9

Yes, it is technically undefined behavior.

Note, that there are at least three ways to implement the "struct hack":

(1) Declaring the trailing array with size 0 (the most "popular" way in legacy code). This is obviously UB, since the zero size array declarations are always illegal in C. Even if it does compile, the language makes no guarantees about the behavior of any constraint-violating code.

(2) Declaring the array with the minimal legal size, 1 (your case). In this case any attempt to take a pointer to p->s[0] and use it for pointer arithmetic that goes beyond p->s[1] is undefined behavior. For example, a debugging implementation is allowed to produce a special pointer with embedded range information, which will trap every time you attempt to create a pointer beyond p->s[1].

(3) Declaring the array with a "very large" size like 10000, for example. The idea is that the declared size is supposed to be larger than anything you might need in actual practice. This method is free of UB with regard to array access range. However, in practice, of course, we will always allocate a smaller amount of memory (only as much as really needed). I'm not sure about the legality of this, i.e. I wonder how legal it is to allocate less memory for the object than the declared size of the object (assuming we never access the "non-allocated" members).
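The "pointer with embedded range information" mentioned under (2) can be simulated in plain C to show what such a debugging implementation would reject. This is purely a hypothetical model: `struct checked_ptr`, `checked_from_array` and `checked_advance` are invented for this sketch and do not correspond to any real compiler's representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "fat pointer": an address plus the bounds of the
   declared array it was derived from. */
struct checked_ptr {
    char  *base;   /* first element of the declared array */
    size_t count;  /* declared number of elements */
    size_t index;  /* current position, 0..count (count = one past the end) */
};

/* Make a checked pointer to element 0 of an array of `count` elements. */
struct checked_ptr checked_from_array(char *base, size_t count)
{
    assert(base != NULL);
    struct checked_ptr cp = { base, count, 0 };
    return cp;
}

/* Advance by n elements. Returns 0 on success, or -1 where a checking
   implementation would trap (result beyond one past the end). */
int checked_advance(struct checked_ptr *cp, size_t n)
{
    if (n > cp->count - cp->index)
        return -1;
    cp->index += n;
    return 0;
}
```

With a declared size of 1, advancing by one element succeeds (that is the one-past-the-end position) and any further step is refused, regardless of how much memory `malloc` actually handed out.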

AnT stands with Russia
  • 1
    In (2), `s[1]` is not undefined behavior. It's the same as `*(s+1)`, which is the same as `*((char *)p + offsetof(struct T, s) + 1)`, which is a valid pointer to a `char` in the allocated object. – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:26
  • On the other hand, I'm almost sure (3) is undefined behavior. Whenever you perform any operation which depends on such a struct residing at that address, the compiler is free to generate machine code which reads from any part of the struct. It could be useless, or it could be a safety feature for strict allocation checking, but there's no reason an implementation couldn't do it. – R.. GitHub STOP HELPING ICE Sep 14 '10 at 23:29
  • R: If an array was declared to have a size (is not just the `foo[]` syntactic sugar for `*foo`), then any access beyond the *smaller* of its declared size and its allocated size is UB, regardless of how the pointer arithmetic was done. – zwol Sep 15 '10 at 00:01
  • 1
    @Zack, you're wrong on several things. `foo[]` in a struct is not syntactic sugar for `*foo`; it's a C99 flexible array member. For the rest, see my answer and comments on other answers. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:21
  • I was talking about `void x(char foo[])`, and no, I repeat, because the array in the struct hack *has a declared size*, it is the *smaller* of the declared size and the allocated size that counts for purpose of the "past the end of the array" language in 6.5.6p8. So you're still wrong. Neener. – zwol Sep 15 '10 at 00:39
  • ... I confess that I can't find positive language in the standard to back up that assertion. However, if things were as you claim, then most of section 6.7.2.1p18 (especially the "as if that member were replaced with the largest array that ..." language) would be unnecessary. Therefore I stand on my interpretation. – zwol Sep 15 '10 at 00:49
  • 6
    The issue is that some members of the committee desperately **want** this "hack" to be UB, because they envision some fairyland where a C implementation could enforce pointer bounds. For better or worse, however, doing so would conflict with other parts of the standard - things like the ability to compare pointers for equality (if bounds were encoded in the pointer itself) or the requirement that any object be accessible via an imaginary overlaid `unsigned char [sizeof object]` array. I stand by my claim that the flexible array member "hack" for pre-C99 has well-defined behavior. – R.. GitHub STOP HELPING ICE Sep 15 '10 at 00:56
  • @R: If a struct is defined as having an element "char arr[2]", is there anything in the standard that would require an expression "n = myptr->arr[i]" to handle values of i greater than 255-offsetof(arr)? On small embedded micros, it may be much faster to use 8-bit arithmetic to compute the offset into the structure, and then add that to a 16-bit pointer, than it would be to use 16-bit math throughout. If the value of i cannot legally exceed 1, is there any reason a compiler should generate address-computation code to handle larger values? – supercat Nov 17 '11 at 20:50
  • Regarding method 3, the committee response to [DR 051](http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_051.html) suggests that the committee intends it to be legal – M.M Nov 06 '16 at 21:04
5

The standard is quite clear that you cannot access things beyond the end of an array (and going via pointers does not help, as you are not even allowed to increment a pointer further than one past the end of the array).

As for "working in practice": I've seen the gcc/g++ optimizer use this part of the standard and thus generate wrong code when it meets this invalid C.

2

If a compiler accepts something like

typedef struct {
  int len;
  char dat[];
} MY_FLEX_STRUCT;

I think it's pretty clear that it must be ready to accept a subscript on 'dat' beyond its length. On the other hand, if someone codes something like:

typedef struct {
  int whatever;
  char dat[1];
} MY_STRUCT;

and then later accesses somestruct->dat[x]; I would not think the compiler is under any obligation to use address-computation code which will work with large values of x. I think if one wanted to be really safe, the proper paradigm would be more like:

#define LARGEST_DAT_SIZE 0xF000
typedef struct {
  int whatever;
  char dat[LARGEST_DAT_SIZE];
} MY_STRUCT;

and then do a malloc of (sizeof(MY_STRUCT)-LARGEST_DAT_SIZE + desired_array_length) bytes (bearing in mind that if desired_array_length is larger than LARGEST_DAT_SIZE, the results may be undefined).
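The size computation just described can be sketched as a small function. The name `my_struct_alloc_size` is hypothetical, and the sketch covers only the arithmetic; whether using an object allocated smaller than its declared type is conforming is a separate question that this thread debates.

```c
#include <assert.h>
#include <stddef.h>

#define LARGEST_DAT_SIZE 0xF000

typedef struct {
    int whatever;
    char dat[LARGEST_DAT_SIZE];
} MY_STRUCT;

/* Bytes to request for a MY_STRUCT whose dat is meant to hold
   desired_array_length chars. */
size_t my_struct_alloc_size(size_t desired_array_length)
{
    /* Larger requests would exceed the declared array bound. */
    assert(desired_array_length <= LARGEST_DAT_SIZE);
    return sizeof(MY_STRUCT) - LARGEST_DAT_SIZE + desired_array_length;
}
```

The result would then be passed to `malloc`, e.g. `MY_STRUCT *p = malloc(my_struct_alloc_size(n));`.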

Incidentally, I think the decision to forbid zero-length arrays was an unfortunate one (some older dialects like Turbo C support it) since a zero-length array could be regarded as a sign that the compiler must generate code that will work with larger indices.

supercat