Does casting an array to a char* imply a limit on the string's length?

Question

What should this code print?

#include <stdio.h>
#include <string.h>

struct S
{
    int x[1];
};

union U
{
    struct S arr[64];
    char s[256];
};

int main()
{
    union U u;
    strcpy(u.s, "abcdefghijklmnopqrstuvwxyz");
    size_t len = strlen((char*)&u.arr[1].x);
    puts(len > 10 ? "YES" : "NO");
    return 0;
}

Clang always prints "YES". GCC 8.1 prints "NO" when optimizations are enabled, though it emits no warnings. Is it taking advantage of some undefined behavior?

The type of `&u.arr[1].x` is `int (*)[1]`. Casting it to `char*` and reading more than `sizeof(int)` bytes is an out-of-bounds access. — Ajay Brahmakshatriya, Jun 21 '18 at 07:49
It doesn't matter that you have valid memory AFTER the array of size `1`. If you derive the pointer using the base pointer of small size and read from it, the behavior is undefined. — Ajay Brahmakshatriya, Jun 21 '18 at 07:50
Also I would have used `u.arr[1].x` or `&u.arr[1].x[0]` for the test. I wonder why you used the `&u.arr[1].x` — Ajay Brahmakshatriya, Jun 21 '18 at 07:52
With `gcc 8.1`, if you change the array to `int[3]` instead of `int[1]`, it stops printing `NO` directly. So surely it is identifying an out-of-bounds access, since `10 < 12` but `10 > 8`. See https://godbolt.org/g/fmbPyE vs https://godbolt.org/g/dQCY8V — Ajay Brahmakshatriya, Jun 21 '18 at 07:58
"I wonder why you used the `&u.arr[1].x`" To make it easy to tweak the type/arity of `x`. — Vladimir Panteleev, Jun 21 '18 at 08:06
FWIW - gcc version 6.3.1 prints YES - with and without optimization flag — Support Ukraine, Jun 21 '18 at 08:07

Ajay Brahmakshatriya · Answer 1 · 2018-06-21T08:12:38.477

Yes, gcc 8.1 is taking advantage of undefined behavior. The call to strlen makes an out-of-bounds access on an array of one int.

strlen((char*)&u.arr[1].x);

The type of &u.arr[1].x is int (*)[1], which you then cast to char*. The address of an array has the same value as the address of its first element, so before the cast the pointer has the same value as &u.arr[1].x[0], and the object it points to has type int[1]. Assuming sizeof(int) == 4, reading more than 4 bytes through it is an out-of-bounds access.

It doesn't matter that you have valid memory AFTER the array of size 1. If you derive a pointer from a smaller object and read past the end of that object through it, the behavior is undefined.

You can confirm that this is the exact reason by changing the array size to 1, 2 and 3 and checking the assembly gcc generates.

For 1 and 2 it generates puts("NO"). But for 3 it generates the expected code. This is because you are comparing against 10. With int[1] or int[2] the array holds at most 8 bytes, so the length can never be greater than 10 without invoking UB. With int[3] the array holds 12 bytes, so a length of 11 is possible.

You can compare the generated assembly for both cases: array of size 3 vs array of size 2.

You might also want to see this old question of mine for a similar discussion with 2D arrays.

Note however the special rule in 6.3.2.3/7 that allows one to iterate over any object by using a character pointer. So if the input had not been a single array out-of-bounds, but rather a whole struct or the array of structs, I would expect a different behavior. — Lundin, Jun 21 '18 at 11:46
@Lundin yes if the single struct had a size greater than 11 bytes, the behaviour would have been as expected. — Ajay Brahmakshatriya, Jun 21 '18 at 12:25

supercat · Accepted Answer · 2018-06-22T20:45:35.333

Implementations that are suitable for systems programming will allow a pointer to an inner object to be used to derive pointers to containing objects. The C Standard does not, however, seek to require that all conforming implementations be suitable for any purpose whatsoever (the authors acknowledge in the rationale that it would be possible to construct a conforming implementation which is of such low quality as to be essentially useless), much less that they all be suitable for systems programming. On the other hand, it does describe a fairly easy means by which an implementation intended for systems programming can provide the necessary semantics.

In particular, while the Standard does not mandate that a direct cast from T* to V* will behave as a conversion from T* to U*, followed by a conversion from U* to V* if there exists some type U* supporting round-trip conversions to/from T* and V*, such behavior was certainly commonplace when it was written. Many actions whose behavior would otherwise not be defined by the Standard would be defined on an implementation that guarantees that pointer casts behave transitively.

Among other things, the Standard specifies that a pointer to an aggregate (array, struct, or union), suitably converted, will yield a pointer to its first element/member and vice versa. Thus, converting &u.arr[0].x[0] to an int(*)[1], converting that to a struct S*, then to a union U*, and then finally to a char*, would yield a char* which can be used to index the entire structure. While the Standard may allow a conforming implementation to treat a cast to char* in a way that only allows access to the specific "inner" object whose address was converted, it hardly implies that implementations should do so, nor that such a restriction would not make an implementation unsuitable for systems programming.

PS--I could certainly see benefits to a range-limiting qualifier that would indicate that a pointer to a particular object will not be used to derive the address of anything outside that object. Given something like:

struct foo {int x,y,z; };
...
int test(struct foo restrict *it)
{
  it->y++;
  doSomething(&it->x);
  it->y--;
  return it->y;
}

the existence of such a qualifier on the parameter to doSomething() would allow a compiler to optimize out the operations on it->y whether or not it knew anything about the code for doSomething(). Note, however, that to be most useful such a qualifier would require that--as with restrict--operations that would normally launder the pointer would not erase its effects. Consequently, it makes more sense to treat unqualified casts as laundering pointers to the extent possible than to treat casts as yielding range-limited pointers except when explicitly laundered.

Good insight, thanks. Looks like GCC maintainers ruled this "not a bug", and the program as undefined, in the end: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86259 — Vladimir Panteleev, Jun 22 '18 at 04:54
@VladimirPanteleev: I wonder why the authors of gcc don't realize that (1) the authors of the Standard intended that C be regarded as a language suitable for writing *non-portable* programs; use of actions not defined by the Standard may make a program non-portable, but they in no way imply that it is "broken"; and (2) unless targeted toward very narrow application fields, quality implementations should seek to efficiently process as wide a range of useful programs as practical. A quality compiler should not need to have nearly all optimizations disabled to reliably compile an OS. — supercat, Jun 22 '18 at 05:09
