Variable-length strings in structures and undefined behavior causing time travel

Question

In this Stackoverflow question or in the article Undefined behavior can result in time travel (among other things, but time travel is the funkiest) one may learn that accessing data structures at indexes greater than their size is undefined behavior, and when a compiler sees undefined behavior, it generates crazy code without even reporting that undefined behavior was encountered. This is done to make code run a few nanoseconds faster and because the standard permits it. So-called "time travel" appears because the compiler operates on control-flow branches, and when it sees undefined behavior in a branch, it just deletes that branch (on the basis that any behavior will do in place of undefined behavior).
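
For illustration, here is a minimal sketch of the kind of off-by-one loop such articles use. The names `table` and `exists_in_table` and the table contents are assumptions made for this example. The block compiles the well-defined loop and describes the undefined variant only in a comment, so nothing here actually invokes undefined behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative data; the values are arbitrary. */
static int table[4] = {10, 20, 30, 40};

/* Well-defined version: the loop stops at the last valid index, 3. */
bool exists_in_table(int v)
{
    assert(sizeof table / sizeof table[0] == 4);
    for (int i = 0; i < 4; i++) {
        if (table[i] == v)
            return true;
    }
    return false;
}

/*
 * If the condition were written `i <= 4`, a call that found no match
 * in table[0..3] would go on to read table[4], which is out of bounds.
 * An optimizer may assume that read never happens, conclude that the
 * loop must always return early, and reduce the whole function to
 * `return true;` without emitting any diagnostic.
 */
```

With the correct bound, `exists_in_table(99)` returns `false`. With the `<=` bound, an optimizing compiler would be permitted to make the same call return `true`.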

Nevertheless, here is an old idiom:

struct myString {
    int length;
    char text[1];
};

used as

    char* s = "hello, world";
    int len = strlen(s);
    myString* m = malloc(sizeof(myString) + len);
    m->length = len;
    strcpy(&m->text,s);

and now, what will happen if I access m->text[3]? (Note how it is declared.) Will the compiler take it as undefined behavior? If yes, how do I add an array of statically unknown amount of items to the end of a structure?

In particular, I am interested in an array whose size is at least 1, but maybe more. Sort of,

struct x {
    unsigned u[1];
};

and accessed like `struct x* p; ... p->u[3]`.

UPD related: Is the "struct hack" technically undefined behavior? (as @mafso has noted in a comment).

This is a [flexible array member](http://stackoverflow.com/questions/20221012/unsized-array-declaration-in-a-struct/20221073#20221073); the version you show uses the [old C90 style](https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html), but gcc probably still supports it. — Shafik Yaghmour, Jul 11 '14 at 13:37
If you want variable-length strings in C++ you should use `std::string`. — Some programmer dude, Jul 11 '14 at 13:40
@ShafikYaghmour What he shows wasn't legal in C90? It _was_ widely used, and C99 added a special feature which allows the same functionality without making bounds checking generally illegal. — James Kanze, Jul 11 '14 at 13:49
At least one compiler generates an intentional crash in some cases of undefined behavior. — James Kanze, Jul 11 '14 at 13:50
I hope that nobody would do that in C++ (since you tagged it such). There are no efficiency gains compared to a properly written container. Come to think of it, one should write a container in C as well (fake a class with a factory function as ctor replacement that allocates behind the scenes and proper access functions as member replacements). In C we lose automatic freeing (no dtors) but the original approach needs manual free as well. — Peter - Reinstate Monica, Jul 11 '14 at 13:59
related: http://stackoverflow.com/questions/3711233/is-the-struct-hack-technically-undefined-behavior — mafso, Jul 11 '14 at 14:02
Note that the name for the technique is **Struct Hack**. It is seldom officially supported (it is not recognized as valid by any C — or C++ — standard, for starters) but it usually works in C if you are careful. In C++, you simply should not be using the technique at all. — Jonathan Leffler, Jul 11 '14 at 17:52
@PeterSchneider in embedded programming, C++ is often used as C with syntactic sugar like operator redefinition and functions in structures. Even in Android `std::string` was a problem at least a while ago. — 18446744073709551615, Jul 14 '14 at 07:05
@18446744073709551615 I agree (even if Matt doesn't). For example around the year 2000 Windriver's clib's malloc (which was underlying new()) was so bad that it proved unusable for short lived objects like strings for our uses (infotainment GUI). The memory would just fragment too much. But we indeed implemented our own which would only resort to dynamic allocation above a certain string length. — Peter - Reinstate Monica, Jul 14 '14 at 10:55
@MattMcNabb I assume that 18.. was oversimplifying. Some systems are perceived as too small for the STL, sometimes correctly. And then there are a lot of embedded C programmers out there who never enjoyed a proper education as SW engineers. They usually are indispensable experts in their field with a lot of programming experience on small systems but they are not expert programmers and much less software designers. They just gradually grow into C++ because of added benefits compared to C, and luckily -- by design -- the language supports this transition. ... — Peter - Reinstate Monica, Jul 14 '14 at 11:05
... So one cannot expect elaborate inheritance hierarchies or consistent RAII patterns from them. Not yet. For many organizations I have worked with it is a challenge to enable their work force to transition through the IT changes their field is experiencing. As we all know many embedded systems today are comparable to typical PCs of not so many years ago. So bear with them ;-). They are no NOOBs and no dinosaurs either. They are indispensable HRes who -- like all of us -- must be continuously re-enabled. — Peter - Reinstate Monica, Jul 14 '14 at 11:13

score 2 · Answer 1 · answered Jul 11 '14 at 13:47

It may be an old idiom, but it is undefined behavior in both C and C++. Since C99, you can write something like:

struct MyString
{
    int length;
    char text[];
};

and use it as you describe (although you will probably need to add 1 to the length in the malloc). In C++, you need to jump through a few more hoops:

struct MyString
{
    int length;
    char* text()
    {
        return reinterpret_cast<char*>( this + 1 );
    }
};

For anything other than char, however, you'll need to watch out for alignment restrictions, since the compiler doesn't know that the end of the struct must be aligned correctly for what follows. (G++ uses, or at least used, something like this in its implementation of std::basic_string. And instantiations like std::basic_string<double> would crash on machines where size_t was only 4 bytes, and accessing a double required 8 byte alignment.)
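
A minimal C sketch of the C99 flexible array member usage described above. The helper name `my_string_new` is hypothetical, and `length` is declared as `size_t` here rather than `int` so that it matches the return type of `strlen`; neither choice comes from the answer itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* C99 flexible array member: text has no declared size and must be
   the last member of the struct. */
struct MyString {
    size_t length;
    char text[];
};

/* Hypothetical helper: allocate a MyString holding a copy of s.
   Returns NULL if allocation fails; the caller releases it with free(). */
struct MyString *my_string_new(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);

    /* sizeof(struct MyString) does not include text, so add len bytes
       for the characters plus 1 for the terminating '\0'. */
    struct MyString *m = malloc(sizeof(struct MyString) + len + 1);
    if (m == NULL)
        return NULL;

    m->length = len;
    memcpy(m->text, s, len + 1);   /* copies the '\0' as well */
    return m;
}
```

After `struct MyString *m = my_string_new("hello, world");`, reading `m->text[3]` is well defined, because the flexible array member is treated as having as many elements as the allocation has room for.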

James Kanze


@Manu343726 `this + 1` works regardless of what the object is, whereas your suggestion only works if `sizeof(*this) == sizeof(int)`. – M.M Jul 14 '14 at 09:57
@Manu343726 First, `this + sizeof(int)` would likely increase the pointer too much, since it will add `sizeof(int) * sizeof(*this)` bytes to the pointer. And second, as others have already said, `this + 1` works regardless of the contents of `MyString`. – James Kanze Jul 14 '14 at 12:50
@MattMcNabb `this + sizeof(int)` doesn't work even if `sizeof(*this) == sizeof(int)`. The type of `this` is `MyString*`, so anything you add to it will be multiplied by `sizeof(MyString)`. – James Kanze Jul 14 '14 at 12:52
@Manu343726 You're missing an understanding of how pointer arithmetic works. Adding one to a pointer advances it to the next element in an array. – James Kanze Jul 14 '14 at 12:53
That's exactly the problem, I was thinking like if it was char pointer arithmetic. Thanks – Manu343726 Jul 14 '14 at 12:58
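
The scaling rule discussed in these comments can be checked with a small C sketch, in which a plain struct pointer stands in for the C++ `this`. The `slots` array and the `bytes_advanced` helper are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the answer's header struct. */
struct MyString {
    int length;
};

/* Backing array so that p + n stays within, or one past, a real object. */
#define SLOTS 16
static struct MyString slots[SLOTS];

/* Returns how many bytes lie between p and p + n
   for a pointer p of type struct MyString *. */
size_t bytes_advanced(size_t n)
{
    assert(n <= SLOTS);

    struct MyString *p = slots;
    struct MyString *q = p + n;    /* n is scaled by sizeof(struct MyString) */

    /* Converting to char * measures the distance in bytes. */
    return (size_t)((char *)q - (char *)p);
}
```

Here `bytes_advanced(1)` equals `sizeof(struct MyString)`, whereas `bytes_advanced(sizeof(int))` equals `sizeof(int) * sizeof(struct MyString)`, which is why adding `sizeof(int)` overshoots.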

score 1 · Answer 2 · answered Jul 11 '14 at 13:41

To answer your question, the compiler does no bounds checking of array indexes, which is why flexible arrays work. You can use any index in any array.

Some programmer dude


Some compilers have no bounds checking on array indices. The C standard was carefully written so that bounds checking would be legal, and his code contains undefined behavior, even if it works on most compilers. – James Kanze Jul 11 '14 at 13:48
@JamesKanze exactly, on most of _today's_ compilers; and I care about things like future versions, or the same version invoked with a different command-line switch. – 18446744073709551615 Jul 11 '14 at 13:55
@18446744073709551615 For what it's worth, I believe that CenterLine once had a compiler which _did_ bounds checking, and would cause the program to terminate in such cases. It wasn't meant for production use, because it ran a lot slower, but it was a lot like the checked iterators that you get in most modern C++ compilers. – James Kanze Jul 11 '14 at 15:14

score 0 · Answer 3 · answered Jul 11 '14 at 15:27

You seem to have an odd idea about the meaning of "undefined behavior". Compilers are not required to recognize code that results in undefined behavior, and if they do, they are not required to do anything in particular with it. One very reasonable approach, in fact, is to do nothing special with it at all. The undefinedness is not about the machine code the compiler may generate, but rather about the effect of running it.

With that said, there is a subtle but very important difference between arrays and pointers. Arrays (such as char text[1]) have associated storage, whereas pointers (char *text or char text[]) do not.

what will happen if I access m->text[3]? (Note how it is declared.) Will the compiler take it as undefined behavior?

Well, the first undefined behavior is associated with this earlier statement:

strcpy(&m->text,s);

m->text refers to a char array of length 1, and the strcpy() will write past its end. As far as definedness goes, the fact that you reserve extra space in the block allocated for your struct is irrelevant. In practice, it probably does what you intend, but it's a terrible idea to depend on that.

The same applies to accessing m->text[3]. The compiler is more likely to recognize the out-of-bounds access, but even if it does, the effect is likely to be what you intend. "Likely", but not certain. That's why relying on undefined behavior is to be avoided.

`char text[]` is not a pointer, except for in a [function parameter list](http://stackoverflow.com/questions/22677415/why-do-c-and-c-compilers-allow-array-lengths-in-function-signatures-when-they/22677793#22677793). Pointers do have associated storage: the amount of memory required to store the pointer value. `&m->text` is a type mismatch, the code is ill-formed. — M.M, Jul 14 '14 at 09:56
