39

The professor of a systems programming course I'm taking told us today to define a struct with a zero-length array at the end:

struct array{
    size_t size;
    int data[0];
};

typedef struct array array;

This struct is useful for creating an array whose length is only known at run time, for example:

array *array_new(size_t size){
    array* a = malloc(sizeof(array) + size * sizeof(int));

    if(a){
        a->size = size;
    }

    return a;
}

That is, using malloc(), we also allocate memory for the array of size zero. This is completely new to me, and it seems odd, because, from my understanding, structs do not have their elements necessarily in continuous locations.

Why does the code in array_new allocate memory to data[0]? Why would it be legal to access then, say

array * a = array_new(3);
a->data[1] = 12;

?

From what he told us, it seems that an array defined as length zero at the end of a struct is ensured to come immediately after the last element of the struct, but this seems strange, because, again, from my understanding, structs could have padding.

I've also seen around that this is just a feature of gcc and not defined by any standard. Is this true?

haccks
nbro
  • 2
    'structs do not have their elements necessarily in continuous locations' - arrays in structs do. – Martin James Apr 12 '16 at 15:03
  • In such cases, I usually use [INT_MAX], rather than 0. Using 0 can result in problems with bounds-checking. – Martin James Apr 12 '16 at 15:05
  • 1
    @MartinJames how do you compute the size to pass to `malloc` afterwards ? – Quentin Apr 12 '16 at 15:07
  • 13
    This is deprecated syntax, after C99 the size needs to be empty rather than anything else (including 0) to make it well defined behavior. – user3528438 Apr 12 '16 at 15:08
  • @Quentin dunno - it's not a fixed thing. Depends on app, code, data, whatever:) Usually, I use a separate struct type for all except the last array, so it's easy to get the size of that bit. – Martin James Apr 12 '16 at 15:10
  • 1
    @nbro if there were, for example, any padding between `size` and `data`, that would be accounted for in the `sizeof(array)` term used in the `malloc` expression. Accesses to the `data` array in the structure subsequently are relative to the symbol, `data`, and therefore don't involve the padding. – lurker Apr 12 '16 at 16:31
  • 3
    I believe this kind of hack may be common when programming for embedded devices or other "exotic platforms". But in this case, most of the time you are using a custom compiler for the platform that has various extensions/limitations, so there's no point in following the C standard when the only compiler able to produce code for that strange CPU isn't standard-conforming... – Bakuriu Apr 12 '16 at 18:50
  • Why on Earth would it be defined as an array of length 0, rather than the far more obvious "int *data;"? A construct I use fairly often myself. – jamesqf Apr 12 '16 at 21:48
  • 1
    @jamesqf because an array is not a pointer. Making only that change to the above example will *absolutely not* result in working code. – Alex Celeste Apr 12 '16 at 22:07
  • I cringe at this code every time I look upon it – Jerfov2 Apr 13 '16 at 01:39
  • @Leushenko: Only because the rest of the code is rather messed up. One could do e.g "array->data = (int *) malloc (array->size * sizeof (int));", then reference array->data [n], with n being less than size, which is much easier to understand. – jamesqf Apr 13 '16 at 04:53
  • 1
    @jamesqf Your way *is* easier to understand, but the point of flexible array members (and their predecessors) is to avoid making two allocations and needing two pointer dereferences when accessing the array; code that uses these constructs accepts the cost in maintainability in exchange for improved efficiency (which can be substantial, particularly if there are many small instances of these objects). – zwol Apr 13 '16 at 15:40
  • @zwol: If I really needed to eliminate the extra malloc, I would do something like "array = (array *) malloc (sizeof (array) + n * sizeof (int)); array->data = (int *) array + sizeof (array);" Though if I had to allocate a bunch of them, I'd create a pool that's alloc'd a bunch at a time. – jamesqf Apr 14 '16 at 17:15
  • @jamesqf See the comments on [this answer](https://stackoverflow.com/questions/36577094/array-of-size-0-at-the-end-of-struct/36588194#36588194) for why that's buggy. – zwol Apr 14 '16 at 17:56
  • @zwol: Can't see why that'd be buggy. Certainly less so (IMHO, anyway) than being unclear about what you're doing. – jamesqf Apr 15 '16 at 20:49
  • @jamesqf Wasn't it clear from the comments on the answer? The alignment of the array may be wrong if you do it that way. – zwol Apr 15 '16 at 21:59

5 Answers

38

Standard C now has a feature for this, called the flexible array member, specified in C11, chapter §6.7.2.1.

Quoting the standard,

As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member. In most situations, the flexible array member is ignored. In particular, the size of the structure is as if the flexible array member were omitted except that it may have more trailing padding than the omission would imply. [...]

The syntax should be

struct s { int n; double d[]; };

where the last member has an incomplete array type (no array dimension, not even 0).

So your code should instead look like

struct array{
    size_t size;
    int data[ ];
};

to be standard-conforming.

Now, coming to your example: a 0-sized array was a legacy way (a form of the "struct hack") of achieving the same thing. Before C99, GCC supported this as an extension to emulate flexible array member functionality.
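To make the standard-conforming version concrete, here is a minimal sketch of the question's array_new rewritten around a flexible array member. The overflow guard and the array_set helper are illustrative additions, not part of the original code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* C99 flexible array member: no dimension at all, not even 0. */
struct array {
    size_t size;
    int data[];
};

/* Allocate the header and `size` ints in one block.
   Returns NULL on size overflow or allocation failure. */
struct array *array_new(size_t size)
{
    /* Illustrative guard: reject sizes whose byte count would wrap around. */
    if (size > (SIZE_MAX - sizeof(struct array)) / sizeof(int))
        return NULL;

    struct array *a = malloc(sizeof(struct array) + size * sizeof(int));
    if (a)
        a->size = size;
    return a;
}

/* Hypothetical bounds-checked store into the trailing array. */
void array_set(struct array *a, size_t i, int value)
{
    assert(a != NULL && i < a->size);
    a->data[i] = value;
}
```

A caller would write something like array_set(array_new(3), 1, 12) after checking the result for NULL, and release the whole object with a single free().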

Sourav Ghosh
  • @user3528438 elaborated a bit to clear the context. Added relevant link, too. – Sourav Ghosh Apr 12 '16 at 15:12
  • 8
    In really old code, before either the C99 feature or the GCC extension existed, you will see `data[1]` used almost exactly the same way, relying on the general lack of bounds checking in C. This (unlike the extension and the C99 feature) is *arguably* UB-provoking, depending on what you think the repeatedly-revised-yet-still-in-need-of-major-revision definition of an "object" in the C standard means. – zwol Apr 12 '16 at 18:43
  • @zwol: The Standard imposes no requirements upon an implementation's behavior if code indexes past the end of the space declared for an array or sub-array, regardless of whether allocated storage exists for additional items. The only thing which is "arguable" is the extent to which implementations can be deemed to have defined behavior despite the Standard's failure to do so. As for "object" and related issues, what's needed IMHO is to clearly define two or more C dialects with different memory semantics--one with more precise semantics than present, and one with stricter semantics, ... – supercat Apr 13 '16 at 15:15
  • ...but explicit means of saying that memory should be recycled [treated as holding Unspecified values of a given type] or reinterpreted. That would avoid the impossible task of saying what the present standard means (impossible because there exist programs and compilers which rely upon contradictory interpretations, and neither application programmers nor compiler writers can be expected to agree that those existing programs or compilers are illegitimate). – supercat Apr 13 '16 at 15:18
  • @supercat I am sure you are aware that there are people who will _vehemently_ disagree with the assertion that "The Standard imposes no requirements upon an implementation's behavior if code indexes past the end of the space declared for an array or sub-array, regardless of whether allocated storage exists for additional items." (At least some of) those people will cite the definition of "object" as backing up their contention that indexing such an array past the end is allowed as long as it fits within the limits of actually-allocated storage. – zwol Apr 13 '16 at 15:32
  • @zwol: Given `int foo[2][4];`, `foo[0]+4` is a pointer that points one past `foo[0][3]`, but cannot be legitimately dereferenced even though it will compare equal to `foo[1][0]`, which can. I'm not positive about C89, but C99 makes abundantly clear that a compiler is allowed to assume that `foo[0][x]` and `foo[1][y]` will never alias, regardless of the values of x and y. Far more relevant, IMHO, than whether the Standard mandates a particular behavior would be determining whether or not decent compiler should ever do anything else, *regardless of what the Standard says or fails to say*. – supercat Apr 13 '16 at 15:44
  • @zwol: To put things another way: the Standard was never intended to document everything a decent compiler should do. Instead, it was intended to document those things which wouldn't be clear to any well-meaning person trying to write a decent compiler. If the Standard didn't say that a compiler must not assume a write to an `unsigned*` can't modify an `int`, a compiler writer might make that assumption about an `unsigned*` of unknown origin. On the other hand, the authors of the Standard probably figured that any decent compiler writer would recognize that if a `float*` is converted... – supercat Apr 13 '16 at 16:07
  • ...to `int*` and immediately dereferenced, such an operation would be likely to modify a `float` (but might also be capable of modifying an `int`), and so there was no need for the Standard to specify that. If one takes such a view of the Standard, it won't matter whether the Standard requires compilers to do things they obviously should. If instead one regards the Standard as being all-inclusive, the language described thereby will be rather anemic and useless. Unfortunately, the latter viewpoint is becoming more popular, even though it requires ignoring the rationale. – supercat Apr 13 '16 at 16:11
26

Your professor is confused. They should go read what happens if I define a zero size array. This is a non-standard GCC extension; it is not valid C and not something they should teach students to use (*).

Instead, use the standard C flexible array member. Unlike your zero-size array, it is guaranteed to work portably:

struct array{
    size_t size;
    int data[];
};

A flexible array member contributes no elements to the size that sizeof reports for the struct (the result may still include trailing padding), allowing you to do things like:

malloc(sizeof(array) + sizeof(int[size]));

(*) Back in the 90s people used an unsafe exploit to add data after structs, known as the "struct hack". To provide a safe way to extend a struct, GCC implemented the zero-size array feature as a non-standard extension. It became obsolete in 1999 when the C standard finally provided a better way to do this.
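As a rough illustration of what sizeof reports for a struct with a flexible array member, the sketch below compares a sizeof-based and an offsetof-based byte count. The struct packet type and both function names are made up for this example, and the static_assert states a relation that holds on common compilers rather than one quoted from the standard:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the char member before data[] typically
   leaves trailing padding that sizeof includes. */
struct packet {
    size_t size;
    char   tag;
    char   data[];
};

/* Assumption checked at compile time: the struct size covers
   everything up to the start of the flexible array member. */
static_assert(sizeof(struct packet) >= offsetof(struct packet, data),
              "sizeof should reach at least the start of data");

/* Byte count using the idiom from the answer. */
size_t packet_bytes_sizeof(size_t n)
{
    return sizeof(struct packet) + n * sizeof(char);
}

/* Byte count measured from where data actually begins. */
size_t packet_bytes_offsetof(size_t n)
{
    return offsetof(struct packet, data) + n * sizeof(char);
}
```

The sizeof-based count is never smaller than the offsetof-based one, so the idiom in the answer may allocate a few spare bytes but never too few.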

Community
Lundin
  • 2
    What "unsafe exploit" are you talking about? The only one I remember was initialising the array with size 1 and then using the offset to the first element in the array for the size of the basic struct. Which from my understanding is perfectly safe (if a bit roundabout and has the disadvantage of not enabling people to have structs with 0 arrays) - or does that cause UB somehow too? – Voo Apr 12 '16 at 16:28
  • @Voo you could always malloc the too-small struct, cast to the pointer to the struct with the 1 element array, then trust people not to touch that last element. Quite unsafe, as many people reasonably presume that given `T*`, you can `memcpy` `sizeof(T)` bytes in/out of it... which this hack wouldn't permit. – Yakk - Adam Nevraumont Apr 12 '16 at 17:53
  • 1
    @Voo That's quite unsafe, because there might be padding bytes at the end of the array, and you end up writing data into padding bytes, where there are no guarantees of value preservation. The compiler is free to assume there's nothing of value in the padding bytes. – Lundin Apr 12 '16 at 17:54
  • @Lundin No you don't write to any padding bytes. `offsetof(struct, arr)` gives you the offset to the start of the array *including* the padding bytes in between. Not sure I understand your hack though - clearly you have to allocate the whole size - including the array and all padding. That `sizeof(T)` doesn't work is a given (how would it?) but the C99 variant doesn't help with that either. – Voo Apr 12 '16 at 17:56
  • 1
    @Voo: Yes, you do write to padding bytes. `struct {int size; char data[1];};` usually contains three padding bytes after `data`, which any writes to `data[1]` would write to. OTOH, the use of `malloc` managing the memory arguably made it safer. The C spec contradicts itself in this area. – Mooing Duck Apr 13 '16 at 04:28
  • @MooingDuck Any reference for that claim? Obviously arrays of struct would need padding to be aligned, but are you claiming that if I have something like `struct { struct {int size; char data[1];} member; char x;}`, that the compiler would pointlessly waste 3 bytes for padding we don't need? That sounds unlikely to me - the start of an object has to be aligned, but it does not have to end on the same alignment boundary. And since we use malloc, I'm not sure how applicable this would be anyhow. – Voo Apr 13 '16 at 06:39
  • 1
    @Voo What has offsetof to do with anything. Like Mooing Duck says, there might be padding bytes at the end. Instead of arguing, maybe run that code in a compiler and learn something. That code you just posted yields size 12 bytes on my Mingw compiler, thus containing a total of 6 padding bytes, 3 at the end of the inner struct and 3 at the end of the outer struct. – Lundin Apr 13 '16 at 06:44
  • 1
    @Voo: The size of any structure will be a multiple of its required alignment, which must in turn be a multiple (typically exactly 1x) of the coarsest alignment of any of its members. If `int` is required to be four-byte aligned, then the indicated `member` structure would need to be padded to 8 bytes, and the outer struct padded to 12. The relevance to `malloc` is that if code uses the struct hack and the size of the struct with the single item array at the end isn't a multiple of the alignment, the value passed to `malloc` will be slightly bigger than it needs to be. – supercat Apr 13 '16 at 15:22
  • @Voo: As the others stated: `int` has to be aligned to a specific boundary (usually 4 bytes). To make an array of your struct, the `int size` must always lie on a address that's a multiple of four, so there must be padding between the array of an element and the `int` of the next element in the bigger array. But struct pointer increment must point at the next element in the array, so the easiest way to make this happen is to simply always make that padding part of the struct itself. – Mooing Duck Apr 13 '16 at 15:56
  • `malloc(sizeof(array) + sizeof(int[size]));` should actually be `malloc(offsetof(array, data[size]))` or some equivalent, to avoid the problem of allocating too much. The size of the padding in the `array` struct is undefined, but it may overlap with `data` – Swift - Friday Pie May 25 '22 at 20:19
  • @Swift-FridayPie No that is not correct. `sizeof(array)` already includes padding. Flexible array members were explicitly designed for this purpose: to avoid all manner of strange hacks that people used prior C99 ("the struct hack"). – Lundin May 30 '22 at 06:24
  • @Lundin experiments with a compliant compiler (which explicitly excludes the MS one, for example) show otherwise, as does the phrasing in the C standard that padding at the end of the struct can be larger than in a struct without the incomplete array. It wasn't legalizing the original struct hack, because the array can have its own alignment requirements. – Swift - Friday Pie May 31 '22 at 14:59
  • @Lundin `struct S{ short a; char data[]; };` on latest gcc for me gave `sizeof(S)` == 8, which is unexpected. `offsetof(S, data[0])` or a manual calculation gives 2 - expected. Inspecting memory after writing to data also matches. In most cases this doesn't cause issues if we overcommit, unless we deal with tightly packed preexisting superstructure and those gaps break it – Swift - Friday Pie May 31 '22 at 15:04
  • @Swift-FridayPie How exactly did you compile? Because gcc 12.1 x86_64 Linux gives size 2. https://godbolt.org/z/qMTnE96rs – Lundin Jun 01 '22 at 06:53
8

The other answers explain that zero-length arrays are a GCC extension and that standard C provides flexible array members instead, but none of them addresses your other questions.

from my understanding, structs do not have their elements necessarily in continuous locations.

Yes. The members of a struct are not necessarily stored in contiguous locations; there may be padding between them.

Why does the code in array_new allocate memory to data[0]? Why would it be legal to access then, say

array * a = array_new(3);
a->data[1] = 12;

?

Note that one restriction on a zero-length array is that it must be the last member of a structure. From this, the compiler knows that the struct may be followed by a variable-length payload and that more memory will be needed at run time.
But don't let the following reasoning confuse you: "since the zero-length array is the last member of the structure, the memory allocated for it is appended to the end of the structure, and since structs do not necessarily have their elements in contiguous locations, how can that memory be accessed?"

No, that is not a problem. The members of a structure are not necessarily contiguous, since there may be padding between them, but the extra memory is always accessed through the member data, whose offset already accounts for any padding. So padding has no effect here. The rule is §6.7.2.1/15:

Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared.
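The following sketch checks this reasoning at run time. It uses the standard flexible array member rather than the zero-length extension, and layout_holds is a hypothetical helper name; it verifies that every data[i] sits at offsetof(struct array, data) + i * sizeof(int) from the start of the allocation, regardless of any padding after size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct array {
    size_t size;
    int data[];
};

/* Members are laid out in declaration order (6.7.2.1/15). */
static_assert(offsetof(struct array, size) < offsetof(struct array, data),
              "size must come before data");

/* Returns 1 if all n elements are contiguous and start at the
   offset of data, 0 otherwise (or if allocation fails). */
int layout_holds(size_t n)
{
    struct array *a = malloc(sizeof *a + n * sizeof(int));
    if (!a)
        return 0;
    a->size = n;

    unsigned char *base = (unsigned char *)a;
    int ok = 1;
    for (size_t i = 0; i < n; i++) {
        /* Padding between size and data is already part of the offset. */
        unsigned char *expected =
            base + offsetof(struct array, data) + i * sizeof(int);
        ok = ok && ((unsigned char *)&a->data[i] == expected);
    }

    free(a);
    return ok;
}
```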


I've also seen around that this is just a feature of gcc and not defined by any standard. Is this true?

Yes. As the other answers already mentioned, zero-length arrays are not supported by standard C; they are a GCC extension. C99 introduced the flexible array member. An example from the C standard (§6.7.2.1):

After the declaration:

struct s { int n; double d[]; };

the structure struct s has a flexible array member d. A typical way to use this is:

int m = /* some value */;
struct s *p = malloc(sizeof (struct s) + sizeof (double [m]));

and assuming that the call to malloc succeeds, the object pointed to by p behaves, for most purposes, as if p had been declared as:

struct { int n; double d[m]; } *p;

(there are circumstances in which this equivalence is broken; in particular, the offsets of member d might not be the same).

haccks
  • 1
    "The allocated chunk of memory can go anywhere, just before or after the location of `size`"... That's not quite right. The memory has to go somewhere after `size` (possibly with padding), not before. If `size` was at the end of the variable-size object, indexing into `data` would have to go backwards. Given a `struct array *`, you have to be able access the `size` member without knowing the size. I think the C standard has enough requirements to make it impossible for an intentionally-weird implementation to satisfy all the rules. `char*` can alias anything, so you can look at the bytes. – Peter Cordes Apr 13 '16 at 05:49
  • 2
    The key things to understand are that there *may* be padding between `n` and `d`, but the elements of `d` will be contiguous (as they always are in any array) and `offsetof(struct s, d)` will be accurate. N.B. the *only* difference between the GCC extension and the C99 feature is the syntax. – zwol Apr 13 '16 at 14:22
  • @PeterCordes; I am very late to reply to your comment, but you are right actually. I corrected that part. – haccks Mar 08 '17 at 06:05
2

A more standard way would be to define your array with a data size of 1, as in:

struct array{
    size_t size;
    int data[1]; // <--- will work across compilers
};

Then use the offset of the data member (not the size of the array) in the calculation:

#include <stddef.h> /* offsetof */
#include <stdlib.h> /* malloc */

array *array_new(size_t size){
    array* a = malloc(offsetof(array, data) + size * sizeof(int));

    if(a){
        a->size = size;
    }

    return a;
}

This is effectively using array.data as a marker for where the extra data might go (depending on size).
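One consequence of this formula, raised in the comments below, is that the allocation can be smaller than sizeof the struct. The sketch here only computes sizes and never touches the array; struct array1, alloc_bytes and smaller_than_struct are hypothetical names chosen for the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* The one-element variant from this answer, renamed for the example. */
struct array1 {
    size_t size;
    int data[1];
};

/* The declared struct always has room for one element. */
static_assert(sizeof(struct array1) >=
                  offsetof(struct array1, data) + sizeof(int),
              "struct must hold at least data[0]");

/* Bytes requested by the offsetof-based formula for n elements. */
size_t alloc_bytes(size_t n)
{
    return offsetof(struct array1, data) + n * sizeof(int);
}

/* 1 if an allocation for n elements is smaller than the struct itself. */
int smaller_than_struct(size_t n)
{
    return alloc_bytes(n) < sizeof(struct array1);
}
```

For n equal to 0 the allocation is always smaller than sizeof(struct array1), so code that copies or compares sizeof(struct array1) bytes through such a pointer would read past the allocation.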

Neil
  • 1
    `array_new(0)`; now you have a pointer to a struct where there are less than `sizeof(struct)` bytes valid! – Yakk - Adam Nevraumont Apr 12 '16 at 17:54
  • @Yakk Indeed - a well known limitation of the old variant, which is the whole reason (I assume) that C99 introduced the new syntax. Apart from that little limitation, it works well and is standards compliant even in C90 as far as I know. – Voo Apr 12 '16 at 18:00
  • 5
    Accessing beyond the first element of `data` causes [undefined behaviour](http://stackoverflow.com/a/4105123/1505939). Your code would be relying on non-standard compiler extensions . I don't think "more standard" is a good description here. Using `0` instead of `1` turns a compilation error into silent undefined behaviour. – M.M Apr 12 '16 at 21:07
  • 4
    While there's frankly no good reason to use either, `1` is definitely worse than `0` because it removes the intent behind the code. `0` at least relies on a language extension that is defined in its own way. `1` is perfectly legal C code for a *fixed* size array and as a result communicates very little in practice (could easily be an artifact of strange metaprogramming or rigid style). – Alex Celeste Apr 12 '16 at 22:04
  • 1
    Guess what? if you use this idiom incorrectly, it causes undefined behaviour! The original question was about a professor using a compiler extension to allocate memory. My answer provides a way to do the same thing on all compilers. When programming at such low level, you need to know what you are doing. Feel free to access memory beyond that allocated. Nothing will prevent you. – Neil Apr 13 '16 at 10:56
  • @Yakk If you called array_new(0), then you must know not to use the pointer. Nothing will prevent you from accessing that pointer on the original either ! – Neil Apr 13 '16 at 10:57
  • @Neil `if (ptr1 && ptr2) memcmp(ptr1, ptr2, min(sizeof(*ptr1), sizeof(*ptr2)))` is usually safe, except for the return pointer of `array_new(0)`. Basically, the risk of `[1]` is that `sizeof` the structure can return a value *larger* than the size of the structure. – Yakk - Adam Nevraumont Apr 13 '16 at 13:07
  • @Neil: Nothing in the Standard would forbid a compiler from replacing any access to `a->data[n]` with an access to `a->data[0]`, since code which would attempt to access `a->data[n]` when `n` is non-zero would invoke Undefined Behavior, allowing the compiler to do anything it likes, *regardless of how much space is allocated for the structure*. – supercat Apr 13 '16 at 15:05
  • @Neil: The rule in question was included in part so that given `struct {int a[4],b;} foo;` a compiler given `foo.b=5; foo.a[n]=6; do_something_with(foo.b);` could assume that `foo.b` will still be 5 (unaffected by the write to `foo.a[n]`). There probably *should* have been an exemption for an array at the end of a structure, but the Standard never included one. – supercat Apr 13 '16 at 15:07
1

The way I used to do it is without a dummy member at the end of the structure: the size of the structure itself tells you the address just past it. Adding 1 to the typed pointer goes there:

header * p = malloc (sizeof (header) + buffersize);
char * buffer = (char*)(p+1);

As for structs in general, you can rely on the fields being laid out in order. Being able to match an imposed structure required by a file format, binary image, operating system call, or hardware is one advantage of using C. You have to know how the padding for alignment works, but the fields are in order and in one contiguous block.
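A fuller sketch of the p+1 technique follows, with the alignment concern from the comments turned into a compile-time check. The header layout, the elem_t typedef and the two function names are assumptions made for the example; with elem_t set to char the check is trivially satisfied, and it becomes meaningful once a wider element type is substituted:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header; the payload follows it in the same allocation. */
typedef struct {
    size_t length;
} header;

/* Element type of the payload (assumed char, as in the answer). */
typedef char elem_t;

/* p + 1 is only suitably aligned for elem_t if the header size is a
   multiple of the element alignment. */
static_assert(sizeof(header) % _Alignof(elem_t) == 0,
              "payload after header would be misaligned");

/* Allocate a header followed by buffersize payload bytes. */
header *header_new(size_t buffersize)
{
    if (buffersize > SIZE_MAX - sizeof(header))
        return NULL;
    header *p = malloc(sizeof(header) + buffersize);
    if (p)
        p->length = buffersize;
    return p;
}

/* The payload starts at the first byte past the header. */
elem_t *header_buffer(header *p)
{
    return (elem_t *)(p + 1);
}
```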

JDługosz
  • 1
    Code you have shown fails, if size of header is 3, and buffer is of type `uint32_t` which requires 4 byte alignment. You'd have to take care of the alignment manually, and the point of the flexible array member is that you don't have to do that. – user694733 Apr 13 '16 at 06:52
  • No it doesn't fail. The padding is part of the size of the structure (think about making an array of structures). I see that using a flex member (of type char, for example) lets you make the location of the array *not* aligned to a word boundary. IAC matching a known binary layout means understanding the compiler's alignment. – JDługosz Apr 13 '16 at 07:33
  • 1
    Consider `typedef struct { uint8_t a[3]; } header;`. On some system it will have size of 3 bytes, and alignment of 1, which means there is no padding. If you then do `uint32_t * buffer = (uint32_t*)(p+1);`, you'll get UB, because `(p+1)` will result in address which is not correctly aligned for 32-bit type. – user694733 Apr 13 '16 at 07:42
  • I see what you're getting at. I recall doing things where the header is followed by various possible record types, depending on actual values (e.g. a file format). The buffer, if declared as a simple single primitive type, needs to match the actual alignment of the structures you will actually use. You have to understand (or control) alignment no matter what you do. (Meanwhile, before the first ANSI standard we didn't have formal "UB" either. We just had whatever the compiler did, and hoped it wouldn't change in the next version.) – JDługosz Apr 13 '16 at 07:50