
Given this declaration:

struct s1 {
    int type;
    union u1 {
        char c;
        int i[10000];
    } u;
} s;

I'm wondering whether we can allocate less memory for the struct than sizeof(struct s1) would suggest:

struct s1 * s_char = malloc(sizeof(int)+sizeof(char)); 

On one hand, this seems intuitive: if one knows that the code will never reach past the char s_char->u.c, then allocating the whole sizeof(struct s1) looks like a big waste.
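To make the sizes concrete, here is a minimal sketch that measures the layout. The helper names `s1_char_extent` and `s1_naive_extent` are hypothetical, and the sketch only compares sizes; it says nothing about whether underallocating is defined behavior.

```c
#include <stddef.h>  /* offsetof, size_t */

struct s1 {
    int type;
    union u1 {
        char c;
        int i[10000];
    } u;
};

/* Hypothetical helper: bytes from the start of struct s1 up to and
   including u.c. offsetof accounts for any padding between type and u. */
static size_t s1_char_extent(void)
{
    return offsetof(struct s1, u) + sizeof(char);
}

/* Hypothetical helper: the naive sum used in the malloc call above,
   which assumes there is no padding between type and u. */
static size_t s1_naive_extent(void)
{
    return sizeof(int) + sizeof(char);
}
```

On any implementation, `s1_char_extent()` is at least `s1_naive_extent()`, and both are far smaller than `sizeof(struct s1)`.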

On the other hand, I understand the C11 standard to be against this, but it is never spelled out. The two passages I have found that can be read as being against it are these:

  • if the struct somehow assumes that its full size has been allocated, this opens the door to Undefined Behavior: a new object can be allocated just after s_char but still inside of the "real" sizeof(struct s1) bytes assumed by the struct, which would then trigger item 54 of Annex J.2 of the C11 standard: UB if

An object is assigned to an inexactly overlapping object or to an exactly overlapping object with incompatible type (6.5.16.1).

  • 6.2.6.1 paragraph 7:

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.

But this can also be understood as either the standard refusing to deal with what happens with those values, or saying that those values can actually be expected to change arbitrarily.

In summary, there is an intuition of "but we're only using 5 bytes!" vs language-lawyeristic caution - not proof. And my question is: is there any more evidence for any side? More concretely: is it ever OK to underallocate memory for a union or any other data structure?

Again: intuition is what brought the problem, I don't want more of it. I am looking for something reasoned on reliable facts, like the C11 standard and/or compiler information. Also, I already know that the standard way to do this is to replace the struct-with-union with a union-of-structs sharing a Common Initial Sequence, though that is also not without risks. But that is tangential here.
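For reference, a minimal sketch of the union-of-structs alternative mentioned above. All names here (`struct small`, `struct large`, `union any`, `kind_of`, `make_small`, and the `KIND_*` tags) are hypothetical. The sketch shows only two things: reading the shared `type` member through the union, and allocating a `struct small` that is accessed solely as a `struct small`. Treating that small allocation as a `union any` would raise the same underallocation question again.

```c
#include <stdlib.h>  /* malloc, free */

enum { KIND_CHAR = 0, KIND_INTS = 1 };  /* hypothetical type tags */

/* Both structs begin with the same member: a common initial sequence. */
struct small { int type; char c; };
struct large { int type; int i[10000]; };

union any {
    struct small s;
    struct large l;
};

/* Reads the tag through the union, whichever struct was stored last. */
static int kind_of(const union any *u)
{
    return u->s.type;
}

/* Allocates exactly sizeof(struct small); the object is only ever
   accessed through a struct small pointer. Returns NULL on failure. */
static struct small *make_small(char c)
{
    struct small *p = malloc(sizeof *p);
    if (p != NULL) {
        p->type = KIND_CHAR;
        p->c = c;
    }
    return p;
}
```

The caller owns the result of `make_small` and releases it with `free`.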

M.M
hmijail
  • I think the J2 issue relates to situations where such objects appear on opposite sides of the same assignment operator. On a 32-bit machine, given `uint64_t *p,*q;`, the assignment `*p=*q;` could write the low half of p before reading the high half of q, or vice versa. If they overlap precisely, that would be fine, but if they overlap in any other way that would be trouble. – supercat Nov 13 '16 at 04:45
  • I'm fairly certain you can allocate as little or as much as you want. Since a pointer to a union is also a pointer to each of its members, `u.c` is guaranteed to be the first byte of `u.i[0]`, so allocating just enough memory to only accommodate `type` and `u.c` should be fine. Just don't access `u.i` unless you assign a value to it, and you're fine. You also assume no padding exists between `type` and `u`, so `sizeof(int)+sizeof(char)` might not be enough. Instead you'd want `offsetof(struct s1, u)+1` for something like this –  Nov 13 '16 at 06:24
  • @supercat, that sounds concise enough that it would be explicitly noted in the standard, doesn't it? To me, the issue sounds way more generic, as the cited text itself. – hmijail Nov 13 '16 at 11:40
  • @ChronoKitsune, that circles back to the question: anything in the standard to back that intuition? I can only find interpretations against it. – hmijail Nov 13 '16 at 14:10
  • @hmijail In general, I think you'd be better off using `i[1]` and allocating only what you need rather than underallocating `i[10000]`. I unfortunately don't think what you're wanting can lead to anything other than UB, but I'm also not 100% certain. My recommendation would be type punning two different structure types with a common `int type` member at the beginning; you'd use a union of the two structure types to access things as you wish. Is this question out of curiosity, or do you have a more practical reason? –  Nov 13 '16 at 21:52
  • @ChronoKitsune, are you referring to a "struct hack"? I don't think either that or a proper Flexible Array Member will work in a union. Anyway, you're describing what I already wrote in my last paragraph: union-of-structs with a Common Initial Sequence. I already know I can do that, but given that it also has its own problems, I want to first make sure that underallocating is out of the question. So that is what I am asking. I'll edit to make it clearer. – hmijail Nov 14 '16 at 08:27
  • I'm fairly certain it is out of the question... Padding makes this more difficult than it needs to be, and a flexible array member can't be used in a union as you said. There might be another solution, but I don't know what it is, if it exists. –  Nov 14 '16 at 10:46
  • This issue isn't specific to unions, e.g. `struct S { int a, b; } *p = malloc(sizeof(int)); p->a = 5;` – M.M Nov 14 '16 at 22:03
  • @M.M, what is your point? – hmijail Nov 14 '16 at 22:37
  • @hmijail the problem can be broken down by considering the simpler case first – M.M Nov 14 '16 at 22:43
  • Sorry, I still don't see what your example brings to the table, given that `struct` and `union` are totally different cases regarding memory semantics. – hmijail Nov 14 '16 at 23:11

2 Answers


Looks like the GCC maintainers think that underallocating memory for a union causes UB, as seen (kinda tangentially) in this bug report. There is no standard-based explanation, but still this implies that the compiler can't be expected to support it, so it makes no sense to look further.

For completeness, this Defect Report against a rule of the CERT C Secure Coding Rules shows this text:

A call to a standard memory allocation function taking a size integer argument n and presumed to be intended for type T * shall be diagnosed when n < sizeof(T).

The WG14 group, responsible for the C standard, is also involved with the CSCR. This is the closest I have managed to get to something relatively standard-based.
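As an illustration only, the condition in the quoted rule can be applied to the allocation from the question. The macro name `WOULD_BE_DIAGNOSED` is hypothetical and is not part of any analyzer or standard API; it just evaluates n < sizeof(T).

```c
#include <stddef.h>  /* size_t */

struct s1 {
    int type;
    union u1 {
        char c;
        int i[10000];
    } u;
};

/* Hypothetical check mirroring the rule's condition: nonzero when an
   allocation of n bytes intended for type T has n < sizeof(T). */
#define WOULD_BE_DIAGNOSED(T, n) ((size_t)(n) < sizeof(T))
```

With this, `WOULD_BE_DIAGNOSED(struct s1, sizeof(int) + sizeof(char))` is nonzero, while `WOULD_BE_DIAGNOSED(struct s1, sizeof(struct s1))` is zero.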

hmijail
  • GCC interprets a subset of the language defined by the C Standard. There is nothing in the Standard, for example, that specifies that the Common Initial Sequence rule only applies to accesses made *directly* via lvalue of union type, versus those made by pointers to members performed from code where the complete union type declaration is visible, but the authors of gcc claim that the standard is insufficiently clear to justify them honoring the clear intent of the language therein. GCC compatibility is thus a separate issue from standard-compliance. – supercat Nov 14 '16 at 23:51
  • If the compiler doesn't comply with the standard, your compliance with the standard affords you precious little. – hmijail Nov 15 '16 at 00:04
0

You seem to be confusing yourself by looking for an answer that's outside the scope of the C language. As @M.M says, the issue isn't specific to unions. If you don't allocate sufficient memory for an object, and you write to a part of that object outside the allocated memory, don't be surprised if things go pear-shaped later.

The C language lets you define a pointer to any known type, even to an incomplete type whose only "known" property is its name. It's up to you, the programmer, when you assign a value to that pointer, to ensure the provided value is valid. C will treat the pointed-to area as an object of the pointer's type. That's a convenience to you, to help you navigate the memory.

Keep in mind that "where the memory comes from" is outside the scope of the language. It could come from malloc, but it could also come from sbrk(2) or mmap(2). Or it could just be a constant address, like the video RAM buffer in the original IBM PC. If you misdescribe the thing pointed to, you can't expect the compiler to come to your rescue.

James K. Lowden
  • " If you don't allocate sufficient memory for an object, **and you write to a part of that object outside the allocated memory**, don't be surprised if things go pear shaped later." -> But that's clear. My question specifies: **if you are only ever using 5 bytes, do you need to allocate all the rest?** "you can't expect the compiler to come to your rescue." -> I rather want to make sure that the compiler won't ambush me, as is typical when talking about UB and standard-legalese. – hmijail Nov 15 '16 at 00:02
  • There's no rule saying you have to allocate the rest, or allocate anything. The compiler makes no attempt to verify the pointed-to object. If you allocate 5 bytes and only ever use 5 bytes, you're good to go. Not that I'd recommend it, or ever do it that way. – James K. Lowden Nov 15 '16 at 00:25
  • "The compiler makes no attempt to verify the pointed-to object." -> No, but as has been the case before, the compiler can make *assumptions*, as I mentioned in my question. And when you break the assumption, you have UB. That's why I am asking for standard-based reasoning, not intuition. – hmijail Nov 15 '16 at 00:29