3

I'm trying to understand how type-punning works when it comes to storing a value into a member of a structure or union.

The Standard N1570 6.2.6.1(p6) specifies that

When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values.

So I interpreted it as follows: if we have an object to store into a member such that the size of the object equals sizeof(declared_type_of_the_member) + padding, the bytes related to padding will have unspecified values (even though those bytes were defined in the original object). Here is an example:

struct first_member_padded_t{
    int a;
    long b;
};

int a = 10;
struct first_member_padded_t s;
char repr[offsetof(struct first_member_padded_t, b)] = {0}; // some value (zeros used as a placeholder)
memcpy(repr, &a, sizeof(a));
memcpy(&(s.a), repr, sizeof(repr));
s.b = 100;
printf("%d%ld\n", s.a, s.b); //prints 10100

On my machine sizeof(int) = 4, offsetof(struct first_member_padded_t, b) = 8.

Is the behavior of printing 10100 well defined for such a program? I think that it is.

Some Name
  • 1
    I don't see anything that is not well defined here. Are you concerned that the padding values will get overwritten with some other indeterminate values? – Eugene Sh. Mar 11 '19 at 13:57
  • you are checking that the padding is after the field _a_ (so between _a_ and _b_); in case the padding is (strangely) placed before _a_, your code does not set `s.a` to 10. Does the standard explicitly specify where the padding is added (sorry, too lazy to check ^^)? – bruno Mar 11 '19 at 14:08
  • @EugeneSh. _Are you concerned that the padding values will get overwritten with some other indeterminate values?_ Exactly. It is specified that the bytes have indeterminate values, but there are bytes corresponding to padding in `repr[offsetof(struct first_member_padded_t, b)]` – Some Name Mar 11 '19 at 14:08
  • Code looks fishy to me. If there is padding between `s.a` and `s.b`, you are modifying memory outside of `s.a` with a pointer derived from `s.a`. Considering how tricky the rules regarding these are, this code is asking for trouble. – user694733 Mar 11 '19 at 14:11
  • @user694733 `you are modifying memory outside of s.a` I did that deliberately to check the rule I cited. I expected that the padding bytes would either be ignored or take unspecified values. – Some Name Mar 11 '19 at 14:16
  • 3
    Well, I see what @user694733 is talking about, it is not about the cited rule, but about writing out of bounds of the `s.a` object, which is undefined. – Eugene Sh. Mar 11 '19 at 14:18
  • @SomeName your code does allow to check if the padding bytes have or not an unspecified value. It checks that the padding is after the field _a_ (else `s.a` is not 10). Your first _memcpy_ sets the first 4 bytes dedicated to the field _a_, and the next 4 bytes of padding are undetermined; the second _memcpy_ copies, starting at the offset of the field _a_, the 4 bytes holding 10 and then the next 4 bytes having an undefined value; then the _printf_ accesses the field _a_. You never read the padding in _s_ – bruno Mar 11 '19 at 14:20
  • @EugeneSh. I agree that writing out of bounds is undefined. So the rule is about the case where the member is itself a structure or union: if we store a value in the member, the bytes corresponding to any padding will take unspecified values. Is that correct? – Some Name Mar 11 '19 at 14:31
  • @bruno This seems to be undefined anyway as user694733 mentioned. I stored a value out of object's bound... – Some Name Mar 11 '19 at 14:32
  • 1
    you store out of the field _s.a_ but not out of _s_ if the padding is after the field _a_, all the question is about where the padding is – bruno Mar 11 '19 at 14:35
  • @bruno There cannot be padding before the first member of a `struct`. A pointer to the `struct` can be cast to a pointer to its first member. – alx - recommends codidact Mar 11 '19 at 14:40
  • @bruno But the lvalue I used to store the value into the object has type `int`, which means `memcpy` went out of its bounds. – Some Name Mar 11 '19 at 14:40
  • @CacahueteFrito yes I know, I was not clear enough. I was supposing a more general case, for instance `int f1; int f2; int a; long b`, to have padding when the field is not the first. Of course the simple way for the compiler is to add the padding just before the field requiring it, but is that a rule in the standard? – bruno Mar 11 '19 at 14:43
  • 1
    @EugeneSh.: While the `memcpy` writes out of bounds of `s.a`, it is within bounds of `s`, and the C standard permits accessing the bytes of an object (which `s` is) as an array of character type, including via `memcpy`. – Eric Postpischil Mar 11 '19 at 14:43
  • @EricPostpischil But we used the lvalue of type `int` to store the value of an object whose representation exceeds `sizeof(int)`. Doesn't that cause undefined behavior? – Some Name Mar 11 '19 at 14:46
  • 4
    @SomeName: No `int *` was used to store the value. An `int *` was used in an expression that was an argument to `memcpy`, but the semantics of function calls cause this to be converted to `void *`. Then, per the specification of `memcpy`, it copies characters from one place to another. There is some pedantic complication in whether the `int *` that is `&s.a` may be used to access other bytes of `s`. I am drafting an answer that may address this. – Eric Postpischil Mar 11 '19 at 14:49
  • UB ? Hmmm... `memset(&s, 0, sizeof s)` is valid. So I doubt `memset(&s.a, 0, sizeof s)` is UB. – Support Ukraine Mar 11 '19 at 14:53
  • @EricPostpischil While it is true that you can `memcpy` a `struct`, it should be done via its pointer `&s`, and not a pointer to a member, as far as I know. Pointers are very picky. I'll check that. – alx - recommends codidact Mar 11 '19 at 14:54
  • @4386427 Note that the question is about the line where the size passed is not the same as the one of the pointed object. – Eugene Sh. Mar 11 '19 at 14:55
  • @bruno The padding can go wherever, except before the first member. No rules. – alx - recommends codidact Mar 11 '19 at 14:55
  • @EugeneSh. yes, I noticed. But as long as the size is less than or equal to `sizeof s` I doubt it can be UB.... but I can't quote the standard... and I could be wrong, of course :-) – Support Ukraine Mar 11 '19 at 14:57
  • @4386427 But in the code it is not less but greater (in case of padding) – Eugene Sh. Mar 11 '19 at 14:58
  • @EugeneSh. `s.a` is less but `s` is greater. The layout is int (4 bytes), padding (4 bytes), long (? bytes), as OP hasn't told us the size of long. But sizeof `s` is at least 12 (probably 16 bytes) and the code only copies 8 bytes. – Support Ukraine Mar 11 '19 at 15:05
  • 2
    @4386427 The line in question is `memcpy(&(s.a), repr, sizeof(repr));`. Here `repr` is the size of `s.a` plus the padding of `s.a`. So `sizeof(repr)` is greater than `sizeof(s.a)`. So I guess the **lawyer** question here is whether it is legal to write outside an object known to be a part of a larger object. – Eugene Sh. Mar 11 '19 at 15:06
  • @EugeneSh. yes - agree - I just wrote the same. But `sizeof(repr)` is less than `sizeof(s)` so the code is not writing outside `s`. – Support Ukraine Mar 11 '19 at 15:09
  • Possible duplicate of [Is it legal to access struct members via offset pointers from other struct members?](https://stackoverflow.com/questions/51737910/is-it-legal-to-access-struct-members-via-offset-pointers-from-other-struct-membe) – alx - recommends codidact Mar 12 '19 at 09:11

3 Answers

3

What the memcpy Calls Do

The question is poorly posed. Let’s look first at the code:

char repr[offsetof(struct first_member_padded_t, b)] = {0}; // some value (zeros used as a placeholder)
memcpy(repr, &a, sizeof(a));
memcpy(&(s.a), repr, sizeof(repr));

First note that repr is initialized, so all the elements in it are given values.

The first memcpy is fine—it copies the bytes of a into repr.

If the second memcpy were memcpy(&s, repr, sizeof repr);, it would copy bytes from repr into s. This would write bytes into s.a and, due to the size of repr, into any padding between s.a and s.b. Per C 2018 6.5 7 and other parts of the standard, it is permitted to access the bytes of an object (and “access” means both reading and writing, per 3.1 1). So this copy into s is fine, and it results in s.a taking on the same value that a has.
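A minimal sketch of the `&s` variant just described, assuming the structure from the question. The helper name `store_a_via_struct` is hypothetical, and `repr` is zero-filled only so that every copied byte has a known value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct first_member_padded_t {
    int a;
    long b;
};

/* Hypothetical helper: copies the bytes of `value`, plus any padding bytes
   that precede member b, into the structure through a pointer to the
   whole structure, then returns the resulting s.a. */
static int store_a_via_struct(int value)
{
    struct first_member_padded_t s;
    /* Zero-filled, so every byte copied below has a known value. */
    char repr[offsetof(struct first_member_padded_t, b)] = {0};

    /* b follows a, so repr is at least as large as an int. */
    assert(sizeof repr >= sizeof value);

    memcpy(repr, &value, sizeof value);  /* the bytes of the int */
    memcpy(&s, repr, sizeof repr);       /* s.a and any padding after it */
    s.b = 100;
    return s.a;
}
```

Calling `store_a_via_struct(10)` is expected to return 10, because the bytes of `s.a` are an unmodified copy of the bytes of an `int` holding 10.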

However, the memcpy uses &(s.a) rather than &s. It uses the address of s.a rather than the address of s. We know that converting &s.a to a pointer to a character type would allow us to access the bytes of s.a (6.5 7 and more) (and passing it to memcpy has the same effect as such a conversion, as memcpy is specified to have the effect of copying bytes), but it is not clear it allows us to access other bytes in s. In other words, we have a question of whether we can use &s.a to access bytes other than those in s.a.

6.7.2.1 15 tells us that, if a pointer to the first member of a structure is “suitably converted,” the result points to the structure. So, if we converted &s.a to a pointer to struct first_member_padded_t, it would point to s, and we can certainly use a pointer to s to access all the bytes in s. Thus, this would also be well defined:

memcpy((struct first_member_padded_t *) &s.a, repr, sizeof repr);

However, memcpy(&s.a, repr, sizeof repr); only converts &s.a to void * (because memcpy is declared to take a void *, so &s.a is automatically converted during the function call) and not to a pointer to the structure type. Is that a suitable conversion? Note that if we did memcpy(&s, repr, sizeof repr);, it would convert &s to void *. 6.2.5 28 tells us that a pointer to void has the same representation as a pointer to a character type. So consider these two statements:

memcpy(&s.a, repr, sizeof repr);
memcpy(&s,   repr, sizeof repr);

Both of these statements pass a void * to memcpy, and those two void * have the same representation as each other and point to the same byte. Now, we might interpret the standard pedantically and strictly so that they are different in that the latter may be used to access all the bytes of s and the former may not. Then it is bizarre that we have two necessarily identical pointers that behave differently.

Such a severe interpretation of the C standard seems possible in theory—the difference between the pointers could arise during optimization rather than in the actual implementation of memcpy—but I am not aware of any compiler that would do this. Note that such an interpretation is at odds with section 6.2 of the standard, which tells us about types and representations. Interpreting the standard so that (void *) &s.a and (void *) &s behave differently means that two things with the same value and type may behave differently, which means a value consists of something more than its value and type, which does not seem to be the intent of 6.2 or the standard generally.
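The claim that the two `void *` values point to the same byte can be checked with a small sketch. The helper name `first_member_shares_address` is an assumption made for illustration. Note that this only shows the pointers compare equal; it says nothing about how an optimizer may treat their provenance.

```c
#include <assert.h>
#include <stddef.h>

struct first_member_padded_t {
    int a;
    long b;
};

/* Hypothetical helper: returns 1 if a pointer to the first member and a
   pointer to the whole structure compare equal once both are void *. */
static int first_member_shares_address(void)
{
    struct first_member_padded_t s = {0, 0};
    void *via_member = &s.a;  /* implicit conversion, as in a memcpy call */
    void *via_struct = &s;

    /* No padding is allowed before the first member. */
    assert(offsetof(struct first_member_padded_t, a) == 0);

    return via_member == via_struct;
}
```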

Type-Punning

The question states:

I'm trying to understand how type-punning works when it comes to storing a value into a member of a structure or union.

This is not type-punning as the term is commonly used. Technically, the code does access s.a using lvalues of a different type than its definition (because it uses memcpy, which is defined to copy as if with character type, while the defined type is int), but the bytes originate in an int and are copied without modification, and this sort of copying the bytes of an object is generally regarded as a mechanical procedure; it is done to effect a copy and not to reinterpret the bytes in a new type. “Type-punning” usually refers to using different lvalues for the purpose of reinterpreting the value, such as writing an unsigned int and reading a float.
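To make the distinction concrete, here is a hedged sketch contrasting a mechanical byte copy with type-punning in the usual sense. Both function names are hypothetical, and the punning half assumes a 32-bit `float` (IEEE 754 on typical platforms).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mechanical copy: the bytes of an int are moved into another int.
   Nothing is reinterpreted. */
static int copy_int_bytes(int value)
{
    int result;
    memcpy(&result, &value, sizeof result);
    return result;
}

/* Type-punning: store through one union member, read through another,
   so the same bytes are reinterpreted in a new type. */
static float pun_bits_as_float(uint32_t bits)
{
    union {
        uint32_t u;
        float f;
    } pun;

    /* Assumption: float is 32 bits wide on this implementation. */
    assert(sizeof pun.u == sizeof pun.f);

    pun.u = bits;
    return pun.f;
}
```

On an implementation with IEEE 754 single-precision `float`, `pun_bits_as_float(0x3F800000u)` yields 1.0f, while `copy_int_bytes(10)` simply returns 10.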

In any case, type-punning is not really the subject of the question.

Values In Members

The title asks:

What values can we store in a struct or union members?

This title seems off from the content of the question. The title question is easily answered: The values we can store in a member are those values the member’s type can represent. But the question goes on to explore the padding between members. The padding does not affect the values in the members.

Padding Takes Unspecified Values

The question quotes the standard:

When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values.

and says:

So I interpreted it as follows: if we have an object to store into a member such that the size of the object equals sizeof(declared_type_of_the_member) + padding, the bytes related to padding will have unspecified values…

The quoted text in the standard means that, if the padding bytes in s have been set to some values, as with memcpy, and we then do s.a = something;, then the padding bytes are no longer required to hold their previous values.

The code in the question explores a different situation. The code memcpy(&(s.a), repr, sizeof(repr)); does not store a value in a member of the structure in the sense meant in 6.2.6.1 6. It is not storing into either of the members s.a or s.b. It is copying bytes in, which is a different thing from what is discussed in 6.2.6.1.

6.2.6.1 6 means that, for example, if we execute this code:

char repr[sizeof s] = { 0 };
memcpy(&s, repr, sizeof s); // Set all the bytes of s to known values.
s.a = 0; // Store a value in a member.
memcpy(repr, &s, sizeof s); // Get all the bytes of s to examine them.
for (size_t i = sizeof s.a; i < offsetof(struct first_member_padded_t, b); ++i)
    printf("Byte %zu = %d.\n", i, repr[i]);

then it is not necessarily true that all zeros will be printed—the bytes in the padding may have changed.
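One practical consequence, sketched here as an illustration rather than as part of the original argument: because padding bytes may change whenever a member is stored, comparing two structures byte by byte (for example with memcmp) can report a difference even when all members are equal. A member-wise comparison avoids reading the padding at all. The helper name `same_members` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct first_member_padded_t {
    int a;
    long b;
};

/* Hypothetical helper: compares only the members, never the padding. */
static bool same_members(const struct first_member_padded_t *x,
                         const struct first_member_padded_t *y)
{
    assert(x != NULL && y != NULL);
    return x->a == y->a && x->b == y->b;
}
```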

Eric Postpischil
  • As you say, both pointers will have the exact same representation, but still they don't point to the same thing, so as you also said, the compiler may do optimizations based on that. – alx - recommends codidact Mar 11 '19 at 15:39
  • 2
    Also, in the more general case of an element of the `struct` that isn't its first element, it will be more clear that it is UB. – alx - recommends codidact Mar 11 '19 at 15:58
  • @CacahueteFrito: Re “they don't point to the same thing”: They do point to the same thing; `(char *) &s` and `(char *) &s.a` must both point to the first byte of `s.a`, and so must `(void *) &s` and `(void *) &s.a`. If the standard is interpreted so that there is a difference between them, that difference must arise from their provenance and not from what they point to. It implies that a value in C consists of more than its type and its value, which is at odds with the Concepts section (6.2) of the C standard. – Eric Postpischil Mar 11 '19 at 16:10
  • 2
    @CacahueteFrito: Using a member other than the first member does not change the analysis. We would have the same situation with `(void *) &s.b` and `(void *) ((char *) &s + offsetof(struct first_member_padded_t, b))`: the first points to the same location as the second, and they are the same type with the same representation, but the first derives from a pointer to a member while the second derives from a pointer to the structure. – Eric Postpischil Mar 11 '19 at 16:13
  • _Now, we might interpret the standard pedantically and strictly so that they are different in that the latter may be used to access all the bytes of s and the former may not._ But the Standard makes an informative note that `48) The same representation and alignment requirements are meant to imply interchangeability as arguments to functions, return values from functions, and members of unions.` So if we say that `&s` may be used to access all the members (in that case via `memcpy`), but `&(s.a)` may not, it would contradict the informative note. Wouldn't it? – Some Name Mar 12 '19 at 00:01
  • _Using a member other than the first member does not change the analysis._ Could you please elaborate on that? As per `7.24.2.1(p1)` _The `memcpy` function copies n characters from the **object** pointed to by s2 into the **object** pointed to by s1_. When considering the first member we can rely on `6.7.2.1(p15)` and use the pointer to it and a pointer to the whole structure interchangeably. When considering the second member we do not have such interchangeability with the "pointer to the whole struct" object, so we are restricted to considering the member object on its own. – Some Name Mar 12 '19 at 00:38
  • 1
    @SomeName: `(void *) ((char *) &s + offsetof(struct first_member_padded_t, b))` points to the first byte of `s.b`. It is a pointer derived from converting `&s` to `char *`, and we are free to use that pointer to access `s` as if it were an array of bytes, moving up and down the bytes of `s` as desired. Now, `(void *) &s.b` is also a pointer to the first byte of `s.b`, and it is a `void *` just as the first pointer is. So we have two pointers of the same type and representation that point to the same byte, but we cannot, under the pedantic interpretation, use them interchangeably. – Eric Postpischil Mar 12 '19 at 00:42
  • @EricPostpischil @SomeName I'll continue with using the array of arrays similarity (both are big objects that hold smaller objects and have the same position in RAM): Let's have `int a[2][2][2][2];`. Then `(void *)a == (void *)(a[0][0][0])` is `true`, but I think (and there's already an old question on that, but I can't find it) you can't `memcpy(dest, (a[0][0][0]), 8);` – alx - recommends codidact Mar 12 '19 at 08:22
  • @CacahueteFrito: Have you cited any part of the C standard or given any reasoning to support that assertion? – Eric Postpischil Mar 12 '19 at 11:39
  • @EricPostpischil It's the accessing an array out of bounds thing. However, which array is the one to be considered (the inner or the outer) is not written in the Standard, so can't quote it. I'll refer to the answer here (I finally found it this morning): https://stackoverflow.com/a/51738580/6872717 – alx - recommends codidact Mar 12 '19 at 13:09
  • @EricPostpischil The section 6.5 6 specifies _If a value is copied into an object having no declared type using `memcpy` or `memmove`, or is copied as an array of character type_. I'm confused by "copied as an array of character type": does it mean we first copy the object representation to some `char[n]` (as in my case) and then copy it on to the final destination? – Some Name Mar 12 '19 at 13:55
  • @CacahueteFrito: The rules about pointer arithmetic out of bounds are distinct from the rules about accessing objects. Once the pointer is converted from a pointer to an array element to some other type, the rules about pointer arithmetic regarding that particular array are irrelevant. The issues about whether a pointer to a subobject may be used to access an object discussed in the link you provide are the same as discussed in this answer. – Eric Postpischil Mar 12 '19 at 17:27
  • 1
    @SomeName: “Copied as an array of character type” essentially means copied byte-by-byte using a pointer to a character type. However, that sentence about things with no declared type is for allocated memory. The example in your question uses an object with a declared type, the object `s` with declared type `struct first_member_padded_t`. – Eric Postpischil Mar 12 '19 at 17:30
1

In many implementations of the language the C Standard was written to describe, an attempt to write an N-byte object within a struct or union would affect the value of at most N bytes within the struct or union. On the other hand, on a platform which supported 8-bit and 32-bit stores, but not 16-bit stores, if someone declared a type like:

struct S { uint32_t x; uint16_t y;} *s;

and then executed s->y = 23; without caring about what happened to the two bytes following y, it would be faster to perform a 32-bit store to y, blindly overwriting the two bytes following it, than to perform a pair of 8-bit writes to update the upper and lower halves of y. The authors of the Standard didn't want to forbid such treatment.
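The "two bytes following y" are trailing padding in struct S. A small sketch, with the hypothetical helper name `bytes_after_y`, shows how one might measure them on a given implementation. The result is commonly 2, but that depends on the ABI and is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct S {
    uint32_t x;
    uint16_t y;
};

/* Hypothetical helper: number of bytes in struct S that lie after
   member y, i.e. the trailing padding a wide store to y might overwrite. */
static size_t bytes_after_y(void)
{
    size_t end_of_y = offsetof(struct S, y) + sizeof(uint16_t);

    assert(end_of_y <= sizeof(struct S));
    return sizeof(struct S) - end_of_y;
}
```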

It would have been helpful if the Standard had included a means by which implementations could indicate whether writes to structure or union members might disturb storage beyond them, and programs that would be broken by such disturbance could refuse to run on implementations where it could occur. The authors of the Standard, however, likely expected that programmers who would be interested in such details would know what kinds of hardware their program was expected to run on, and thus know whether such memory disturbances would be an issue on such hardware.

Unfortunately, modern compiler writers seem to interpret freedoms that were intended to assist implementations for unusual hardware as an open invitation to get "creative" even when targeting platforms that could process code efficiently without such concessions.

supercat
0

As @user694733 said, in case there is padding between s.a and s.b, memcpy() is accessing a memory area that cannot be accessed through &s.a. A simpler analogue, without the struct:

int a = 1;
int b;
b = *((char *)&a + sizeof(int));

This is Undefined Behaviour, and it is basically what is happening inside memcpy().

  • 3
    I disagree, this is not the same case: your variables are not fields of a struct, so you cannot apply _offsetof_, can you? – bruno Mar 11 '19 at 14:39
  • You cannot, but the fact that a pointer to a variable cannot be used (by pointer arithmetic) to access memory that doesn't belong to that variable still holds. This example shows the same problem, but taking away the `struct` so that it is simpler to understand. – alx - recommends codidact Mar 11 '19 at 14:44
  • You could reproduce the same UB with an array of arrays: `int a[2][2]; /*...*/ x = *(&(a[0][0]) + 2);` is also UB. – alx - recommends codidact Mar 11 '19 at 14:47
  • I don't think it's the same case with and without `struct` – Support Ukraine Mar 11 '19 at 14:49
  • Unless someone finds a difference, both an array of arrays and a `struct` should behave the same in this case. In the case of an array of arrays, if you have a pointer to an element of one of the arrays, you can access any element in that array, but you can't step to the next one. – alx - recommends codidact Mar 11 '19 at 15:20
  • 1
    @CacahueteFrito: There are many situations where there would be exactly one useful way for a program to behave, but the Standard makes no effort to forbid conforming implementations from behaving in stupidly-useless fashion. In the language the Standard was written to describe, the `memcpy` function could be used on any combination of objects within a contiguously-allocated region of storage, without regard for the bounds of any internal arrays. Any failure of the Standard to accommodate such cases would be differences between the language it was written to describe and the one it does. – supercat Mar 11 '19 at 16:34
  • @supercat I agree with you that it makes no sense to disallow that, unless that would allow for optimizations, which are welcome, and I would accept it then. But still it is good to know what the Standard (good or not) allows, because then when you encounter a bug using `-O3` you can tell, aha it's that crazy thing the Standard is crazy about, and not lose a lot of time debugging code that looks OK. Also it is good so that future revisions of the Standard can fix that. (If nobody knew that, how would you fix it?) – alx - recommends codidact Mar 12 '19 at 08:30
  • @CacahueteFrito: According to the authors of the Standard, two fundamental parts of the Spirit of C are "Trust the programmer" and "Don't prevent the programmer from doing what needs to be done". Because different tasks require the ability to do different things, the authors wanted to encourage variety among implementations intended for various purposes. If an implementation claims to be suitable for low-level programming without requiring the use of non-standard syntax (which used to be a common objective until for some bizarre reason it became unfashionable)... – supercat Mar 12 '19 at 14:23
  • ...that would imply that it should reliably handle constructs that may not be needed for other types of programming, and failure to do so should be considered a bug *without regard for whether or not it would make an implementation non-conforming*. It's too bad there's no common terminology to distinguish the language the Standard was written to describe, versus the gutted shell of a language that the maintainers of clang and gcc view it as defining. – supercat Mar 12 '19 at 14:29
  • @CacahueteFrito: As for optimizations being "welcome", that depends on whether they are consistent with the Spirit of C as applied to the task at hand. Often, some parts of the Standard or an implementation's documentation describe the behavior of an action, but some other part describes an overlapping class of actions as invoking Undefined Behavior. Optimizations based upon such actions being UB may be welcome in cases where such actions would be useless, but counter-productive in cases where such actions would represent the most effective way to accomplish the task at hand. – supercat Mar 12 '19 at 15:48