11

Here is the code:

unsigned int a;            // a is indeterminate
unsigned long long b = 1;  // b is initialized to 1
std::memcpy(&a, &b, sizeof(unsigned int));
unsigned int c = a;        // Is this not undefined behavior? (Implementation-defined behavior?)

Is a guaranteed by the standard to be a determinate value where we access it to initialize c? Cppreference says:

void* memcpy( void* dest, const void* src, std::size_t count );

Copies count bytes from the object pointed to by src to the object pointed to by dest. Both objects are reinterpreted as arrays of unsigned char.

But I don't see anywhere in cppreference that says if an indeterminate value is "copied to" like this, it becomes determinate.

From the standard, it seems it's analogous to this:

unsigned int a;            // a is indeterminate
unsigned long long b = 1;  // b is initialized to 1
auto* a_ptr = reinterpret_cast<unsigned char*>(&a);
auto* b_ptr = reinterpret_cast<unsigned char*>(&b);
a_ptr[0] = b_ptr[0];
a_ptr[1] = b_ptr[1];
a_ptr[2] = b_ptr[2];
a_ptr[3] = b_ptr[3];
unsigned int c = a;        // Is this undefined behavior? (Implementation defined behavior?)

It seems like the standard leaves room for this to be allowed, because the type aliasing rules allow for the object a to be accessed as an unsigned char this way. But I can't find something that says this makes a no longer indeterminate.

Joel
  • `int a=1; long long b;` followed by a `memcpy` of `a` to `b` may be a more interesting case because in practice, only half of `b` will likely be initialized. – jww Jul 02 '19 at 00:48
  • @jww Good call, that's what I intended. Perhaps obvious, perhaps not, the idea was that endianness would obviously affect what happened, so even if there isn't undefined behavior, there is for sure implementation defined behavior. I'll change it! – Joel Jul 02 '19 at 00:51
  • The second example is pretty different from the first and is obviously UB. `a` is 4 uninitialized bytes. You then write 1 byte. It now has 3 uninitialized bytes. – Barry Jul 02 '19 at 02:14

3 Answers

4

Is this not undefined behavior

It's UB, because you're copying into the wrong type. [basic.types]/2 and /3 permit byte copying, but only between objects of the same type. You copied from a long long into an int. That has nothing to do with the value being indeterminate. Even though you're only copying sizeof(int) bytes, the fact that you're not copying from an actual int means that you don't get the protection of those rules.

If you were copying into an object of the same type, then [basic.types]/3 says that it's equivalent to simply assigning them. That is, a "shall subsequently hold the same value as" b.
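To make the same-type case concrete, here is a minimal sketch. The helper name byte_copy_same_type and the check function are made up for illustration and are not from the standard. It only covers the situation [basic.types]/3 describes: source and destination have the same trivially copyable type.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical helper: byte-copies `src` into a new object of the SAME type.
template <typename T>
T byte_copy_same_type(const T& src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte copies are only described for trivially copyable types");
    T dest;                               // indeterminate at this point
    std::memcpy(&dest, &src, sizeof(T));  // every byte of dest is overwritten
    return dest;                          // now holds the same value as src
}

// Usage: the copy compares equal to the original.
void check_same_type_copy() {
    const unsigned int b = 1u;
    const unsigned int a = byte_copy_same_type(b);
    assert(a == b);
}
```

Note that the destination starts out indeterminate here too, which is why the initial contents of the destination are not what decides the question.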

Nicol Bolas
  • 1
    But this isn't what I'm asking. I'm putting a determinate byte sequence into an indeterminate byte sequence. It's... the converse of your answer? Contrapositive? I'm bad with those logical negation terms... – Joel Jul 02 '19 at 01:15
  • Unfortunately, I'm still a downvote- I see bits about how to make things defined, but not something that says doing things differently is UB. I'm editing the question to be unsigned ints/longs, because it makes things more obvious: what happens if the byte I copy happens to have a valid value? What I'm really asking is, is this UB (legally allowed to erase my hard drive) or implementation-defined behavior (the compiler docs should/may tell me what happens). – Joel Jul 02 '19 at 01:27
  • 3
    @Joel: "*not something that says doing things differently is UB*" [basic.types]/2&3 specify the behavior for doing byte copies to objects of the same types. There is no statement about the behavior of doing byte copies between objects of different types. I can't point to a thing that doesn't exist. – Nicol Bolas Jul 02 '19 at 02:02
  • If something doesn't exist, that's a problem with the spec, right? The spec should say what is guaranteed to happen under all possible inputs, or when guarantees don't apply. https://en.cppreference.com/w/cpp/language/ub – Joel Jul 02 '19 at 02:21
  • 5
    [When the spec doesn't say what happens, it's undefined.](https://timsong-cpp.github.io/cppwp/n4659/intro.defs#defns.undefined) – HTNW Jul 02 '19 at 02:25
  • @Joel: Well, either way, the main point is that whether it works has *nothing* to do with the initial contents of `a`. So it being indeterminate to start with or questioning whether the result is determinate is kind of beside the point. If the specification defines the behavior of such a byte copy somewhere, then it will be defined to have some value based on the given bytes. If the copy is undefined... then it's undefined. – Nicol Bolas Jul 02 '19 at 02:27
  • Thanks @HTNW. @NicolBolas: I think @LemonDrop has the answer. The copy is not undefined, because you can reinterpret_cast any type to `unsigned char`, and if the type has the right byte pattern then accessing it is not undefined. – Joel Jul 02 '19 at 02:34
  • 4
    @Joel: It's possible that accessing the byte representation via character lvalues is defined, but using the original static type of the object results in undefined behavior. For example, the bytes written might be the representation of a trap value. Interesting to note that in C, memcpy would transfer the dynamic type from source to destination, so strict aliasing would forbid use of the variable's static type since it is incompatible with the new dynamic type. – Ben Voigt Jul 02 '19 at 04:05
  • @BenVoigt I think it might be even more subtle than that. Apparently a variable can be represented as a register by the compiler. And some architectures' registers can have a hardware-defined trap state. So it could be undefined behavior in that case before you even access the bytes at all. – Joel Jul 02 '19 at 04:10
  • @BenVoigt _Interesting to note that in C, memcpy would transfer the dynamic type from source to destination_ First: effective type, not dynamic. Second: it would not. `memcpy`/`memmove` does not change the effective type if the destination has declared type. http://port70.net/~nsz/c/c11/n1570.html#6.5p6 – Language Lawyer Jul 02 '19 at 07:43
  • @Joel: A variable whose address is taken (and to use memcpy or memmove, it must be) can only be placed in a register under the as-if rule. The as-if rule doesn't allow changing visible behavior (such as triggering a trap). When the Standard says it is permissible to access any object's representation as an array of bytes, then it is permissible and no program transformation is allowed under as-if that would change this. The danger comes in when switching back to the original type (`unsigned int` in this question). – Ben Voigt Jul 02 '19 at 14:18
  • @NicolBolas I've added an answer, the main difference being 16.2/2: "The descriptions of many library functions rely on the C standard library for the semantics of those functions. (...) the behavior and the preconditions (including any preconditions implied by the use of an ISO C restrict qualifier) are the same unless otherwise stated." This, I believe, provides defined behavior for std::memcpy when the pointers are to different types. This is reflected by memcpy(void*, const void*, size_t)'s pointers to void in its signature. – Joel Jul 02 '19 at 17:03
  • 1
    [basic.types]/2,3 aren't relevant to OP's code. They provide a guarantee that for copies under the listed circumstances, the original value is recovered. OP isn't doing any of the things that those points define. The behaviour of OP's code is covered by other parts of the standard. – M.M Jul 02 '19 at 22:33
  • @LanguageLawyer fortunately C++ doesn't have anything like the C "effective type" rule, which is poorly specified and unclear what it means except for the most trivial cases. E.g. is the effective type transferred by memcpy if you copy some of the bytes of an `int` and memset the other bytes in the destination? And, clearly(?) we cannot have an `int`-sized object whose effective type is `long long`. – M.M Jul 02 '19 at 22:39
  • 1
    @M.M _The behaviour of OP's code is covered by other parts of the standard_ Or not covered at all. – Language Lawyer Jul 02 '19 at 22:51
  • @M.M I've just told that `memcpy` can't change the effective type of an object with a declared type. Yes, I've read some of C defect reports related to effective types, where the committee struggles to tell what exactly happens. – Language Lawyer Jul 02 '19 at 22:52
  • @LanguageLawyer: The whole notion of storage having "effective type" has so far as I can tell never actually been practical. N1570 6.5p7 would be perfectly workable, without the effective type rule, if one simply made the footnote about its purpose normative and recognized that "aliasing" requires conflicting operations on a region of storage using *seemingly-unrelated* references. – supercat Jul 08 '19 at 22:57
1

TL;DR: It's implementation-defined whether there will be undefined behavior or not. Proof-style, with lines of code numbered:


  1. unsigned int a;

The variable a is assumed to have automatic storage duration. Its lifetime begins (6.6.3/1). Since it is not a class, its lifetime begins with default initialization, in which no other initialization is performed (9.3/7.3).

  2. unsigned long long b = 1ull;

The variable b is assumed to have automatic storage duration. Its lifetime begins (6.6.3/1). Since it is not a class, its lifetime begins with copy-initialization (9.3/15).

  3. std::memcpy(&a, &b, sizeof(unsigned int));

Per 16.2/2, std::memcpy should have the same semantics and preconditions as the C standard library's memcpy. In the C standard 7.21.2.1, assuming sizeof(unsigned int) == 4, 4 characters are copied from the object pointed to by &b into the object pointed to by &a. (These two points are what is missing from other answers.)

At this point, the sizes of unsigned int, unsigned long long, their representations (e.g. endianness), and the size of a character are all implementation defined (to my understanding, see 6.7.1/4 and its note saying that ISO C 5.2.4.2.1 applies). I will assume that the implementation is little-endian, unsigned int is 32 bits, unsigned long long is 64 bits, and a character is 8 bits.

Now that I have said what the implementation is, I know that a has a value-representation for an unsigned int of 1u. Nothing, so far, has been undefined behavior.

  4. unsigned int c = a;

Now we access a. Then, 6.7/4 says that

For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined set of values.

I know now that the value of a is determined by the implementation-defined value bits in a, which I know hold the value-representation for 1u. The value of a is then 1u.

Then like (2), the variable c is copy-initialized to 1u.


We made use of implementation-defined values to find what happens. It is possible that the implementation-defined value of 1ull is not one of the implementation-defined set of values for unsigned int. In that case, accessing a will be undefined behavior, because the standard doesn't say what happens when you access a variable with a value-representation that is invalid.

AFAIK, we can take advantage of the fact that most implementations define an unsigned int where any possible bit pattern is a valid value-representation. Therefore, there will be no undefined behavior.
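As a rough sketch of the argument above, the assumption that every bit pattern is valid can be checked at compile time by comparing the number of value bits with the size of the object representation. The names first_bytes_of and check_first_bytes are made up for illustration, and this only shows what the reasoning predicts, not that the standard guarantees it.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <limits>

// Assumption from the answer: unsigned int has no padding bits, so every
// bit pattern in its object representation is a valid value.
static_assert(static_cast<std::size_t>(std::numeric_limits<unsigned int>::digits) ==
                  CHAR_BIT * sizeof(unsigned int),
              "unsigned int has padding bits on this implementation");

// The copy must not read past the end of the source object.
static_assert(sizeof(unsigned long long) >= sizeof(unsigned int),
              "source object is smaller than the destination");

// Hypothetical helper mirroring the code from the question.
unsigned int first_bytes_of(unsigned long long b) {
    unsigned int a;                             // indeterminate
    std::memcpy(&a, &b, sizeof(unsigned int));  // all bytes of a are written
    return a;                                   // read a, as `c = a` does
}

void check_first_bytes() {
    const unsigned int c = first_bytes_of(1ull);
    // 1 when the low-order bytes are stored first (little-endian);
    // 0 on a big-endian implementation with a wider unsigned long long.
    assert(c == 1u || c == 0u);
}
```

Which of the two values comes out is exactly the implementation-defined part of the argument.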

Joel
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/195891/discussion-on-answer-by-joel-does-stdmemcpy-make-its-destination-determinate). – Samuel Liew Jul 03 '19 at 01:43
0

Note: I updated this answer, since exploring the issue further in some of the comments has revealed cases I did not consider originally where it would be implementation-defined or even undefined (specifically in C++17 as well).

I believe that this is implementation-defined behavior in some cases and undefined in others (as another answer concluded for similar reasons). In a sense, whether it is undefined or implementation-defined is itself implementation-defined, so I am not sure whether it being undefined in general takes precedence in such a classification.

std::memcpy works entirely on the object representation of the types in question (by aliasing the given pointers as unsigned char, as permitted by 6.10/8.8 [basic.lval]). If the bits within the relevant bytes of the unsigned long long are guaranteed to be something specific, then you can manipulate them however you wish or write them into the object representation of any other type. The destination type will then use those bits to form its value based on its value representation (whatever that may be), as defined in 6.9/4 [basic.types]:

The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T). The value representation of an object is the set of bits that hold the value of type T. For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined set of values.

And that:

The intent is that the memory model of C++ is compatible with that of ISO/IEC 9899 Programming Language C.
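To make the quoted definition of the object representation a bit more tangible, here is a small sketch that copies an unsigned int into an array of sizeof(unsigned int) unsigned chars and back. The names UIntBytes, object_representation and value_from_bytes are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// The N unsigned char objects that make up an unsigned int, N == sizeof(T).
using UIntBytes = std::array<unsigned char, sizeof(unsigned int)>;

// Copies the object representation of `value` into a byte array.
UIntBytes object_representation(unsigned int value) {
    UIntBytes bytes{};
    std::memcpy(bytes.data(), &value, sizeof(value));
    return bytes;
}

// Writes the bytes back into an unsigned int and reads its value.
unsigned int value_from_bytes(const UIntBytes& bytes) {
    unsigned int value;
    std::memcpy(&value, bytes.data(), sizeof(value));
    return value;
}

void check_round_trip() {
    const unsigned int original = 1u;
    const UIntBytes bytes = object_representation(original);
    assert(bytes.size() == sizeof(unsigned int));
    assert(value_from_bytes(bytes) == original);
}
```

The order of the bytes inside the array is not fixed by this, only that copying them back recovers the original value.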

Knowing this, all that matters now is what the object representations of the integer types in question are. According to 6.9.1/7 [basic.fundamental]:

Types bool, char, char16_t, char32_t, wchar_t, and the signed and unsigned integer types are collectively called integral types. A synonym for integral type is integer type. The representations of integral types shall define values by use of a pure binary numeration system. [Example: This International Standard permits two’s complement, ones’ complement and signed magnitude representations for integral types. — end example ]

A footnote does clarify the definition of "binary numeration system" however:

A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position. (Adapted from the American National Dictionary for Information Processing Systems.)

We also know that unsigned integers have the same value representation as signed integers, just under a modulus according to 6.9.1/4 [basic.fundamental]:

Unsigned integers shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer.

While this does not say exactly what the value representation may be, based on the specified definition of a binary numeration system, successive bits are to be additive powers of two as expected (rather than allowing the bits to be in any given order), with the exception of the possibly present sign bit. Additionally, since signed and unsigned types share a value representation, an unsigned integer will be stored as an increasing binary sequence up until 2^(n-1) (past that point, things are implementation-defined, depending on how signed numbers are handled).
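The two quoted properties (additive powers of two, and arithmetic modulo 2^n) can be illustrated with a short sketch. The names kValueBits, sum_of_bit_weights and wrap_around are made up for illustration. It says nothing about how the bits are laid out in memory, only about the values they represent.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Number of value bits (n) in unsigned int, excluding any padding bits.
constexpr int kValueBits = std::numeric_limits<unsigned int>::digits;

// Adds up the weight of every value bit: 2^0 + 2^1 + ... + 2^(n-1).
unsigned int sum_of_bit_weights() {
    unsigned int sum = 0;
    for (int i = 0; i < kValueBits; ++i) {
        sum += 1u << i;
    }
    return sum;
}

// Unsigned arithmetic wraps modulo 2^n, so the largest value plus one is zero.
unsigned int wrap_around() {
    return UINT_MAX + 1u;
}

void check_binary_numeration() {
    assert(sum_of_bit_weights() == UINT_MAX);  // 2^n - 1
    assert(wrap_around() == 0u);
}
```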

There are still some other considerations, however, such as endianness and how many padding bits may be present, since sizeof(T) only measures the size of the object representation rather than the value representation (as stated before). Since C++17 has no standard way (I think) to check for endianness, this is the main factor that leaves the outcome implementation-defined. As for padding bits, they may be present (from what I can tell, their location is not specified, other than the implication that they will not interrupt the contiguous sequence of bits forming the value representation of an integer), and writing to them can prove problematic. Since the C++ memory model is intended to be "compatible" with that of C, a footnote from 6.2.6.2 of the C99 standard (which the C++20 standard references in a note as a reminder of this) is relevant. It says the following:

Some combinations of padding bits might generate trap representations, for example, if one padding bit is a parity bit. Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types. All other combinations of padding bits are alternative object representations of the value specified by the value bits.

This implies that writing directly to padding bits incorrectly could potentially generate a trap representation from what I can tell.

This shows that in some cases depending on if padding bits are present and endianness, the result can be influenced in an implementation-defined manner. If some combination of padding bits is also a trap representation, this may become undefined behavior.

While this is not possible in C++17, in C++20 one can use std::endian in conjunction with std::has_unique_object_representations<T> (which was already present in C++17), or some math with CHAR_BIT, UINT_MAX/ULLONG_MAX and the sizeof of those types, to check that the endianness is as expected and that there are no padding bits. This allows the copy to actually produce the expected result in a defined manner, given what was previously established about how integers are stored. Of course, C++20 also refines this further and specifies that integers are to be stored in two's complement alone, eliminating further implementation-specific issues.
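A sketch of such a guard, written without C++20 features so it also compiles as C++11: the padding check uses the CHAR_BIT math mentioned above, and a runtime probe of the first byte stands in for std::endian. The names lowest_byte_comes_first and low_part are made up for illustration, and treating "first byte holds the lowest bits" as little-endian is an assumption that ignores exotic byte orders.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <limits>

// No padding bits in either type: value bits fill the whole object.
static_assert(static_cast<std::size_t>(std::numeric_limits<unsigned int>::digits) ==
                  CHAR_BIT * sizeof(unsigned int),
              "unsigned int has padding bits");
static_assert(static_cast<std::size_t>(std::numeric_limits<unsigned long long>::digits) ==
                  CHAR_BIT * sizeof(unsigned long long),
              "unsigned long long has padding bits");

// Runtime probe: is the lowest-order byte stored at the lowest address?
bool lowest_byte_comes_first() {
    const unsigned int probe = 1u;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1u;
}

// Returns the low-order part of b as an unsigned int.
unsigned int low_part(unsigned long long b) {
    if (lowest_byte_comes_first()) {
        unsigned int a;
        std::memcpy(&a, &b, sizeof(unsigned int));  // the copy from the question
        return a;
    }
    // Fallback that does not depend on byte order: conversion modulo 2^n.
    return static_cast<unsigned int>(b);
}

void check_low_part() {
    assert(low_part(1ull) == 1u);
    assert(low_part(ULLONG_MAX) == UINT_MAX);
}
```

In practice the plain static_cast already gives the low-order part on every implementation, so the memcpy branch is only there to show when the byte copy agrees with it.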

Lemon Drop
  • 1
    @Joel This answer says that it is defined behavior, the other (upvoted) answer says that it's undefined behavior. Clearly only one of them can be the right answer and people who are voting find that the other one is the correct one. There's a lot of discussion on both of the other answers explaining their reasoning so I'm not sure what else you are looking for. – JJJ Jul 02 '19 at 14:47
  • 1
    I'm looking for a justification for why this is wrong. I'm going to write out my own math-style proof, because I believe the standard says that what matters upon access is the bits stored in the value part of the variable. If we can prove there are bits there, and we know what they are, and we got here through defined behavior, then the behavior on access depends on what those bits do. If those bits are implementation-defined as valid, then the behavior should be implementation-defined to some extent. If they are invalid then the behavior may be undefined. – Joel Jul 02 '19 at 14:55
  • TL;DR: Nobody's said what's wrong with this answer. – Joel Jul 02 '19 at 14:58
  • You can also confirm the absence of padding bits by checking `UINT_MAX` and `CHAR_BIT * sizeof(unsigned int)` – M.M Jul 02 '19 at 22:26
  • @M.M Yeah I suppose that could work if you do the math at compile time to see how many bits `UINT_MAX` needs to be represented properly by, I guess I can modify it to mention that (though `std::has_unique_object_representations` does this task already without needing to do that math). – Lemon Drop Jul 02 '19 at 22:31