int main(){
  int v = 1;
  char* ptr = reinterpret_cast<char*>(&v);
  char r = *ptr; //#1
}

In this snippet, the expression ptr points to an object of type int, as per:
expr.static.cast#13

Otherwise, the pointer value is unchanged by the conversion.

Indirection through ptr yields a glvalue that denotes the object ptr points to, as per
expr.unary#op-1

the result is an lvalue referring to the object or function to which the expression points.

Accessing an object through a glvalue of a permitted type does not result in UB, as per
basic.lval#11

If a program attempts to access ([defns.access]) the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:

  • a char, unsigned char, or std​::​byte type.

It seems it also does not violate the following rule:
expr#pre-4

If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined.

Assume that char is 8 bits wide in the test environment, so its range is [-128, 127]. The value of v is 1. So, does that mean the snippet at #1 does not result in UB?
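
The range assumption above can be checked rather than assumed. The following is only a minimal sketch: `fits_in_char` and `narrow_checked` are hypothetical helper names (not from the standard), and the code says nothing about whether reading an int through a char glvalue is defined. It only tests whether a given value is representable in plain char on the current implementation.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// The standard guarantees only that char has at least 8 bits;
// whether plain char is signed is implementation-defined.
static_assert(CHAR_BIT >= 8, "char is at least 8 bits wide");

// Hypothetical helper: true if `value` is representable in plain char here.
constexpr bool fits_in_char(long long value) {
    return value >= std::numeric_limits<char>::min() &&
           value <= std::numeric_limits<char>::max();
}

// Hypothetical helper: converts only values that are known to fit.
inline char narrow_checked(long long value) {
    assert(fits_in_char(value));
    return static_cast<char>(value);
}
```

Under the 8-bit assumption, `fits_in_char(1)` holds and `fits_in_char(2147483647)` does not, which is the distinction the two snippets in this question turn on.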

By contrast, consider the following example:

int main(){
  int v =  2147483647; // or any value greater than 127
  char* ptr = reinterpret_cast<char*>(&v);
  char r = *ptr; //#2
}

#2 would be UB, right?

xmh0511
  • The language allows for multiple objects to have the same address, provided they have different types (they overlap). It is the intention of the language that every object overlaps with an array of `unsigned char` of the same size, which represents that object's memory representation. That is what your pointer would be pointing to. However, I *believe* that there is a language defect in that it is never explicitly stated that the resulting `unsigned char` pointer points to an actual array. Currently you need to use `memcpy` to access an object's memory representation. – François Andrieux Mar 24 '21 at 14:25
  • BTW, depending on endianness, `*ptr` might be `'\0'`. – Jarod42 Mar 24 '21 at 14:38
  • `char` is not necessarily signed. (I would say the portable range is [0, 127].) – Jarod42 Mar 24 '21 at 14:40
  • @Jarod42 Regardless of whether its underlying type is signed or unsigned char, and regardless of big or little endian, the first snippet is always well-defined, right? – xmh0511 Mar 24 '21 at 14:50
  • @jackX I believe that is an open question. As far as I have been able to find, it is defined by non-normative notes only. So strictly speaking, it may be UB. But the *intention* is clearly for both examples to be well defined, at least implementation defined. – François Andrieux Mar 24 '21 at 14:52
  • @FrançoisAndrieux So, in my first snippet, Is it well-defined? – xmh0511 Mar 24 '21 at 14:53
  • @FrançoisAndrieux Which rule in the standard indicates that the first case is UB? – xmh0511 Mar 24 '21 at 14:54
  • @jackX: Pedantically, there is no `char` object at `ptr`. – Jarod42 Mar 24 '21 at 14:58
  • @jackX I said I believe it is an open question. That means there doesn't seem to be a definite answer. I've seen good arguments for both cases. The argument for the position that it is UB is an argument of omission. The standard never actually says what happens at `*ptr`. My information is based on discussions on C++17, it may have changed in C++20. – François Andrieux Mar 24 '21 at 15:01
  • @Jarod42 There is indeed no object of type char; however, `*ptr` is an lvalue of type char, and accessing the actual object of type int through that glvalue is well-defined by basic.lval#11. And as I said, applying the lvalue-to-rvalue conversion to the glvalue would produce a prvalue of type char with the value contained in the actual object, which does not exceed the range of char anyway. – xmh0511 Mar 24 '21 at 15:05
  • It doesn't break those rules, but I think it (unfortunately) breaks other rules: per [expr.unary#op-1](https://eel.is/c++draft/expr.unary#op-1), `*ptr` returns the `char` object pointed to by `ptr`, but object creation is restricted by [intro.object](https://eel.is/c++draft/intro.object). I think we might be correct with the convoluted `alignas(int) char buffer[sizeof(int)]; int* pi = new (buffer) int; char* ptr = std::launder(reinterpret_cast<char*>(pi));` (the storage creates the char array, whereas `int i;` doesn't). – Jarod42 Mar 24 '21 at 15:31
  • @FrançoisAndrieux: Are you sure about "it is never explicitly stated that the resulting unsigned char pointer points to an actual array"? See https://eel.is/c++draft/basic.types#general-4 – Ben Voigt Mar 24 '21 at 17:30
  • @BenVoigt As the linked [defect report](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1839r2.pdf) complains, a sequence is not an array. Edit: Though it seems clear that it was *intended* for an object representation to be an array. See [footnote](https://eel.is/c++draft/basic.types#footnote-35). – François Andrieux Mar 24 '21 at 17:59
  • @FrançoisAndrieux: But a sequence which is inside a single complete object satisfies the requirements for pointer arithmetic, does it not? (I know it used to) Why exactly do we care if it is an "array", what does that have that a sequence is missing (in the guaranteed contiguous case)? – Ben Voigt Mar 24 '21 at 18:21
  • @BenVoigt All it is missing is that it isn't called an array. Pointer arithmetic is defined for elements of an array (the term array is used [link](https://timsong-cpp.github.io/cppwp/expr.add#:operator,subtraction)) and makes no mention of sequences. So, intuitively, they should be equivalent, but a compiler would be within its rights to wreak havoc on your code if it spots it. I would be surprised if any compiler would ever do that; compiler implementers clearly understand the intent of the language, but it is technically UB because a compiler wouldn't be in the wrong if it did. – François Andrieux Mar 24 '21 at 18:26
  • Kinda dup of https://stackoverflow.com/questions/54169489/is-c17-implementable-on-big-endian-platforms – Language Lawyer Mar 24 '21 at 19:55
  • Yes, #1 is not UB and the value of `r` is guaranteed to be `1` by [this](https://timsong-cpp.github.io/cppwp/n4659/conv.lval#3.4). #2 is UB because the value of the `*ptr` would be `2147483647`, which is not in the range representable by `char` (unless it is an implementation where the range of `char` is the same as the range of `int`). – Language Lawyer Mar 24 '21 at 20:18
  • @Jarod42 _depending of endianness, *ptr might be '\0'_ as it is currently worded, the value of `*ptr` doesn't depend on endianness. – Language Lawyer Mar 24 '21 at 20:19
  • @FrançoisAndrieux: Ahh yeah, I seem to have once again confused the precondition for pointer addition with the precondition for pointer comparison (which is allowed between two pointers into the same complete object). I never keep those two straight. – Ben Voigt Mar 24 '21 at 20:19
  • @LanguageLawyer The value of `*ptr` does depend on Endianness. In the LE, it might be `1` instead in the BE, it might be `0`. – xmh0511 Mar 25 '21 at 02:35
  • @jackX https://timsong-cpp.github.io/cppwp/n4861/conv.lval#3.4.sentence-1 – Language Lawyer Mar 25 '21 at 02:42
  • @LanguageLawyer l-to-r is a process of reading value, however it does not explicitly how to read the value. Assume it is `mov eax, byte ptr [address of ptr]`. – xmh0511 Mar 25 '21 at 02:43
  • @jackX «how» is an implementation's headache. It says: the result is the value of the object denoted by the glvalue. And the object is `v` with value `1` or `2147483647`. – Language Lawyer Mar 25 '21 at 02:45
  • @LanguageLawyer It does say *the value **contained** in the object is the prvalue result*, which means it is not necessay to be the value that is represented by the **entire** value representation. – xmh0511 Mar 25 '21 at 02:49
  • @jackX I think we've already discussed it. An object can't contain multiple values. AFAIK it is the current C++ object model. BTW, «a char, unsigned char, or std​::​byte type» strict aliasing bullet is obsolete, since even after P1839R2 access of an element of an object representation won't be considered an access of the object (and vice versa). – Language Lawyer Mar 25 '21 at 02:56
  • @LanguageLawyer AFAIK, when we transfer data of type `int` over tcp by using `char`, the endianness of the platform is significant. P1839R2 is intended precisely to fix this vagueness in the current standard. *An object can't contain multiple values* << yes, it does. However, note *For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined **set of values**.* which might mean a bit is an atomic unit... – xmh0511 Mar 25 '21 at 03:17
  • @LanguageLawyer ... several bits compose a char and several unsigned chars compose an object (sequence of N unsigned char objects) – xmh0511 Mar 25 '21 at 03:22
  • @jackX [I read `int i = -1; auto u = reinterpret_cast(i);` as reading `-1` and thus triggering UB](https://lists.isocpp.org/sg12/2019/06/0802.php) and [zygloid seems to agree](https://lists.isocpp.org/sg12/2019/06/0803.php). I think this supports my reading of [conv.lval]/(3.4). _when we transfer the data of type int to tcp by using char, endianness of the platform is significant_ Practically yes, but the Standard doesn't support such sending of objects now. And even for trivially copyable types, bit pattern in their object/value representation doesn't uniquely determine the value. – Language Lawyer Mar 25 '21 at 03:25
  • @LanguageLawyer After thinking a moment, I would say even if the value could be represented in the range of `char`, it might still be UB since [basic.life] *The properties ascribed to objects and references throughout this document apply for a given object or reference only during its lifetime.* there's no rule in the standard says that the lifetime of `char` would overlap with the object `v` of type int. – xmh0511 Mar 25 '21 at 03:27
  • @jackX `reinterpret_cast(v)` denotes an object of type `int` which is within lifetime. – Language Lawyer Mar 25 '21 at 03:28
  • @LanguageLawyer That's a subtle different. That OP trys to state that an signed integer and its corresponding unsigned type can read each other by using a glvalue of the unsigned/signed of that type. As I said, how to read the value of an object of in through using a glvalue of type char is unspecified. The implementation could either read the whole value into the register and then convert the value(prvalue of int) to another value( the prvalue of char) or truncate the value(this way should concern the endianness). Such two way would have quite different result. – xmh0511 Mar 25 '21 at 03:38
  • @xmh0511 _That OP trys to state that an signed integer and its corresponding unsigned type can read each other by using a glvalue of the unsigned/signed of that type_ That OP doesn't try to state this. This is well-known without OP. – Language Lawyer Mar 25 '21 at 03:48
  • @xmh0511 _That's a subtle different._ From [conv.lval](3.4) POV, what is the difference between reading an `int` object through a glvalue of `unsigned` or `char` type? There seem to be agreement that **the** value of the `int` object is read in the `unsigned` case. Not that `int` object's value representation bit pattern is interpreted as a bit pattern of an object of `unsigned` type. I.e. such conversion can produce `-1` if the `int` object stores `-1`. – Language Lawyer Mar 25 '21 at 03:49
  • @LanguageLawyer *An unsigned integer type has the same object representation, value representation, and alignment requirements ([basic.align]) as the corresponding signed integer type.* Since `unsigned int` and `int` has the same object representation and value representation, it's the difference here.`char` and `int` does not have the same width of bits. – xmh0511 Mar 25 '21 at 05:01
  • @xmh0511 Where does [conv.lval]/(3.4) care about same/different representation? – Language Lawyer Mar 25 '21 at 05:04
  • @LanguageLawyer [conv.lval]/(3.4) indeed does not care the representation. Intuitively, two object of the same type except that the signedness, it is easy to process, it merely to read the same width of bits into the register. How do you interpret that in the BE platform, `c` is zero? – xmh0511 Mar 25 '21 at 05:11
  • @xmh0511 _How do you interpret that in the BE platform, c is zero?_ [The implementation is defective, not conforming.](https://i.kym-cdn.com/photos/images/newsfeed/000/925/494/218.png_large) – Language Lawyer Mar 25 '21 at 05:21
  • @LanguageLawyer `struct A{int a = 1; int b = 0;} object; char c = (char&)object`. What do you think the value of `c` is? – xmh0511 Mar 25 '21 at 05:29
  • @xmh0511 UB because of [expr.pre]/4. I very much doubt `char` type can represent a value of a `struct A { ... };` object. – Language Lawyer Mar 25 '21 at 05:31
  • @LanguageLawyer I don't know how do you determine whether the value can be represented or not. if it were `struct A{int a = 1;} object;` Is the value of `object` can be represented? – xmh0511 Mar 25 '21 at 05:34
  • _I don't know how do you determine whether the value can be represented or not_ The standard doesn't say that the range of values of type `struct A { /*whatever*/ };` is [some_int_value, another_int_value] — no reason to assume this. (Yes, this might be a bit weak, if `/*whatever*/` is just `int a = 1;`, but I think the value of `object` is not the value of `object.a` e.g. because `object.a` can be outside lifetime when `object` is not and this prolly should somehow be a part of what we call «`object`'s value».) – Language Lawyer Mar 25 '21 at 05:42
  • @LanguageLawyer So, in general. You think `*ptr` always be 1 when v is `1` regardless of LE or BE? If so, how do you explain the existence of `ntohl(), htonl()`? See https://www.geeksforgeeks.org/little-and-big-endian-mystery/. If it is as you said, then `#include <iostream> using namespace std; int main() { unsigned int i = 1; char *c = (char*)&i; if (*c) cout<<"Little endian"; else cout<<"Big endian"; return 0; } ` wouldn't determine which the platform is. – xmh0511 Mar 25 '21 at 05:53
  • _So, in general. You think *ptr always be 1 when v is 1 regardless of LE or BE?_ This what the Standard literally says. (For unknown reasons, this makes some ppl feel uncomfortable and they want to change this, like in P1839.) _then ... wouldn't determin which the platform is_ On a literally conforming implementation ­— it won't. – Language Lawyer Mar 25 '21 at 05:59
  • @LanguageLawyer In other words, Can it be considered as a defect of the standard? After all, all the platform are concerning the LE or BE, especialy when we do `tcp` programming. – xmh0511 Mar 25 '21 at 06:09
  • P1839 says that EWG told CWG to fix this as a DR. – Language Lawyer Mar 25 '21 at 06:10
  • @LanguageLawyer I saw your point. From the point of the current standard, the answers of the above asking is true. However, what the standard says does not conform to what actually these main implementations do and P1839 trys to amend these rules to make it be comfortable with these implementations? – xmh0511 Mar 25 '21 at 06:16
  • _However, what the standard says does not conform to what actually these main implementations do and P1839 trys to amend these rules to make it be comfortable with these implementations?_ Yes. – Language Lawyer Mar 25 '21 at 06:51
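
The TCP concern raised in the comments above can be sidestepped without touching the object representation at all. The sketch below is an illustration under stated assumptions: `to_big_endian`, `from_big_endian` and `check_round_trip` are hypothetical helpers (not `htonl`/`ntohl`), and the wire format is assumed to be a 32-bit big-endian integer. It works on the value with shifts and masks, so the result does not depend on host endianness or on how `*ptr` is specified.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encodes `value` most-significant byte first.
std::array<unsigned char, 4> to_big_endian(std::uint32_t value) {
    std::array<unsigned char, 4> out{};
    out[0] = static_cast<unsigned char>((value >> 24) & 0xFFu);
    out[1] = static_cast<unsigned char>((value >> 16) & 0xFFu);
    out[2] = static_cast<unsigned char>((value >> 8) & 0xFFu);
    out[3] = static_cast<unsigned char>(value & 0xFFu);
    return out;
}

// Hypothetical helper: decodes four big-endian bytes back into a value.
std::uint32_t from_big_endian(const std::array<unsigned char, 4>& bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           static_cast<std::uint32_t>(bytes[3]);
}

// Encoding followed by decoding should reproduce the original value.
void check_round_trip(std::uint32_t value) {
    assert(from_big_endian(to_big_endian(value)) == value);
}
```

Because only arithmetic on values is involved, this sketch is unaffected by the open question about reading an int through a char glvalue.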

2 Answers


Does it mean the snippet at #1 does not result in UB?

Yes, the quoted rules mean that #1 is well defined.

#2 would be UB, Right?

No, as per the quoted rules, the behaviour of #2 is also well defined.

The type of ptr is char*, so the type of the expression *ptr is char, whose value cannot exceed the range representable by char; thus expr#pre-4 does not apply.

Assume the width of char in the test circumstance is 8 bits, its range is [-128, 127].

This assumption is not necessary in order for #1 to be well defined.

The value of v is 1

This does not follow from the above assumption alone. It may be practically true in case of a little endian CPU (including the previous assumptions) although the standard doesn't specify the representation exactly.

eerorika
  • I don't think the cited rules define the behavior of the snippets. There was a [similar question](https://stackoverflow.com/questions/63724182/can-you-access-the-object-representation-of-any-object-through-a-char) recently. I believe this question falls under [this defect report](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1839r2.pdf). The intention of the language is for this question's code to be implementation defined, but it is currently technically UB due to a defect in the standard. – François Andrieux Mar 24 '21 at 15:09
  • @FrançoisAndrieux Ah. I hadn't realised that this (inadvertently) changed in C++17. – eerorika Mar 24 '21 at 15:16
  • @eerorika The value of the object in the second case does exceed the range of char. – xmh0511 Mar 24 '21 at 15:17
  • @jackX The value of char exceeds the range of char? – eerorika Mar 24 '21 at 15:18
  • @eerorika `*ptr` denotes the object of type int, since an object of type char is not pointer-interconvertible with an object of type int. After the conversion, ptr still points to `v`. – xmh0511 Mar 24 '21 at 15:26
  • @jackX: we don't do `char c = 128;`, the question is more *"is bit pattern (`0b10000000`?) correct for a `char`?"* – Jarod42 Mar 24 '21 at 15:35
  • @Jarod42 The standard does not say that indirection through `ptr` would obtain the bit pattern (`0b10000000`). – xmh0511 Mar 24 '21 at 15:43
  • @jackX Arguably it doesn't say what happens in this case at all, which is why it should be treated as UB. – François Andrieux Mar 24 '21 at 15:46
  • @FrançoisAndrieux If two objects are not pointer-interconvertible, the pointer value is unchanged, which means the pointer still points to the original object as per [basic.compound](https://timsong-cpp.github.io/cppwp/n4861/basic.compound#def:represents_the_address). So, indirection through `ptr` still denotes the original object it points to. – xmh0511 Mar 24 '21 at 15:50
  • @jackX I've tried to address this in my answer. The error in logic is to assume that if two pointers point to the same address, they point to the same object. This is not necessarily true when the pointer types are different. Two objects of different type can have the same address, notably the first member of a standard layout type. Which object is pointed to (the owning instance or its member) depends on the static pointer type. This is similar to what is happening here. An object is intended to overlap with an array of bytes which is its memory representation. – François Andrieux Mar 24 '21 at 15:54
  • @FrançoisAndrieux Note "If two objects are pointer-interconvertible, then they have the same address, and it is possible to obtain a pointer to one from a pointer to the other via a reinterpret_­cast. " and [intro.object#9](https://timsong-cpp.github.io/cppwp/n4861/intro.object#9). There's no rule in the current standard that says an array of bytes and the object it represents have overlapping lifetimes. – xmh0511 Mar 24 '21 at 16:01
  • @jackX The problem with interpreting the standard is that its information is poorly localized. There are often other sections that expand on a given rule. You cannot assume that a passage completely defines its topics; any other section could introduce new information or exceptions. See [object representation](https://timsong-cpp.github.io/cppwp/basic.types.general#def:representation,object). – François Andrieux Mar 24 '21 at 16:03
  • @FrançoisAndrieux However, according to [basic.compound#4], we cannot obtain a pointer to char from a pointer to int anyhow, can we? – xmh0511 Mar 24 '21 at 16:10
  • An object and its representation are not currently pointer-interconvertible, as far as I understand it. Changing that could be part of a possible solution to this defect. But it seems the current proposal is to instead amend the definition of `reinterpret_cast`. Edit: Among other changes, it proposes adding "— If `a` occupies contiguous bytes of storage and `T2` is `unsigned char`, `char` or `std::byte`, the result is a pointer to the first element of the object representation of `a`." to [expr.reinterpret.cast#7](https://timsong-cpp.github.io/cppwp/expr.reinterpret.cast#7). – François Andrieux Mar 24 '21 at 17:07
  • _The type of ptr is char*, therefore the type of the expression *ptr is char whose value cannot exceed the value representable by char_ Sounds like «If the type of an expression `E1+E2` is `int`, its value cannot exceed the value representable by `int`» – Language Lawyer Mar 24 '21 at 20:06
  • @LanguageLawyer Violent agreement. What the value of `*ptr` is depends on how the compiler interprets the bit pattern. If the pattern is `11111111` and the compiler interprets it as an unsigned value, it would be `255`, and if the underlying type chosen for `char` is `signed char`, that causes UB. – xmh0511 Mar 25 '21 at 02:42

It is the intention of the language that both snippets be implementation-defined. I believe they were until C++17, which broke support for that language feature. See the defect report (P1839R2). As far as I know, this has not been fixed in C++20.

Currently, the portable workaround for accessing an object's memory representation is to use std::memcpy:

#include <cstring>

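// Reads the first byte through a reinterpret_cast'ed pointer (technically UB).
char foo(int v){
  return *reinterpret_cast<char*>(&v);
}

// Copies the object representation into a char buffer first (well defined).
char bar(int v)
{
    char buffer[sizeof(v)];
    std::memcpy(buffer, &v, sizeof(v));
    return *buffer;
}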

foo is technically UB while bar is well defined. The reason foo is UB is omission: anything the standard fails to define is by definition UB, and the standard, in its current state, fails to define the behavior of this code.

bar produces the same assembly as foo with gcc 10. For simple cases, the actual copy is optimized out.
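
The same memcpy approach gives a well-defined counterpart to the `(char*)&i` endianness probe quoted in the comments. This is a minimal sketch, not a definitive detector: `first_byte`, `looks_little_endian` and `self_check` are hypothetical names, and it assumes `unsigned int` has no padding bits and is wider than one byte.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: copies the lowest-addressed byte of `value` out
// of its object representation, without dereferencing a cast pointer.
unsigned char first_byte(unsigned int value) {
    unsigned char byte = 0;
    std::memcpy(&byte, &value, 1);
    return byte;
}

// Hypothetical helper: true if the low-order byte is stored first.
bool looks_little_endian() {
    static_assert(sizeof(unsigned int) > 1, "probe needs a multi-byte type");
    return first_byte(1u) == 1;
}

// An all-zero value has an all-zero first byte on any byte order.
void self_check() {
    assert(first_byte(0u) == 0);
}
```

On a big-endian platform `looks_little_endian()` would be expected to return false; either way the byte is obtained through memcpy, as in bar above.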

Regarding your rationale, the reasoning seems sound except that, in my opinion, the rules defining unary operator* (expr.static.cast#13) don't have the effect you expect in this case. The pointer must point to the underlying representation, which is poorly defined as the linked defect describes. The fact that the pointer's value doesn't change does not mitigate the fact that it points to a different object. C++ allows objects to have the same address if their types are different, such as the first member in a standard layout class sharing the same address as the owning instance.
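
The standard-layout case mentioned above can be illustrated directly. This is only a sketch: `Wrapper`, `first_member_shares_address` and `read_first_via_cast` are hypothetical names. The cast is permitted here because a standard-layout class object and its first non-static data member are pointer-interconvertible, which, as noted in the comments, is not currently established for an int and its byte representation.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical standard-layout class used for illustration.
struct Wrapper {
    int first;
    double second;
};
static_assert(std::is_standard_layout<Wrapper>::value,
              "Wrapper must be standard layout for the cast below");

// Two objects of different types (Wrapper and int) at the same address.
bool first_member_shares_address(Wrapper& w) {
    int* p = reinterpret_cast<int*>(&w);
    return p == &w.first;
}

// Reads the first member through the converted pointer.
int read_first_via_cast(Wrapper& w) {
    assert(first_member_shares_address(w));
    return *reinterpret_cast<int*>(&w);
}
```

Here the static pointer type selects which of the two overlapping objects is denoted: `Wrapper*` for the whole instance, `int*` for its first member.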

Note that the author of the defect report came to the same conclusion as you regarding snippet #1, but I disagree. However, because we are dealing with a language defect, and one that conflicts with stated intentions, it is hard to definitively prove one behavior correct. The fundamental rules these arguments would be based on are known to be flawed in this particular case.

François Andrieux
  • But does the standard say anything about `memcpy` being well defined? I always assumed `memcpy` was well defined **because** it reinterprets the data as array of `char`. – eerorika Mar 24 '21 at 16:03
  • *such as the first member in a standard layout class sharing the same address as the owning instance.* Because they're **pointer-interconvertible** – xmh0511 Mar 24 '21 at 16:15
  • @eerorika: `memcpy` is provided by the compiler, so you cannot use its implementation to know whether it is well formed or not. There are several standard implementations which would be UB if written by a regular user: `std::vector`, `std::less`, ... – Jarod42 Mar 24 '21 at 16:15
  • @eerorika I'm not sure of the wording that guarantees `memcpy` allows this; it refers to the C standard for its definition. But it is used in examples throughout the standard such as [here](https://timsong-cpp.github.io/cppwp/basic.types.general#3). I'm sorry I can't offer a more strict proof. – François Andrieux Mar 24 '21 at 17:02
  • @jackX Nonetheless it shows that objects of different types can have the same address. I'll also refer you to [this note](https://timsong-cpp.github.io/cppwp/basic.compound#note-4) which indicates that not all objects which share an address are pointer-interconvertible. – François Andrieux Mar 24 '21 at 17:05