
Someone recently brought it up that this:

uint8_t a = 0b10000000;
int8_t b = *(int8_t*) &a;

is undefined behavior, because the value of a is outside of what I can represent in int8_t. Can someone explain why exactly this is undefined behavior?

My main issue is that the memory is there and is valid as the memory for an int8_t; the only difference is that int8_t will interpret that byte as -128, while uint8_t will interpret it as 128. I am further confused by this because the fast inverse square root uses:

float y =  /* Some val*/;
int32_t i  = * ( int32_t * ) &y; 

This will give a value of i in essence unrelated (apart from the IEEE standard) to y, so I don't see why reinterpreting a piece of memory could be undefined behavior.

LoremIpsum
Lala5th
  • don't think in terms of memory, but in terms of what the standard specifies to be defined. (because sentences like "the memory is there" make only sense for code that does not have undefined behavior) – 463035818_is_not_an_ai Aug 10 '21 at 18:13
  • Well, this is UB because the standard says so. Just because it is fine under one memory model doesn't mean it will be fine under another. And yes, the standard fast inverse square root implementation has UB inside. It was written to solve a specific problem under a specific architecture (remember that UB doesn't mean "wrong", it may actually be correct, it only means that the behaviour is not covered by the C++ standard). What's the lesson here? Just because a piece of code is famous doesn't mean it is correct or well written. – freakish Aug 10 '21 at 18:15
  • I've said it before and I'll say it again: that the path of least resistance to learning C++ is "Syntax -> Compiled code behavior -> The Standard" is a huge problem. It makes erroneous assumptions like OP's almost unavoidable for people who are learning. –  Aug 10 '21 at 18:15
  • The implementation of fast inverse square root you [typically see](https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code) is also Undefined Behavior in C++. You would need to `memcpy` instead of casting pointers for it to be allowed in C++. – François Andrieux Aug 10 '21 at 18:17
  • The memory might not necessarily be there because the compiler will make assumptions that one type will never alias another type (except the core char/byte types). So the optimizer could reason that a certain block of memory is never accessed and therefore never allocate it. – Galik Aug 10 '21 at 18:17
  • @Frank: Are you sure? C++ draft N4659 6.10 8 lists a number of allowed types, including “a type that is the signed or unsigned type corresponding to the dynamic type of the object.” I do not follow C++ in the detail I do the C standard, but that seems to admit using `int8_t` for `uint8_t`, even if `int8_t` is not a `typedef` alias for `char`. – Eric Postpischil Aug 10 '21 at 18:18
  • @EricPostpischil Yes, you are correct. I just keep forgetting about aliasing signed and unsigned types because it comes up so rarely. –  Aug 10 '21 at 18:22
  • And `int8_t` and `uint8_t` are optional, so if you don't have an 8 bit datatype, the implementation can leave them out and give a compiler error rather than a trip to Bizarro world. – user4581301 Aug 10 '21 at 18:26
  • I reopened this question because the strict aliasing rule doesn't prevent OP's first code snippet from being valid. The second snippet is still a violation, but it seems auxiliary. – SergeyA Aug 10 '21 at 18:26
  • I think the first snippet is UB simply because signed int overflow is UB. – freakish Aug 10 '21 at 18:37
  • @freakish: There is no overflow. The bits 10000000 are interpreted as 8-bit two’s complement, in which they represent −128. That value is simply stored in `b`. There is no arithmetic operation or conversion to overflow. – Eric Postpischil Aug 10 '21 at 18:45
  • As @EricPostpischil states 2's-complement is a requirement now for a conforming platform/compiler. – Captain Giraffe Aug 10 '21 at 18:49
  • Where does the standard say that two's complement is mandatory? In fact, I can read here: https://en.cppreference.com/w/cpp/types/integer that the two's complement is mandatory only if supported. Which also means that in general it is not mandatory. – freakish Aug 10 '21 at 18:51
  • @EricPostpischil according to the standard, integer literals such as `0b10000000` (without any suffix) are represented as full integers. This integer is literally `128`, not some negative number dependent on representation. And it is outside of `int8_t` range. Or do I misunderstand something? – freakish Aug 10 '21 at 18:59
  • @freakish cppreference is not always correct, but here it specifies that only 2's complement is supported in the standard: https://en.cppreference.com/w/cpp/language/types (as of C++20). I don't have time to find where it is in the actual standard, though. Also, it isn't the literal that is interpreted, but the memory. – Lala5th Aug 10 '21 at 19:04
  • @freakish: The issue is not about a literal. The code `uint8_t a = 0b10000000;` is merely preparatory for the question; it stores 128 in `a` and is not what is being asked about. Once the bits 10000000 are in `a`, the code `*(int8_t*) &a;` fetches those bits and interprets them as `int8_t`. If the `int8_t` type is defined, it is two’s complement; the `<stdint.h>` exact-width types are specified to be two’s complement. If it is not defined, this code would not compile, but, again, that is not the issue being asked about. – Eric Postpischil Aug 10 '21 at 19:04

3 Answers


Thanks for all the comments. I went down a rabbit hole of strict aliasing and found that the fast inverse square root is undefined behavior, despite my beliefs, but my initial code does not seem to be. This is not because uint8_t is special, but because the standard has a rule for signed/unsigned interchange:

If a program attempts to access the stored value of an object through a glvalue whose type is not similar to one of the following types the behavior is undefined: [...] (11.2) a type that is the signed or unsigned type corresponding to the dynamic type of the object

So there is no issue in theory, as uint8_t is the unsigned type corresponding to int8_t.
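To make the rule quoted above concrete, here is a minimal sketch. The function names (`read_as_signed`, `copy_as_signed`, `check_views_agree`) are made up for illustration. The first function does what the question's first snippet does; the second gets the same value through `memcpy`, which does not depend on any aliasing rule at all.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads the byte stored in a uint8_t through an int8_t lvalue.
// This relies on the signed/unsigned aliasing rule quoted above.
inline std::int8_t read_as_signed(const std::uint8_t& u) {
    return *reinterpret_cast<const std::int8_t*>(&u);
}

// Same result via memcpy, which copies the object representation
// and does not rely on any aliasing permission.
inline std::int8_t copy_as_signed(std::uint8_t u) {
    static_assert(sizeof(std::int8_t) == sizeof(std::uint8_t),
                  "exact-width 8-bit types have the same size");
    std::int8_t s;
    std::memcpy(&s, &u, sizeof s);
    return s;
}

// Hypothetical helper: both routes should agree for any byte value.
inline void check_views_agree(std::uint8_t u) {
    assert(read_as_signed(u) == copy_as_signed(u));
}
```

With `std::uint8_t a = 0x80;` both functions should yield -128, since `int8_t`, when it exists, is two's complement with no padding bits.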

LoremIpsum
Lala5th
  • > "So there is no issue in theory, as uint8_t is the unsigned type of int8_t", I don't believe the standard requires this. – Fatih BAKIR Aug 10 '21 at 19:04
  • @FatihBAKIR what? You don't believe the above quote from the standard? You don't believe `uint8_t` is the unsigned type corresponding to `int8_t`? Can you explain? – Tim Randall Aug 10 '21 at 19:30
  • @TimRandall, the above quote does not say `uint8_t` is the unsigned type corresponding to `int8_t`. The standard does not preclude them from being unrelated types. As far as I can see, this is the only place `uint8_t` or `int8_t` are mentioned: https://eel.is/c++draft/cstdint.syn and it does not place any requirements on them. – Fatih BAKIR Aug 10 '21 at 19:40
  • @FatihBAKIR The definition of them is here in the C standard http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf (page 255). "When typedef names differing only in the absence or presence of the initial u are defined, they shall denote corresponding signed and unsigned types". The C++ standard explicitly defines that these types follow the C standard – Lala5th Aug 10 '21 at 19:52
  • @FatihBAKIR: C++ inherits the `<stdint.h>` types from C (the C++ standard refers to the C standard for the specifications of these imported headers). C 2018 7.20.1 1 says that, when these typedef names are defined, “they shall denote corresponding signed and unsigned types as described in 6.2.5…” Therefore `int8_t` and `uint8_t` are corresponding signed and unsigned types. – Eric Postpischil Aug 10 '21 at 19:53
  • Oh, I missed that in the C standard, thanks for pointing that out. – Fatih BAKIR Aug 10 '21 at 19:57

The problem is not the reinterpretation of data, but the reinterpretation of the pointer. This is problematic for the following, non-exhaustive list of reasons:

  • The standard does not require that all pointers be the same size, so sizeof(float*) does not have to equal sizeof(int*), and the conversion may simply lose data.
  • If you grab a uint32_t* from a float* and read from it, you would be reading a uint32_t that was never created.
  • As you said, compilers assume two pointers of different types (except unsigned char*) never alias, and perform optimizations with this information.
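The last point can be sketched with a small, hypothetical function (`store_both` and `call_with_distinct_objects` are names invented here, not taken from any library). Because an `int*` and a `float*` are not allowed to refer to the same object, an optimizer may return the constant 1 without reloading `*i` after the store through `f`.

```cpp
#include <cassert>

// Writes through both pointers, then reads *i back.
// Under the aliasing rules the compiler may assume that `f` never points
// at the int that `i` points at, so it is allowed to return 1 directly
// instead of reloading *i after the store through f.
inline int store_both(int* i, float* f) {
    *i = 1;
    *f = 0.0f;
    return *i;
}

// Well-defined usage: the two pointers refer to distinct objects.
inline int call_with_distinct_objects() {
    int n = 0;
    float x = 1.0f;
    int result = store_both(&n, &x);
    assert(x == 0.0f);  // the float store happened as written
    return result;
}
```

Passing `reinterpret_cast<float*>(&n)` as the second argument would be the undefined case: the function might return 1 or the reinterpreted zero bits, depending on the compiler and optimization level.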

However, sometimes converting between bit representation of unrelated types is a legit requirement. Traditionally, this has been done using memcpy, but C++20 added std::bit_cast, able to do this reinterpretation even in constexpr, so the following is legal, and expresses our intention directly:

constexpr float pi = 3.14f;
constexpr uint32_t pi_bits = std::bit_cast<uint32_t>(pi);
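For compilers without C++20, the `memcpy` route mentioned above might look like the sketch below, applied to the fast inverse square root from the question. The helper names are hypothetical, the code assumes a 32-bit `float`, and the constant `0x5f3759df` is the one from the widely circulated implementation rather than something derived here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copies the object representation of a float into an integer.
inline std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Copies an integer's object representation back into a float.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Fast inverse square root without pointer casts: one bit-level initial
// guess followed by one Newton-Raphson step. Intended for positive,
// finite x only.
inline float fast_inv_sqrt(float x) {
    std::uint32_t i = float_bits(x);
    i = 0x5f3759dfu - (i >> 1);
    float y = bits_to_float(i);
    return y * (1.5f - 0.5f * x * y * y);
}

// Hypothetical helper: a round trip through the bits preserves the value
// (for any f that is not NaN).
inline void check_round_trip(float f) {
    assert(bits_to_float(float_bits(f)) == f);
}
```

Compilers typically turn these fixed-size `memcpy` calls into a plain register move, so this form should not cost anything compared with the pointer cast.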
Fatih BAKIR
  • _The standard does not require that all pointers be the same size_ true, but they have to be the same size between a signed and the corresponding unsigned type. So `sizeof(int8_t)==sizeof(uint8_t)`. – 12431234123412341234123 Aug 10 '21 at 21:28

Rather than trying to define all of the behaviors necessary to accomplish every plausible task, the authors of the C and C++ Standards instead allow implementations to support various useful behaviors or not, at their leisure, on the presumption that compiler writers will be able to know and support their customers' needs far better than the Committee ever could.

If one is targeting a platform where all pointers are the same size and have the same representation (true of nearly all implementations for current processor and controller designs), one ensures that any pointer used to access an object of a particular type satisfies the platform's alignment requirements for that type (true if the pointer is a multiple of the size of the largest primitive), and one uses a compiler configuration that is specified to support straightforward type punning patterns (e.g. -fno-strict-aliasing on clang or gcc), then type punning code will work as expected on that compiler configuration. Such code will not be portable to all other implementations or configurations, but portability is just one factor upon which the quality of code should be judged. If code will run efficiently and correctly on all C implementations where it will be used, replacing it with code that is slower and/or harder to read purely for purposes of making it "portable" would not be an improvement.

Incidentally, every compiler configuration I've tested either uses an abstraction model that supports useful type-punning constructs beyond those mandated by the Standards, or fails to uphold all of the memory-recycling constructs for which the Standard mandates support. It would be impossible for a compiler to behave as specified in all cases where the Standard defines behavior without also behaving in a fashion consistent with writing and reading object representations in many cases where the Standard imposes no requirements; presumably the authors of the Standard expected compilers to accommodate that difficulty by behaving usefully in more cases than required by the Standard, but when optimizations are enabled, clang and gcc prioritize "optimization" over correctness.

supercat