6

Suppose we take a very big array of unsigned chars.

std::array<uint8_t, 100500> blob;
// ... fill array ...

(Note: it is aligned already, question is not about alignment.) Then we take it as uint64_t[] and trying to access it:

const auto ptr = reinterpret_cast<const uint64_t*>(blob.data());
std::cout << ptr[7] << std::endl;

Casting to uint64_t and then reading from it looks suspicious as for me.

But UBsan, -Wstrict-aliasing is not triggering about it. Google uses this technique in FlatBuffers. Also, Cap'n'Proto uses this too.

Is it undefined behavior?

vladon
  • 8,158
  • 2
  • 47
  • 91
  • 2
    Formally yes. Practically you'd have trouble finding an implementation that won't do "what's expected". – StoryTeller - Unslander Monica Jan 22 '18 at 08:36
  • 1
    The cast is not UB, but the expression `ptr[7]` is a strict alising violation – M.M Jan 22 '18 at 08:43
  • 1
    Imagine you are coding the optimiser. You know that by the standard 2 unrelated pointers cannot point to the same area of memory. So you might optimise out "redundant" reads, as writes on the other pointer cannot change the value being read. This is why UB is so dangerous as future compiler optimisations will expose them. – Richard Critten Jan 22 '18 at 10:21
  • 1
    The fact is the standard is broken. The optimizer guys propped the strict aliasing in there without a way to bypass it, thereby breaking the very essence of why people code in C++ to begin with. That is also why the Linux kernel is [built](https://stackoverflow.com/questions/2958633/gcc-strict-aliasing-and-horror-stories) with -fno-strict-aliasing, and why compilers are dialing down their reliance on strict aliasing instead of expanding it. – rustyx Jan 22 '18 at 11:13
  • 1
    Shameless plug: I did a [lightning talk about this at Meeting C++ 2017](https://www.youtube.com/watch?v=_8vMAkCp0Rc). – ComicSansMS Jan 22 '18 at 11:45
  • @RichardCritten except here when one of those pointers is to `uint8_t` i.e. `unsigned char` which can point to any area of memory in which case an optimiser couldn't make that assumption without proving that no code anywhere can have reused the storage. – eerorika Jan 22 '18 at 11:49
  • 4
    @RustyX I would say the kernel is built with -fno-strict-aliasing because they know the code is full of strict aliasing violations – M.M Jan 22 '18 at 11:55

2 Answers2

6

You cannot access an unsigned char object value through a glvalue of an other type. But the opposite is authorized, you can access the value of any object through an unsigned char glvalue [basic.lval]:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [...]

  • a char, unsigned char, or std​::​byte type.

So, to be 100% standard compliant, the idea is to reverse the reinterpret_cast:

uint64_t i;
std::memcpy(&i, blob.data() + 7*sizeof(uint64_t), sizeof(uint64_t));
std::cout << i << std::endl;

And it will produces the exact same assembly.

Community
  • 1
  • 1
Oliv
  • 17,610
  • 1
  • 29
  • 72
  • The code is not portable to systems where byte isn't 8 bits. On such systems `sizeof(uint64_t) != 8`. Casting the first argument of `memcpy` is redundant, is it not? – eerorika Jan 22 '18 at 11:55
  • @Oliv By the way, the standard does not guarantee that `uint8_t` (which is, moreover, optional) is an alias for `unsigned char`. Therefore, it should be pointed out that `unsigned char` should be used instead of `uint8_t` for `blob` (UPDATE: comment fix applied - was misreading things). – Arne Vogel Jan 22 '18 at 12:08
  • 1
    @user2079303 I did not wanted to make the code too difficult to read, I would have distract attention about the subject but ... – Oliv Jan 22 '18 at 13:20
  • @ArneVogel On the other hand, I had not noticed he was using uint8_t since the question was about a reinterpret_cast from unsigned char. So there are more troubles in my piece of codes. I must update it. – Oliv Jan 22 '18 at 13:21
  • @user2079303 It is not redundant, there can not be two objects with the different type (with exceptions to this rule) at the same address during their overlaping lifetime periods. – Oliv Jan 22 '18 at 13:28
  • @Oliv I don't see how there being only one object would make the cast necessary or useful. It has no effect on how `memcpy` behaves, has it? – eerorika Jan 22 '18 at 13:36
  • @user2079303 Indeed! I make a new correction: i remove last correction. – Oliv Jan 22 '18 at 13:50
1

The cast itself is well defined (a reinterpret_cast never has UB), but the lvalue to rvalue conversion in expression "ptr[7]" would be UB if no uint64_t object has been constructed in that address.

As "// ... fill array ..." is not shown, there could have been constructed a uint64_t object in that address (assuming as you say, the address has sufficient alignment):

const uint64_t* p = new (blob.data() + 7 * sizeof(uint64_t)) uint64_t();

If a uint64_t object has been constructed in that address, then the code in question has well defined behaviour.

eerorika
  • 232,697
  • 12
  • 197
  • 326
  • No this is wrong, your object has an undeterminate value. – Oliv Jan 22 '18 at 10:45
  • @Oliv The value is never read, only the address, but that might be a type too(?) – underscore_d Jan 22 '18 at 10:45
  • `std::cout<

    – Oliv Jan 22 '18 at 10:46
  • 1
    @Oliv it's determinate value. The last parenthesis mean value initialization. – eerorika Jan 22 '18 at 10:46
  • @Oliv Why? The pointer was returned by `new`, so it must be a valid address. And as mentioned, what it points to has a well-defined value too. – underscore_d Jan 22 '18 at 10:47
  • @user2079303 Read this sentence [\[dlc.init\]](https://timsong-cpp.github.io/cppwp/n4659/dcl.init#7.3) – Oliv Jan 22 '18 at 10:48
  • @Oliv How about you read the Standard? That is not default initialisation; it's value/zero initialisation. – underscore_d Jan 22 '18 at 10:49
  • 1
    @Oliv that's irrelevant rule since it pertains to default initialization. This is value initialization: https://timsong-cpp.github.io/cppwp/n4659/dcl.init#8.4 – eerorika Jan 22 '18 at 10:49
  • @underscore_d What you think is in conflict with [intro.object}(https://timsong-cpp.github.io/cppwp/n4659/intro.object#8) – Oliv Jan 22 '18 at 10:49
  • @Oliv Wow, that's even more irrelevant than everything else you already said. – underscore_d Jan 22 '18 at 10:50
  • 1
    Also for completeness, the rule that says why this is value initialization rather than default initialization: https://timsong-cpp.github.io/cppwp/n4659/dcl.init#11 – eerorika Jan 22 '18 at 10:50
  • @user2079303 So *p is zero initialized https://timsong-cpp.github.io/cppwp/n4659/dcl.init#8.4 – Oliv Jan 22 '18 at 10:51
  • `*p==0`. I did not see he had done such an even bigger mistake. Had he not add the () the object would not have been initialized the behavior would have been UB but it will probably have produced the right output. But here it will just produce 0. – Oliv Jan 22 '18 at 10:52
  • @underscore_d `*p` is what the pointer points to. It is indeed zero initialized. @Oliv I fixed the code to print the zero rather than the pointer. Zero is indeed the right output. – eerorika Jan 22 '18 at 10:53
  • @Oliv how so? The code in question is well defined if `uint64_t` has been constructed with placement new into that address. Otherwise the code in question is UB. Do you disagree with my answer? – eerorika Jan 22 '18 at 10:56
  • @user2079303 But look at the question, he does not want to output 0. He wants to read the array of unsigned char as if it contained the representation of an array of uint64_t – Oliv Jan 22 '18 at 10:57
  • @Oliv and I've shown him how it can be achieved: by constructing `uint64_t` object(s) into that memory. It's unknown what he wants as output. – eerorika Jan 22 '18 at 10:59
  • @Oliv it's fairly clear to me too. What he is suspicious about is either UB or DB (defined behaviour) depending on whether the construction that I've shown has been done. – eerorika Jan 22 '18 at 11:09
  • 1
    @user2079303 He is talking about reading a value, not reusing storage. There is no point in continuing this discussion, its motivation is without any object but for you. – Oliv Jan 22 '18 at 11:12
  • he could have constructed an uint64_t object and assigned a value to it, or am i wrong? or even fill it by memcpy. the point is, if there is a uint64_t object created once and never destroyed, the code in question is well defined – phön Jan 22 '18 at 11:30