30

My program receives messages over the network. These messages are deserialized by some middleware (i.e. someone else's code which I cannot change). My program receives objects that look something like this:

struct Message {
    int msg_type;
    std::vector<uint8_t> payload;
};

By examining msg_type I can determine that the message payload is actually, for example, an array of uint16_t values. I would like to read that array without an unnecessary copy.

My first thought was to do this:

const uint16_t* a = reinterpret_cast<uint16_t*>(msg.payload.data());

But then reading through a would appear to violate the standard. Here is clause 3.10, paragraph 10 ([basic.lval]):

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar (as defined in 4.4) to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its elements or nonstatic data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

In this case, *a would be the glvalue, and its type, uint16_t, does not appear to meet any of the listed criteria, because the dynamic type of the underlying objects is uint8_t.

So how do I treat the payload as an array of uint16_t values without invoking undefined behavior or performing an unnecessary copy?

c--
  • 12
    Don't you have to handle endianness anyway ? – Jarod42 Aug 02 '18 at 11:26
  • 2
    It would improve the question to show exactly what the message is like, rather than "Something like". (My answer assumes it is exactly like that) – M.M Aug 02 '18 at 11:31
  • of interest : https://stackoverflow.com/questions/39294503/what-is-the-no-undefined-behavior-way-of-deserializing-an-object-from-a-byte-arr – Sander De Dycker Aug 02 '18 at 11:33
  • 1
    Hint: How do you ensure that `payload.data()` satisfies the alignment requirement of `uint16_t[]`? – Caleth Aug 02 '18 at 12:29
  • 7
    @M.M I disagree: this is a relatively common problem - at least, I've encountered it multiple times on different projects - so I abstracted the essence of the problem. Providing a complete copy of one of the message types would just obscure the point. – c-- Aug 02 '18 at 13:52
  • 1
    @Jarod42 It makes sense to provide two different options depending on endianness: one with a copy where the endiannesses don't match; and a non-copy option where they do match. At least, it would if the answer to this question wasn't just, "You're out of luck." – c-- Aug 02 '18 at 13:55
  • 1
    @Caleth You can use `alignof` to get the required alignment and then check the pointer. If it isn't aligned then you'll need to copy, if it is then a no-copy option could be provided. Except, it appears, there is no standard no-copy option. – c-- Aug 02 '18 at 13:59
  • This may not be a solution you'll find acceptable, but code in C and use unions. See for instance X message types. – jamesqf Aug 02 '18 at 16:35
  • I believe this is well covered in [What is the strict aliasing rule?](https://stackoverflow.com/a/51228315/1708801) – Shafik Yaghmour Aug 03 '18 at 20:21
  • 1
    @jamesqf The question is tagged C++, using unions to type pun in C++ is undefined behavior. – Shafik Yaghmour Aug 03 '18 at 20:25
  • @Shafik Yaghmour: Yes, I understand that. My point is that if one language does not easily support doing what you need to do, then often the easiest & best solution to the problem is to use a different language. – jamesqf Aug 05 '18 at 16:43

4 Answers

16

If you are going to consume the values one by one, you can memcpy each one into a uint16_t, or write payload[0] + 0x100 * payload[1] and so on, depending on which behaviour you want. This will not be "inefficient".
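As a minimal sketch of the byte-by-byte variant, assuming the payload stores each value low byte first (little-endian wire order), a reader could look like the following. The helper name `read_le16` is hypothetical and not part of any middleware API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reads the i-th 16-bit value from a byte buffer,
// assuming little-endian wire order (low byte first).
inline std::uint16_t read_le16(const std::vector<std::uint8_t>& payload, std::size_t i) {
    assert(2 * i + 1 < payload.size());  // both bytes must be in range
    // Each byte is promoted to int before the shift, so no bits are lost.
    return static_cast<std::uint16_t>(payload[2 * i] | (payload[2 * i + 1] << 8));
}
```

Because the value is assembled arithmetically, the result is the same on little-endian and big-endian hosts, and no alignment requirement is placed on the buffer.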

If you have to call a function that only takes an array of uint16_t, and you cannot change the struct that delivers Message, then you are out of luck. In Standard C++ you'll have to make the copy.
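A sketch of what that copy might look like, assuming the payload length is a multiple of two and the values are already in the host's native byte order. The name `copy_as_u16` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies the payload bytes into properly typed and
// properly aligned storage. Values keep the host's native byte order.
inline std::vector<std::uint16_t> copy_as_u16(const std::vector<std::uint8_t>& payload) {
    assert(payload.size() % sizeof(std::uint16_t) == 0);  // whole values only
    std::vector<std::uint16_t> out(payload.size() / sizeof(std::uint16_t));
    if (!out.empty()) {
        // memcpy is the well-defined way to move the object representation.
        std::memcpy(out.data(), payload.data(), payload.size());
    }
    return out;
}
```

The resulting vector's `data()` can then be passed to a function that expects a `uint16_t` array, at the cost of one allocation and one bulk copy.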

If you are using gcc or clang, another option is to set -fno-strict-aliasing while compiling the code in question.

M.M
  • 1
    @bolov `launder` doesn't get around the strict aliasing rule. It only declares that the object is reachable through the pointer, if you've obtained the pointer by shenanigans. [Related question](https://stackoverflow.com/questions/51204362/stdlaunder-and-strict-aliasing-rule) – M.M Aug 02 '18 at 11:33
  • 2
    Thank you. Your second paragraph - specifically, "In Standard C++ you'll have to make the copy" - is what I wanted to know. – c-- Aug 02 '18 at 13:44
  • 1. Endianness. 2. Alignment issues. Some processors gag when they access 16 bits at an odd address. If a protocol is big-endian and the processor is little-endian, the values will be wrong. – Thomas Matthews Aug 02 '18 at 13:51
  • 1
    @ThomasMatthews But both endianness and alignment can be checked. If the pointer is correctly aligned and the endianness matches, a no-copy option makes sense. – c-- Aug 02 '18 at 14:01
  • 4
    There is a proposal (not yet accepted) that would allow the zero-copy version to work: [Implicit creation of objects for low-level object manipulation](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0593r1.html) – SJL Aug 02 '18 at 15:50
  • 2
    @SJL You linked to an old version, use [this format](http://wg21.link/p0593) to get latest version. As I understand it, the `reinterpret_cast` can only work for a `new`'d block of characters that hasn't been "imprinted" with anything yet; it can't be used on a `vector` of characters passed by value. So the code would have to be `uint16_t *p = std::launder(payload.data()); for (int i = 0; i < payload.size() / 2; ++i) std::bless( p + i );` , which is a lot of boilerplate to "do nothing" as it were. I imagine there will be further iterations of this proposal – M.M Aug 02 '18 at 21:19
  • @M.M `std::bless(payload.data(), payload.size()); uint16_t* p = std::launder((uint16_t*) payload.data());` – T.C. Aug 02 '18 at 22:33
  • @T.C. Is there any reason why `bless` couldn't imply `launder` (and/or `launder` imply `bless`) ? – M.M Aug 02 '18 at 23:01
15

If you want to strictly follow the C++ Standard, without UB and without relying on non-standard compiler extensions, you can try:

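#include <cstdint>
#include <cstring>

// Reads the i-th uint16_t from the payload in the host's native byte order.
uint16_t getMessageAt(const Message& msg, size_t i) {
   uint16_t tmp;
   std::memcpy(&tmp, msg.payload.data() + sizeof(tmp) * i, sizeof(tmp));
   return tmp;
}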

Compiler optimizations should eliminate the actual memcpy call in the generated machine code; see, e.g., Type Punning, Strict Aliasing, and Optimization.

There is, in fact, copying into the return value, but depending on what you will do with it, this copy can be optimized away as well (e.g., this value can be loaded into a register and used only there).
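To illustrate the point about the temporary staying in a register, here is a sketch of a consumer that walks the whole payload with a per-element `memcpy`. The function name `sum_values` is hypothetical, and the `Message` struct is repeated only to keep the example self-contained. Whether a given compiler actually reduces each `memcpy` to a single load is an expectation, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Message {
    int msg_type;
    std::vector<std::uint8_t> payload;
};

// Hypothetical consumer: sums all 16-bit values in the payload,
// interpreting them in the host's native byte order.
inline std::uint32_t sum_values(const Message& msg) {
    assert(msg.payload.size() % sizeof(std::uint16_t) == 0);  // whole values only
    std::uint32_t total = 0;
    const std::size_t count = msg.payload.size() / sizeof(std::uint16_t);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint16_t tmp;
        // Well-defined read of two bytes; no aliasing or alignment assumptions.
        std::memcpy(&tmp, msg.payload.data() + i * sizeof tmp, sizeof tmp);
        total += tmp;
    }
    return total;
}
```

The loop never forms a `uint16_t*` into the byte buffer, so the strict aliasing rule is not involved at all.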

Daniel Langr
  • the `uint16_t` return type is a typo or is intentional? OP was talking about `uint16_t *`. – PaperBirdMaster Aug 02 '18 at 11:55
  • 4
    @Paula_plus_plus OP wanted to read `uint16_t` values, and their proposed method was to read the values by creating a pointer and dereferencing it. This answer reads `uint16_t` values by a different method, namely calling this function to read each value. – M.M Aug 02 '18 at 11:56
  • Oh! I get it! Thanks for the details! – PaperBirdMaster Aug 02 '18 at 12:03
  • 10
    in C++20, [`std::bit_cast`](https://en.cppreference.com/w/cpp/numeric/bit_cast) should do the job. – Jarod42 Aug 02 '18 at 12:26
  • awesome article you link there. I read the same hundreds of times before but this was the first time I was really convinced – 463035818_is_not_an_ai Aug 02 '18 at 13:14
  • 1
    Thanks. I am familiar with using `memcpy` for individual values, as you have suggested, but I really want to operate on the whole array without copying. – c-- Aug 02 '18 at 13:47
  • @Paula_plus_plus You are right: I want to operate on the array, not individual values. In particular, I'd like to pass it to a function that takes an array of `uint16_t`. – c-- Aug 02 '18 at 13:49
  • @Jarod42 yes I cover both memcopy and bitcast [in my answer here](https://stackoverflow.com/a/51228315/1708801) – Shafik Yaghmour Aug 03 '18 at 20:26
2

If you want to be strictly correct, as the standard that you quoted says, you can't. If you want behavior to be well defined, you will need to make the copy.

If the code is meant to be portable, you will need to handle endianness either way, and reconstruct your uint16_t values from individual uint8_t bytes, and this by definition requires a copy.

If you really know what you're doing, you can ignore the standard, and just do the reinterpret_cast that you described.

GCC and clang support -fno-strict-aliasing, which prevents the aliasing-based optimizations that would otherwise generate broken code. As far as I'm aware, at the time of this writing the Visual Studio compiler has no such flag and never performs this kind of optimization, unless you use __declspec(restrict) or __restrict.

divinas
-3

Your code may not be UB (or may be borderline, depending on the reader's sensibility) if, for example, the vector data had been built this way:

Message make_array_message(uint16_t* x, size_t n){
 Message m;
 m.type = types::uint16_t_array;
 m.payload.reserve(sizeof(uint16_t)*n);
 std::copy(x,x+n,reinterpret_cast<uint16_t*>(m.payload.data()));
 return m;
 }

In this code the vector's data hold a sequence of uint16_t even if it is declared as uint8_t. So accessing the data using this pointer:

const uint16_t* a = reinterpret_cast<uint16_t*>(msg.payload.data());

is perfectly fine. But accessing the vector's data as uint8_t would be UB. Accessing a[1] would work on all compilers, but it is UB in the current standard. This is arguably a defect in the standard, and the C++ standardization committee is working to fix it; see P0593, Implicit creation of objects for low-level object manipulation.

As of now, in my own code, I do not deal with defects in the standard. I prefer to follow compiler behavior because, on this subject, it is coders and compilers that make the rules, and the standard will just follow!

Oliv
  • 1
    1) That `copy` doesn't do what you think it does. 2) You have no guarantee that `msg.payload.data()` is properly aligned. 3) `Message void`. – T.C. Aug 02 '18 at 16:27
  • This is essentially no different to the "answers" that use `aligned_storage` , you are just using non-object storage off the end of a vector. Even with P0593,I am not sure if `return m;` is guaranteed to preserve the data you wrote there. (You certainly don't actually add elements to the vector at any stage) – M.M Aug 02 '18 at 21:04
  • @M.M Ah, I missed the `reserve`. That's completely broken. – T.C. Aug 02 '18 at 22:30
  • Even if you fix the `reserve` to `resize`, this is still wrong. The vector's data will hold the bytes of a sequence of `uint16_t`, but there are no `uint16_t` objects there, so accessing it via `uint16_t` l-values is UB. – Martin Bonner supports Monica Aug 03 '18 at 07:53
  • @T.C. Haha!! 1) I do know!! copy will invoke memcpy. 2) msg.payload.data() will be at least as aligned as max_align_t, 3) fixed – Oliv Aug 03 '18 at 14:15
  • @M.M. return m is noop. – Oliv Aug 03 '18 at 14:16
  • @MartinBonner, the vector does hold uint16_t objects but not int8_t. See basic lifetime. The fact that the vector holds a pointer of type int8_t* does not change this fact. – Oliv Aug 03 '18 at 14:17
  • Think again. That copy is not a memcpy. It's a series of assignment with truncation. And no, `msg.payload.data()` has no alignment guarantee. Neither `std::allocator::allocate` nor `vector` is required to give you the pointer returned from `::operator new` unadulterated. – T.C. Aug 03 '18 at 14:21
  • @T.C. Obviously this answer is not limited to what the standard ensures. As I said even a[1]=0 is UB according to the standard! std::copy actually invokes memcpy in stdlib++ for trivially copyable types. For libstdc++ and libc++ since the last time I read the code, data is not adulterated. For MSVC++, I just know they are working hard to make it able to compile standard c++ code. – Oliv Aug 03 '18 at 14:31
  • No, this `copy` is not a memcpy. The source and destination types don't match. I'm done here. – T.C. Aug 03 '18 at 14:34
  • @T.C. Right, I did not see it! – Oliv Aug 03 '18 at 14:36