1

Let's consider the following (simplified) code for reading contents of a binary file:

struct Header
{
    char signature[8];
    uint32_t version;
    uint32_t numberOfSomeChunks;
    uint32_t numberOfSomeOtherChunks;
};

void readFile(std::istream& stream)
{
    // find total size of the file, in bytes:
    stream.seekg(0, std::ios::end);
    const std::size_t totalSize = stream.tellg();

    // allocate enough memory and read entire file
    std::unique_ptr<std::byte[]> fileBuf = std::make_unique<std::byte[]>(totalSize);
    stream.seekg(0);
    stream.read(reinterpret_cast<char*>(fileBuf.get()), totalSize);

    // get the header and do something with it:
    const Header* hdr = reinterpret_cast<const Header*>(fileBuf.get());

    if(hdr->version != expectedVersion) // <- Potential UB?
    {
        // report the error
    }

    // and so on...
}

The way I see this, the following line:

if(hdr->version != expectedVersion) // <- Potential UB?

contains undefined behavior: we're reading version member of type uint32_t which is overlaid on top of an array of std::byte objects, and compiler is free to assume that uint32_t object does not alias anything else.

The question is: is my interpretation correct? If yes, what can be done to fix this code? If no, why there's no UB here?

Note 1: I understand the purpose of the strict aliasing rule (allowing compiler to avoid unnecessary loads from memory). Also, I know that in this case using std::memcpy would be a safe solution - but using std::memcpy would mean that we have to do additional memory allocations (on stack, or on heap if size of an object is not known).

joe_chip
  • 2,468
  • 1
  • 12
  • 23
  • 1
    [It is fine](https://stackoverflow.com/questions/43551665/has-a-stdbyte-pointer-the-same-aliasing-implications-as-char). – Hans Passant Jan 23 '19 at 23:17
  • 1
    @Hans Passant it would be fine in the opposite case (accessing `Header` through `std::byte*`). But here, I think the rules you've linked to do not apply. – joe_chip Jan 23 '19 at 23:22
  • @curiousguy The memory is allocated with `new` and is therefore aligned to maximum native alignment. `Header` is not over-aligned, so there is no problem. Aliasing is still a problem as joe points out. – eerorika Jan 23 '19 at 23:31
  • I forgot that an array of bytes allocated that was guaranteed to be aligned unlike a variable with type array of bytes. – curiousguy Jan 23 '19 at 23:32
  • It is unlikely to be a problem with the specific header shown, but in general reading directly from a file into a struct is also unsafe because there might be padding in between the members of the struct that isn't part of the file format. – zwol Jan 24 '19 at 13:15

2 Answers2

3

The question is: is my interpretation correct?

Yes.

If yes, what can be done to fix this code?

You already know that memcpy is a solution. You can however skip memcpy and extra memory allocation by reading directly onto the header object:

Header h;
stream.read(reinterpret_cast<char*>(&h), sizeof h);

Note that reading binary file this way means that the integer representation of the file must match the representation of the CPU. This means that the file is not portable to systems with differing CPU architecture.

eerorika
  • 232,697
  • 12
  • 197
  • 326
  • "*This means that the file is not portable to systems with differing CPU architecture.*" - that is why you should define which endian is used inside of the file, and then swap the endian of the fields when reading/writing them to/from memory as needed. The OP's original code would have suffered from this same issue anyway. – Remy Lebeau Jan 23 '19 at 23:26
  • @RemyLebeau Indeed my suggestion does the same as OP's original program attempts to do, except in a well defined manner. – eerorika Jan 23 '19 at 23:29
  • One interpretation of modern high level (and crazy) C++ is that you can't do pointer arithmetic here as you don't have an array of bytes only one object. – curiousguy Jan 23 '19 at 23:37
  • @curiousguy there is no problem with pointer arithmetic here. And you can perform pointer arithmetic on a pointer to an object, not just a pointer to bytes. – Remy Lebeau Jan 24 '19 at 00:25
  • @RemyLebeau According to some ppl, there is a problem with arithmetic on a pointer to bytes where **you have not declared or created (with operator new) an array of bytes**: which array are iterating with pointer arithmetic? – curiousguy Jan 24 '19 at 03:00
  • @curiousguy I don't have an exact quote, but the C++ standard allows a pointer-to-object to be casted to a pointer-to-char for the purpose of accessing the object's raw bytes, which includes pointer arithmetic of those bytes. – Remy Lebeau Jan 24 '19 at 17:42
  • "_This means that the file is not portable to systems_" not good for long term storage, but good for pipes/Unix sockets between processes – curiousguy Jan 26 '19 at 05:47
0

what can be done to fix this code?

Wait until http://wg21.link/P0593 or something similar allowing implicit object creation in arrays of char/unsigned char/std::byte is accepted.

Language Lawyer
  • 3,378
  • 1
  • 12
  • 29
  • As far as I can tell, P0593 would be relevant for reading the representation of a `Header` into a storage area allocated with `malloc` or any other function that creates _no_ objects in the storage area, but `std::make_unique` creates N `std::byte` (aka `unsigned char`) objects in the storage area, so the strict aliasing rules _are_ violated. – zwol Jan 24 '19 at 21:00
  • @zwol The case very similar to OPs case is a motivation for P0593 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0593r2.html#objects-provided-as-byte-representation. And P0593 proposes to allow objects appear in a `std::byte` arrays: "_We propose that at minimum the following operations be specified as implicitly creating objects: Creation of an array of char, unsigned char, or std::byte implicitly creates objects within that array._" – Language Lawyer Jan 24 '19 at 21:07