
To avoid copying large amounts of data, it is desirable to mmap a binary file and process the raw data directly. This approach has several advantages, including delegating paging to the operating system. Unfortunately, it is my understanding that the obvious implementation leads to Undefined Behavior (UB).

My use case is as follows: Create a binary file that contains some header identifying the format and providing metadata (in this case simply the number of double values). The remainder of the file contains raw binary values which I wish to process without having to first copy the file into a local buffer (that's why I'm memory-mapping the file in the first place). The program below is a full (if simple) example (I believe that all places marked as UB[X] lead to UB):

// C++ Standard Library
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>   // std::next
#include <numeric>
#include <stdexcept>  // std::runtime_error

// POSIX Library (for mmap)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr char MAGIC[8] = {"1234567"};

struct Header {
  char          magic[sizeof(MAGIC)] = {'\0'};
  std::uint64_t size                 = {0};
};
static_assert(sizeof(Header) == 16, "Header size should be 16 bytes");
static_assert(alignof(Header) == 8, "Header alignment should be 8 bytes");

void write_binary_data(const char* filename) {
  Header header;
  std::copy_n(MAGIC, sizeof(MAGIC), header.magic);
  header.size = 100u;

  std::ofstream fp(filename, std::ios::out | std::ios::binary);
  fp.write(reinterpret_cast<const char*>(&header), sizeof(Header));
  for (auto k = 0u; k < header.size; ++k) {
    double value = static_cast<double>(k);
    fp.write(reinterpret_cast<const char*>(&value), sizeof(double));
  }
}

double read_binary_data(const char* filename) {
  // POSIX mmap API
  auto        fp = ::open(filename, O_RDONLY);
  struct stat sb;
  ::fstat(fp, &sb);
  auto data = static_cast<char*>(
      ::mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fp, 0));
  ::close(fp);
  // end of POSIX mmap API (all error handling omitted)

  // UB1
  const auto header = reinterpret_cast<const Header*>(data);

  // UB2
  if (!std::equal(MAGIC, MAGIC + sizeof(MAGIC), header->magic)) {
    throw std::runtime_error("Magic word mismatch");
  }

  // UB3
  auto beg = reinterpret_cast<const double*>(data + sizeof(Header));

  // UB4
  auto end = std::next(beg, header->size);

  // UB5
  auto sum = std::accumulate(beg, end, double{0});

  ::munmap(data, sb.st_size);

  return sum;
}

int main() {
  const double expected = 4950.0;
  write_binary_data("test-data.bin");

  if (auto sum = read_binary_data("test-data.bin"); sum == expected) {
    std::cout << "as expected, sum is: " << sum << "\n";
  } else {
    std::cout << "error\n";
  }
}

Compile and run as:

$ clang++ example.cpp -std=c++17 -Wall -Wextra -O3 -march=native
$ ./a.out
as expected, sum is: 4950

In real life, the actual binary format is much more complicated but retains the same properties: Fundamental types stored in a binary file with proper alignment.
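Since the whole scheme relies on the payload being properly aligned inside the mapping, that assumption can at least be checked at run time. The helpers below are a minimal sketch; `is_aligned_for` and `payload_begin` are hypothetical names, and the check assumes the usual flat address model in which converting a pointer to `std::uintptr_t` yields its numeric address. It verifies alignment only and says nothing about object lifetime or aliasing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is suitably aligned for an object of type T.
// Assumes uintptr_t conversion yields the numeric address (true on mainstream
// platforms, but implementation-defined in general).
template <typename T>
bool is_aligned_for(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Hypothetical helper: address of the payload that follows a header of
// `header_size` bytes. In debug builds it asserts that a T could live there.
template <typename T>
const unsigned char* payload_begin(const unsigned char* base,
                                   std::size_t header_size) {
  assert(is_aligned_for<T>(base + header_size));
  return base + header_size;
}
```

For the file layout above, `mmap` returns a page-aligned address, so a 16-byte header leaves the `double` values 8-byte aligned.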

My question is: how do you deal with this use case?

I have found many answers that I perceive as conflicting.

Some answers state unequivocally that one should build the objects locally. This may very well be the case but severely complicates any array-oriented operations.
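For reference, building the objects locally does not have to mean copying the whole file. A minimal sketch, assuming the payload is a contiguous run of `double` values in the host's native representation: copy one bounded block at a time into a local array with `std::memcpy` and run the array-oriented algorithm on that block. The name `sum_doubles_chunked` and the block size are illustrative choices, not part of the original program.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <numeric>

// Sums `count` doubles stored as raw bytes starting at `bytes`.
// At most kChunk values are copied into a local array at a time, so the
// mapped bytes are never accessed through a double lvalue.
double sum_doubles_chunked(const unsigned char* bytes, std::size_t count) {
  assert(bytes != nullptr || count == 0);
  constexpr std::size_t kChunk = 512;  // illustrative block size
  std::array<double, kChunk> local;
  double sum = 0.0;
  for (std::size_t done = 0; done < count;) {
    const std::size_t n = std::min(kChunk, count - done);
    std::memcpy(local.data(), bytes + done * sizeof(double),
                n * sizeof(double));
    sum = std::accumulate(local.data(), local.data() + n, sum);
    done += n;
  }
  return sum;
}
```

In `read_binary_data` this would be called on `data + sizeof(Header)` after the header itself has been copied out the same way. Whether the compiler elides the block copies is not guaranteed.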

Comments elsewhere seem to agree on the UB nature of this construct but there are some disagreements.

The wording in cppreference is, at least to me, confusing. I would have interpreted it as "what I'm doing is perfectly legal". Specifically this paragraph:

Whenever an attempt is made to read or modify the stored value of an object of type DynamicType through a glvalue of type AliasedType, the behavior is undefined unless one of the following is true:

  • AliasedType and DynamicType are similar.
  • AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType.
  • AliasedType is std::byte, (since C++17)char, or unsigned char: this permits examination of the object representation of any object as an array of bytes.
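To make the direction of the third bullet concrete, here is a minimal sketch of what it does permit: reading the bytes of an existing object through an `unsigned char` glvalue. The helper names `byte_of` and `print_bytes` are hypothetical. The example goes from a typed object to bytes; it does not show that a byte buffer may be read through a `double` or `Header` glvalue, which is the direction the mmap code needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Returns byte `i` of the object representation of `value`, read through an
// unsigned char glvalue (the case the third bullet allows).
template <typename T>
unsigned char byte_of(const T& value, std::size_t i) {
  assert(i < sizeof(T));
  return reinterpret_cast<const unsigned char*>(&value)[i];
}

// Prints the object representation of `value` as hex bytes, in memory order.
template <typename T>
void print_bytes(const T& value) {
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    std::printf("%02x ", static_cast<unsigned>(byte_of(value, i)));
  }
  std::printf("\n");
}
```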

It may be that C++17 offers some hope with std::launder or that I'll have to wait until C++20 for something along the lines of std::bit_cast.

In the meantime, how do you deal with this issue?

Link to on-line demo: https://onlinegdb.com/rk_xnlRUV

Simplified example in C

Is my understanding correct that the following C program does not exhibit Undefined Behavior? I understand that pointer casting through a char buffer does not participate in the strict aliasing rules.

#include <stdint.h>
#include <stdio.h>

struct Header {
  char     magic[8];
  uint64_t size;
};

static void process(const char* buffer) {
  const struct Header* h = (const struct Header*)(buffer);
  printf("reading %llu values from buffer\n", (unsigned long long)h->size);
}

int main(int argc, char* argv[]) {
  if (argc != 2) {
    return 1;
  }
  // In practice, I'd pass the buffer through mmap
  FILE* fp = fopen(argv[1], "rb");
  char  buffer[sizeof(struct Header)];
  fread(buffer, sizeof(struct Header), 1, fp);
  fclose(fp);
  process(buffer);
}

I can compile and run this C code, passing it the file created by the original C++ program, and it works as expected:

$ clang struct.c -std=c11 -Wall -Wextra -O3 -march=native
$ ./a.out test-data.bin 
reading 100 values from buffer
Escualo
  • `std::bit_cast` doesn't seem to be useful for this case. – eerorika Mar 07 '19 at 02:03
  • The standard doesn't say anything about `mmap` (especially not what the dynamic type is of objects stored in there, or in fact whether there are any objects at all) so you are really in the territory of decisions made by the compiler vendor. I think a sensible approach is to assume that the mmap'd data contains the same objects that were written and write your code accordingly – M.M Mar 07 '19 at 02:18

1 Answer


std::launder solves the problem with strict aliasing, but not with object lifetime.

std::bit_cast makes a copy (it's basically a wrapper for std::memcpy) and doesn't work with copying from a range of bytes.
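As an illustration of the copy-based alternative for a range of bytes, here is a minimal C++17 sketch. `load_as` is a hypothetical helper, not a standard function, and `Header` mirrors the struct from the question without its default member initializers. It copies `sizeof(T)` bytes into a local object with `std::memcpy`, so the source needs no particular alignment and never has to contain a live `T`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Mirrors the question's header layout (assumed: 8-byte magic, 64-bit count).
struct Header {
  char          magic[8];
  std::uint64_t size;
};

// Hypothetical helper: builds a T from the first sizeof(T) bytes at `bytes`.
// `available` is the number of readable bytes, checked in debug builds only.
template <typename T>
T load_as(const unsigned char* bytes, std::size_t available) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy is only meaningful for trivially copyable types");
  assert(available >= sizeof(T));
  (void)available;  // silence the unused warning when NDEBUG is defined
  T value;
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}
```

With the question's mapping this might be used as `load_as<Header>(reinterpret_cast<const unsigned char*>(data), sb.st_size)`. For a 16-byte header the copy is negligible; the open problem is the large array that follows it.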

There is no tool in standard C++ to reinterpret mapped memory without copying. Such a tool has been proposed: std::bless. Until/unless such changes are adopted into the standard, you'll have to either hope that UB doesn't break anything, take the potential†† performance hit and copy, or write the program in C.

While not ideal, this is not necessarily as bad as it sounds. You're already restricting portability by using mmap, and if your target system / compiler promises that it is OK to reinterpret mmapped memory (perhaps with laundering), then there should be no problem. That said, I don't know whether, say, GCC on Linux gives such a guarantee.

†† The compiler may optimise std::memcpy away. There might not be any performance hit involved. There's a handy function in this SO answer which was observed to be optimised away, but does initiate object lifetime following the language rules. It does have a limitation: the mapped memory must be writable (as it creates objects in the memory, and in a non-optimised build it might do an actual copy).
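For illustration, a function of that kind might look like the sketch below. This is not necessarily the function from the linked answer, and `start_lifetime_as_sketch` is a hypothetical name. It rests on an assumption worth stating loudly: as I understand it, the implicit object creation rules adopted for C++20 say that `std::memmove` implicitly creates objects in its destination, so moving the bytes onto themselves keeps the values while starting the lifetime of a `T` there. It needs C++17 for `std::launder`, and it writes to the memory, so it cannot be used on a `PROT_READ` mapping like the one in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>
#include <type_traits>

// Sketch only: relies on memmove implicitly creating objects in its
// destination (assumed per the C++20 implicit-object-creation rules).
// `p` must point to writable, suitably aligned storage of >= sizeof(T) bytes.
template <typename T>
T* start_lifetime_as_sketch(void* p) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only trivially copyable types can be treated this way");
  assert(reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0);
  // Copy the bytes onto themselves, then launder the pointer so that it
  // refers to the T object now living at that address.
  return std::launder(static_cast<T*>(std::memmove(p, p, sizeof(T))));
}
```

An optimising compiler may well remove the self-move entirely, but that is an observation about code generation, not something the language promises.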

eerorika
  • The link to the description of std::bless is excellent. It describes my exact problem. – Escualo Mar 07 '19 at 02:29
  • Yep, basically. In short, you're right, and you'll have to just deal with it, until you get this new weird feature that hacks/patches this C++ flaw... but it's okay because everybody's already been doing that for the past 35 years :) – Lightness Races in Orbit Mar 07 '19 at 02:36
  • I think the Standards Committee expected that implementations claiming to be suitable for low-level programming would process code "in a documented fashion characteristic of the environment" in cases where that would be necessary to uphold the Spirit of C principle the Committee described as ("Don't prevent the programmer from doing what needs to be done"), but unfortunately some compilers make it difficult to make use of platform behaviors except by disabling many optimizations wholesale. – supercat Mar 07 '19 at 15:34
  • @eerorika You mention that a potential alternative is to write the offending code in C. I added a snippet to my original post --- is my understanding correct that it does not exhibit UB? Thanks again for your answer. – Escualo Mar 07 '19 at 20:37
  • @Escualo I do spot one cause for UB: You haven't aligned `buffer`, so it is not guaranteed to have the alignment required by `Header`. I don't know whether it is free of strict aliasing violation. I know C++ better than C. – eerorika Mar 07 '19 at 21:07
  • After further research, I think the C example does violate strict aliasing. One can inspect the bits of a struct by aliasing to char, but doing it the other way around seems to be a violation. – Escualo Mar 07 '19 at 21:54
  • @Escualo: The intended purpose of aliasing rules is to say "*Even if there is no discernible relationship between some object and an lvalue*, a compiler must allow for the possibility of the lvalue accessing this object if it has any of these types". They were never intended to invite compilers to ignore visible relationships among lvalues. Given `struct S {int arr[10];} s;`, the rules don't require that compilers recognize the possibility that an access to a dereferenced `int*` might access `s` because the authors thought that in cases where such accesses would be necessary... – supercat Mar 11 '19 at 15:31
  • ...the `int*` would typically be freshly formed in a way that a compiler could recognize as being related to a `struct S` (e.g. via an expression like `s.arr[i]`, or by passing `s.arr` to a function that accesses the storage only via that pointer). In the original C example, any compiler that isn't being deliberately blind should be able to see that `process` is receiving a pointer to `buffer`, casting it, and then performing accesses using the pointer formed from that cast, and recognize the possibility of such actions accessing `buffer`. As to whether willfully-blind compilers will... – supercat Mar 11 '19 at 15:38
  • ...reliably process such code, that's anybody's guess. – supercat Mar 11 '19 at 15:39
  • @supercat an `int*` is allowed to alias `s` in that example; you're allowed to form and pass pointers to subobjects. And @Escualo is right on the UB, which seems fixable by allocating the buffer using malloc. Compilers can well be "blind" because they aren't required to inline `process` or do the reasoning you describe - moving process to a separate compilation unit would help "blindness" but can't introduce UB. – Blaisorblade Jun 09 '21 at 22:17
  • @Blaisorblade: Obviously one *should* be able to form and pass pointers to sub-objects, but N1570 6.5p7 lists the types that may be used to access a structure, and its member types are not listed among them. Compilers should only need to "see" pointer derivations *in cases where they would otherwise exploit the lack of derivation*. If code in one compilation unit converts a `T1*` to `void*` and passes it to code in another unit, which converts it to `T2*` and accesses it using that type, the only time code in either unit would need to care about the other would be... – supercat Jun 09 '21 at 22:30
  • ...if the code were trying to use inter-module optimizations, and in that case the code should notice what was happening in the other module. Note that if structures S1 and S2 both have an `int foo[5]`, and pointers are declared `S1 *p1; S2 *p2;`, gcc will not allow for the possibility that accesses to `p1->foo[int1]` might interact with `p2->foo[int2]` even though both lvalues are simply dereferenced pointers of type `int*`. – supercat Jun 09 '21 at 22:34
  • @supercat "N1570 6.5p7 lists the types that may be used to access a structure, and its member types are not listed among them." That isn't an actual problem here; you can use the int* to access the field subobject. – Blaisorblade Jun 09 '21 at 22:42
  • "... if the code were trying to use inter-module optimizations, and in that case the code should notice what was happening in the other module." That doesn't affect whether UB exists. It might mean that compilers will not exploit the UB, but trying to guess that is a losing game. The order of different optimizations and inlining can give extremely surprising results. – Blaisorblade Jun 09 '21 at 22:45
  • @Blaisorblade: When the C Standard was written, it sought to characterize as UB any situation where a useful optimization might affect program behavior, without regard for whether the behavior that would exist without the optimization would be more useful. This is because the authors of the Standard expected that compiler writers would be far better able to judge when their customers would benefit more from the optimization or from being able to use the non-optimized behavior, and they did not want to prevent compiler writers from best serving their customers. – supercat Jun 09 '21 at 22:50
  • Finally, I agree on your example re `S1 *p1; S2 *p2`, because p1 and p2 can't alias each other, and because you're accessing the whole object. – Blaisorblade Jun 09 '21 at 22:56
  • The question asks about UB, not about exploitation probability. No matter the history, hoping that compilers will not exploit UB is a losing battle. Anyway, SO is recommending moving to chat, so I'll take a break. – Blaisorblade Jun 09 '21 at 23:00
  • @Blaisorblade: Obviously any non-garbage compiler would recognize that an access via an lvalue expression like `p1->foo[i]` would likely be an access to an object of type `S1`, but *because* it's so obvious the authors of the C Standard saw no need to expend ink explicitly allowing it. In evaluating `p1->foo[i]`, the subexpression `p1->foo` decays to yield a pointer of type `int*`, which is then displaced by `i` yielding another `int*`, and that resulting pointer is then dereferenced. Where in that sequence of events is "the whole object" of type `S1` accessed? – supercat Jun 09 '21 at 23:01
  • @Blaisorblade: Such "hope" was expressed by the authors of the C Standard in the published Rationale document. They correctly predicted that anyone seeking to *sell* compilers would seek to meet customer requirements without regard for whether the Standard required them to do so, but failed to imagine that someone whose compiler got distributed with Linux because it was freely distributable would use the Standard as an excuse to ignore programmer needs. – supercat Jun 10 '21 at 14:35
  • @supercat My phrasing was poor and “access” probably the wrong word, but 6.5.2.3 p4 has “A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object” and leaves undefined the behavior if no such structure object exists; so accessing `p1->foo[i]` and `p2->foo[j]` does require `p1` and `p2` point to objects. I will confess I'm more familiar with the C++ text, which is stricter in these matters. – Blaisorblade Jun 12 '21 at 09:00
  • And to clarify, that paragraph “leaves undefined” by saying nothing, but “Undefined behavior is otherwise indicated in this International Standard by the words “undefined behavior” or *by the omission of any explicit definition of behavior*” (Sec. 4 p2; all references to the C17 last draft N2176). – Blaisorblade Jun 12 '21 at 09:13
  • @Blaisorblade: Given `struct foo {int x,y[10];} *p;`, when the lvalue expression `p->y` is used as anything other than the operand of `&` or `sizeof`, that expression does not access anything, but merely yields a pointer to the first element. While the Standard should IMHO treat `[]` as an operator which acts directly upon array objects, rather than on pointers produced by array decay, that's not what it actually says. Instead, `x[y]` is defined as equivalent to `(*((x)+(y)))`. Further, the Standard makes no consistent effort to explicitly define behavior in all cases where... – supercat Jun 12 '21 at 16:40
  • ...various parts of the Standard and an implementation's documentation would, taken together, describe how an action would behave in all defined cases. The lack of an explicit definition of behavior may imply that the authors of the Standard hadn't ruled out the possibility that it might on at least some occasions be useful for an implementation to deviate from commonplace behavior, but it certainly does not imply any judgment that implementations shouldn't be expected to behave in commonplace fashion *absent a compelling and documented or obvious reason for doing otherwise*. – supercat Jun 12 '21 at 16:50
  • Any collection of source files that is accepted by at least one conforming C implementation somewhere in the universe is a conforming C program, and for many freestanding implementations, all non-trivial programs rely upon actions characterized by the Standard as invoking Undefined Behavior (if nothing else, dereferencing pointers to hardware registers that trigger actions but don't have associated storage, and which are thus not "objects"). The ability to meaningfully process such programs is thus a Quality of Implementation issue outside the Standard's jurisdiction. – supercat Jun 12 '21 at 16:57
  • “that expression does not access anything, but merely yields a pointer to the first element” I understand the basic semantics, but *in fact* it also informs the optimizer about the structure of the heap, thanks to UB. “While the Standard should IMHO treat [] as an operator which acts directly upon array objects […] x[y] is defined as equivalent to (*((x)+(y))).” But that addition acts on array objects, per 6.5.6. – Blaisorblade Jun 12 '21 at 19:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233697/discussion-between-blaisorblade-and-supercat). – Blaisorblade Jun 12 '21 at 19:43