
I want to use fixed contiguous bytes of a long byte array s as keys in a std::map<std::array<char,N>,int>. Can I do this without copying by reinterpreting subarrays of s as std::array<char,N>?

Here is a minimal example:

#include <array>
#include <map>
int main() {
    std::map<std::array<char,10>,int> m;
    const char* s="Some long contiguous data";

    // reinterpret some contiguous 10 bytes of s as std::array<char,10>
    // Is this UB or valid? 
    const std::array<char,10>& key=*reinterpret_cast<const std::array<char,10>*>(s+5);

    m[key]=1;
}

I would say yes, because char is a POD type that does not require alignment to specific addresses (in contrast to bigger POD types, see https://stackoverflow.com/a/32590117/6212870). Therefore, it should be OK to reinterpret_cast to std::array<char,N> starting at every address as long as the covered bytes are still a subrange of s, i.e. as long as I ensure that I do not have buffer overflow.

Can I really do such reinterpret_cast or is it UB?

EDIT: In the comments, people correctly pointed to the fact that I cannot know for sure that for std::array<char,10> arr it holds that (void*)&arr==(void*)&arr[0] due to the possibility of padding of the internal c-array data member of the std::array template class, even though this typically should not be the case, especially since we are considering a char POD array. So I update my question:

Can I rely on the reinterpret_cast as done above when I check via static_assert that there is indeed no padding? Of course the code won't compile anymore on compiler/platform combinations where there is padding, so I won't use this method there. But I want to know: Are there other concerns apart from the padding? Or is the code valid with a static_assert check?

phinz
  • related https://stackoverflow.com/questions/69500721/is-the-address-of-a-stdarray-guaranteed-the-same-as-its-data – 463035818_is_not_an_ai Dec 10 '21 at 12:01
  • @463035818_is_not_a_number Indeed this is important to consider. But padding at the front of `std::array` could be accounted for at compile time by adjusting the offset where I start reinterpret_casting. Are there other problems with my questions apart from front padding? – phinz Dec 10 '21 at 12:10
  • Off-topic, but size is not relevant for this being legal, rather if there's transparent data available in the type (like a v-table pointer in instances of virtual classes). – Aconcagua Dec 10 '21 at 12:11
  • how do you want to account for that? Padding is implementation defined. I'd be more concerned about a `char` array not being a `std::array` object. Some facilities have been added in recent standards, though most of the time you think "reinterpret_cast" you actually want `memcpy` – 463035818_is_not_an_ai Dec 10 '21 at 12:14
  • Consider padding bytes in between members! As far as I recall, they are not required to be set to any specific value, so could hold indeterminate values. In that case, even if the cast *is* legal, you might not be able to reproduce the same key as you got for an object you added to the map earlier. How would you get it back then? – Aconcagua Dec 10 '21 at 12:14
  • reinterpret_cast can actually be used only in a rather limited set of cases. If you cannot find a bullet that matches your case here: https://en.cppreference.com/w/cpp/language/reinterpret_cast then the cast is not defined – 463035818_is_not_an_ai Dec 10 '21 at 12:15
  • @463035818_is_not_a_number I could extract a `constexpr size_t` offset of the first member address, i.e. the address of `operator[](0)` of `std::array` minus its own address, and subtract this offset from the position where I do the reinterpret_cast? – phinz Dec 10 '21 at 12:18
  • Are you using always the same data type as key? It might be more suitable then just to provide a custom hasher, e.g. by specialising `std::hash`. Is the object itself too large? Do you only want to use part of as key? A suitable sub-structure might be appropriate then, e. g. `struct TheType { struct TheKey { /* key members */ } key; /* other members */ };` – Aconcagua Dec 10 '21 at 12:20
  • @Aconcagua I want to write a parser and have a long binary message and want many subparts of a long string as keys. And are you sure that there can also be padding in between? – phinz Dec 10 '21 at 12:23
  • @phinz Yes – that's the normal case; for POD types you usually don't have any padding at the beginning anyway (might even be guaranteed). Consider `struct S { char c; int i; };`. Typically you have three padding bytes in between `c` and `i` to make sure `i` is appropriately aligned. – Aconcagua Dec 10 '21 at 12:25
  • @463035818_is_not_a_number I don't find my particular case in en.cppreference.com/w/cpp/language/reinterpret_cast but I thought maybe some expert would come and say "If you have no padding and no alignment problems, which you don't have with `char`, then it is OK due to ..." – phinz Dec 10 '21 at 12:29
  • @Aconcagua I read that on the padding at the beginning in the link of the first comment. – phinz Dec 10 '21 at 12:31
  • @phinz The point is: You proposed solving the issue by adding an offset to the beginning of the class. But you cannot solve padding in between that way. As in my example: Any offset other than 0 would skip the `c` member. Transferring binary data *including* padding bytes is problematic anyway, by the way; you need to make 100% sure that *all* involved executables use *exactly* the same alignment settings for the struct members. Better would be having appropriate serialising and deserialising (skipping any potential padding bytes). – Aconcagua Dec 10 '21 at 12:37
  • I would say it is pedantically UB, even if I think, that in practice, you will observe the expected behavior. – Jarod42 Dec 10 '21 at 12:37
  • Oh, yet another problem: Consider two *distinct* types `struct S1 { char c1; char c2; };` and `struct S2 { unsigned short s; };` – on any system with 8-bit char and 16-bit short, `c1` and `c2` would occupy the same memory as `s`. If now by accident both happen to carry identical byte values (`c1 == 1; c2 == 1; s == 257`, for instance), then you wouldn't be able to distinguish them within the map. Are you sure this is acceptable for you? – Aconcagua Dec 10 '21 at 12:46
  • @Aconcagua I found this https://stackoverflow.com/questions/6632915/is-the-memory-in-stdarray-contiguous. So there should be no padding in between the `char`s ? And the other problem does not apply to my case because I parse the long byte array such that I know where the keys are located, they are actually just 10 bytes long char arrays and no serialized structs. – phinz Dec 10 '21 at 12:53
  • there is no padding between the `char`s but `std::array` might have padding (though in practice it probably doesn't). You pretend to know internal structure of `std::array`, when in fact you don't. – 463035818_is_not_an_ai Dec 10 '21 at 13:28
  • @463035818_is_not_a_number [cppreference](https://en.cppreference.com/w/cpp/container/array): *'This container is an aggregate type with the same semantics as a struct holding a C-style array T[N] as its only non-static data member.'* – if the array members are standard layout types, then the whole array is and thus the holding type as well, shouldn't it? But if so, struct and first member should share the same address... – Aconcagua Dec 10 '21 at 13:46
  • @Aconcagua "But if so, struct and first member should share the same address.." see the q&a in my first comment. TL;DR: in practice yes, pedantically no – 463035818_is_not_an_ai Dec 10 '21 at 14:01
  • @463035818_is_not_a_number If I read the [standard](https://timsong-cpp.github.io/cppwp/n4868/basic.compound#4.3) right it contradicts you... Note: Followed to there via comment to answer to your referenced question. – Aconcagua Dec 10 '21 at 15:23
  • @Aconcagua that's interesting, though I am not sure if it really applies to OP's cast. Unfortunately, in the question I linked, the reinterpret_cast thingy was just a follow-up question. I'll add the `language-lawyer` tag and am looking forward to what experts have to say about it – 463035818_is_not_an_ai Dec 10 '21 at 15:26
  • _You pretend to know internal structure of std::array, when in fact you don't._ I think this is the answer even though I don't like it. For me it does not make sense that the standard didn't specify that there should not be padding. It seems a bit like the time when std::string was not guaranteed to hold strings in consecutive memory. – phinz Dec 10 '21 at 15:32
  • @463035818_is_not_a_number Well, relevant for OP's cast is that the very first character in the resulting `std::array` matches the one selected as start value in the large array. And that's the case if (and only if) there's no offset in between the `std::array` object and its internal array member... – Aconcagua Dec 10 '21 at 15:32
  • Basically the question is about whether the following is valid: `struct S { int i; }; int i = 0; reinterpret_cast<S*>(&i)->i; // OK???`. It is not, except that it is a lil bit underspecified. Size, alignment of `S` are unrelated. – Language Lawyer Dec 10 '21 at 15:33
  • @Aconcagua From my understanding now, yes. But when I can't rely on this I will just memcpy into an std::array to be on the safe side. – phinz Dec 10 '21 at 15:34
  • @phinz Well, a bit we actually know: *'This container is an aggregate type with the same semantics as a struct holding a C-style array T[N] as its only non-static data member.'*, see previous comment. So we need to assume something like `template <typename T, size_t S> class array { T _data[S]; }` or at least something that can mimic such a class 1:1. – Aconcagua Dec 10 '21 at 15:36
  • @LanguageLawyer I do not see why that shouldn't be valid – could you refer to the standard? But the analogy is great, simplifying a lot. – Aconcagua Dec 10 '21 at 15:38
  • @Aconcagua I wrote it is underspecified. You can read https://timsong-cpp.github.io/cppwp/n4868/expr.ref#6.2.sentence-1 as not applying since the object denoted by `*reinterpret_cast<S*>(&i)` has no subobjects. And now it is implicit UB. – Language Lawyer Dec 10 '21 at 15:41
  • @Aconcagua _But the analogy is great_ Actually what `map::operator[]` (indirectly) does is call `array::operator[]` on its argument (through `std::less`). Not sure what https://timsong-cpp.github.io/cppwp/n4868/expr.call#2.sentence-1 means by _the call is as a member_, but _of the class object referred to by the object expression_ can also be read as not applying to the case when the object denoted by the object expression is not of class type. – Language Lawyer Dec 10 '21 at 15:44
  • related discussion about casting in the other direction: https://stackoverflow.com/q/48444004/1863938 – parktomatomi Dec 10 '21 at 16:58

1 Answer


No—there is no object of type std::array<char,10> at that address, regardless of the layout of that type. (The special rules for char do not apply to a type that happens to have char subobjects.) As always, it is not the reinterpret_cast itself whose behavior is undefined, but rather the access through that non-object when using it as a map key. (What you are allowed to do in this case is merely cast it back to the real type, for use with C-like interfaces that require a fixed pointer type but do not actually use the object.)

This access also of course involves a copy; if your goal was to avoid copying at all, just make a

std::map<const char*,int,ten_cmp>

where ten_cmp is a functor type that compares 10 bytes starting from each address (via std::strncmp or std::string_view).
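A minimal sketch of such a comparator, assuming 10-byte keys and a source buffer that outlives the map. The names key_len and demo_distinct_keys are illustrative and not part of the original suggestion. std::memcmp is swapped in for std::strncmp here, on the assumption that the data is binary and may contain zero bytes, which would otherwise end the comparison early.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>

// Hypothetical key length; the question uses 10-byte keys.
constexpr std::size_t key_len = 10;

// Orders two pointers by the key_len bytes they point at.
// Both pointers must refer to at least key_len readable bytes.
struct ten_cmp {
    bool operator()(const char* a, const char* b) const {
        assert(a != nullptr && b != nullptr);
        // memcmp compares all key_len bytes, including zero bytes.
        return std::memcmp(a, b, key_len) < 0;
    }
};

// Hypothetical demo: different addresses holding the same 10 bytes
// refer to the same map entry. Returns the number of distinct keys.
std::size_t demo_distinct_keys() {
    static const char s[] = "0123456789ABCDEFGHIJ0123456789";
    std::map<const char*, int, ten_cmp> m;
    m[s] = 1;       // key bytes "0123456789"
    m[s + 10] = 2;  // key bytes "ABCDEFGHIJ"
    m[s + 20] = 3;  // same bytes as s, so the first entry is overwritten
    return m.size();
}
```

The map stores only pointers, so nothing is copied on insert, but every key is valid only while the underlying buffer is alive and unchanged.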

If you do want the map to own its key data, just std::memcpy from the string into a key; compilers often recognize that such temporary “buffers” don’t need to exist independently and actually read from the source in the fashion you hope to do with reinterpret_cast.
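A minimal sketch of the copying variant, assuming the same 10-byte keys as in the question. Key, make_key and demo_lookup are hypothetical names introduced for illustration. Because make_key fills a real std::array<char,10> object, the map reads only from an object that actually exists.

```cpp
#include <array>
#include <cassert>
#include <cstring>
#include <map>

using Key = std::array<char, 10>;

// Hypothetical helper: copies 10 bytes starting at p into a real Key object.
// The caller must guarantee that p points at 10 or more readable bytes.
Key make_key(const char* p) {
    assert(p != nullptr);
    Key k;
    std::memcpy(k.data(), p, k.size());
    return k;
}

// Hypothetical demo: inserts a key taken from the middle of a buffer
// and looks it up again with a second key built from the same bytes.
int demo_lookup() {
    const char* s = "Some long contiguous data";
    std::map<Key, int> m;
    m[make_key(s + 5)] = 1;
    return m.at(make_key(s + 5));
}
```

Here the map owns its keys, so the source buffer can be released once the keys have been built.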

Davis Herring
  • Thanks for reading between the lines and giving the useful extra information on the functor and compiler capabilities! – phinz Dec 11 '21 at 09:12
  • "This access also of course involves a copy;" `T& operator[]( const Key& key );` only copies the key when it does not exist yet in the map (like in the example code), right? – phinz Dec 11 '21 at 09:18
  • @phinz: It has to *read* from the key (which provokes the UB) regardless, of course, but a true copy does occur only on insert, yes. – Davis Herring Dec 11 '21 at 16:55