5

According to this stackoverflow answer about C++11/14 strict alias rules:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • the dynamic type of the object,

  • a cv-qualified version of the dynamic type of the object,

  • a type similar (as defined in 4.4) to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

Can we access the storage of an object of another type using

(1) char *

(2) char(&)[N]

(3) std::array<char, N> &

without depending on undefined behavior?

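#include <array>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>
#include <string_view>

constexpr uint64_t lil_endian = 0x65'6e'64'69'61'6e;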
    // a.k.a. Clockwise-Rotated Endian which allocates like
    // char[8] = { n,a,i,d,n,e,\0,\0 }

const auto& arr =   // std::array<char,8> &
    reinterpret_cast<const std::array<char,8> &> (lil_endian);

const auto& carr =  // const char(&)[8]
    reinterpret_cast<const char(&)[8]>           (lil_endian);

const auto* p =     // char *
    reinterpret_cast<const char *>(std::addressof(lil_endian));

int main()
{
    const auto str1  = std::string(arr.crbegin()+2, arr.crend() );

    const auto str2  = std::string(std::crbegin(carr)+2, std::crend(carr) );

    const auto sv3r  = std::string_view(p, 8);
    const auto str3  = std::string(sv3r.crbegin()+2, sv3r.crend() );

    auto lam = [](const auto& str) {
        std::cout << str << '\n'
                  << str.size() << '\n' << '\n' << std::hex;
        for (const auto ch : str) {
            std::cout << ch << " : " << static_cast<uint32_t>(ch) << '\n';
        }
        std::cout << '\n' << '\n' << std::dec;
    };

    lam(str1);
    lam(str2);
    lam(str3);
}

all lambda invocations produce:

endian
6

e : 65
n : 6e
d : 64
i : 69
a : 61
n : 6e

godbolt.org/g/cdDTAM (enable -fstrict-aliasing -Wstrict-aliasing=2 )

wandbox.org/permlink/pGvPCzNJURGfEki7

sandthorn

2 Answers

3

The char(&)[N] case and std::array<char, N> case both result in undefined behavior. The reason has already been block-quoted by you. Note neither char(&)[N] nor std::array<char, N> is the same type as char.

I am not sure about the char * case, because the current standard does not explicitly say that an object can be viewed as an array of narrow characters (see here for further discussion).

Anyway, if you want to access the underlying bytes of an object, use std::memcpy, as the standard explicitly says in [basic.types]/2:

For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes ([intro.memory]) making up the object can be copied into an array of char, unsigned char, or std::byte ([cstddef.syn]). If the content of that array is copied back into the object, the object shall subsequently hold its original value. [ Example:

#define N sizeof(T)
char buf[N];
T obj;                          // obj initialized to its original value
std::memcpy(buf, &obj, N);      // between these two calls to std::memcpy, obj might be modified
std::memcpy(&obj, buf, N);      // at this point, each subobject of obj of scalar type holds its original value

— end example ]
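As a minimal sketch of the copy-based approach quoted above, the following shows one way to obtain the bytes of a trivially copyable object and restore it again. The helper names `bytes_of`, `from_bytes`, and `round_trips` are hypothetical and are not part of the standard library or of the original question.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not a standard facility): copy the underlying bytes
// of a trivially copyable value into an array of unsigned char.
template <class T>
std::array<unsigned char, sizeof(T)> bytes_of(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "bytes_of requires a trivially copyable type");
    std::array<unsigned char, sizeof(T)> out{};
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
}

// Hypothetical inverse: rebuild a value from previously copied bytes.
template <class T>
T from_bytes(const std::array<unsigned char, sizeof(T)>& bytes) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "from_bytes requires a trivially copyable type");
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}

// Round trip in the spirit of the quoted example: copying the bytes out
// and back in again yields the original value.
inline bool round_trips(std::uint64_t v) {
    const auto bytes = bytes_of(v);
    const auto back = from_bytes<std::uint64_t>(bytes);
    assert(back == v);
    return back == v;
}
```

The order of the bytes returned by `bytes_of` depends on the platform's endianness, so code that turns them into text (as in the question) still has to account for that.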

Barry
xskxzr
  • `std::memcpy` is not magic. The standard doesn't suggest it's the only possible working method of copying. – n. m. could be an AI Dec 21 '17 at 10:45
  • @n.m. Yes, I just suggest one precisely well-defined way to OP. – xskxzr Dec 21 '17 at 11:02
  • @xskxzr If you consider that pointer arithmetic is not allowed on the object representation, what other possibility is offered? If you do not find any other possibility, why would a `memcpy` requirement not be included in the normative text, and why is it only included in the *non-normative example*? – Oliv Dec 21 '17 at 11:25
  • 2
    @Oliv Maybe `memmove`? My opinion is that the words in the standard are not clear for this problem. In fact, I agree with you that pointer arithmetic **should** be allowed for such case. – xskxzr Dec 21 '17 at 11:36
  • @xskxzr If T = char[8], wouldn't &obj become char(&)[8]? How come std::memcpy is so superior that it can use char(&)[8] while I can't? – sandthorn Dec 21 '17 at 16:32
  • 1
    @sandthorn The parameter of `std::memcpy` is `void*`, and everything is well-defined while converting `T*` to `void*`, then invoking `std::memcpy`. In fact, stl functions are black boxes of your C++ program, and their implementations, which can be viewed as part of the compiler in a sense, are not required to be valid C++. – xskxzr Dec 22 '17 at 03:42
  • @xskxzr Ah, `memcpy` is one of a few **blessed** functions that can implicitly create lifetime by the law. In C++23, we now have `std::start_lifetime_as` [P2590R2](https://wg21.link/P2590R2). I just wonder whether it would change anything in this scenario? Can we really quit that `memcpy` workaround for good? If possible, please update the answer for C++23. – sandthorn Oct 14 '22 at 09:45
2

The strict aliasing rule is in fact very simple: two objects with overlapping lifetimes cannot have overlapping storage regions unless one is a subobject of the other.(*)

Nevertheless, it is allowed to read the memory representation of an object. The memory representation of an object is a sequence of unsigned char [basic.types]/4:

The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T). The value representation of an object is the set of bits that hold the value of type T.

Accordingly in your example:

  • lam(str1) is UB (Undefined Behavior);
  • lam(str2) is UB (an array and its first element are not pointer interconvertible);
  • lam(str3) is not stated as UB in the standard. If you replace char with unsigned char, one could argue that you are reading the object representation (this is not explicitly defined either, but it should work on all compilers).

So using the third case and changing the declaration of p to const unsigned char* should always produce the expected result. The other two cases can work in this simple example, but may break if the code is more complicated or with a newer compiler version.
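A rough sketch of the third case, reading an object's bytes through `const unsigned char*`, might look like the following. The helper name `hex_dump` is hypothetical. As the comments on this answer point out, whether the pointer arithmetic used here is strictly defined by the standard is disputed, so treat this as common practice accepted by compilers rather than as a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the object representation of `obj` as a
// lowercase hex string, one pair of digits per byte, in memory order.
template <class T>
std::string hex_dump(const T& obj) {
    // Access the storage through unsigned char, the type the answer
    // suggests for reading an object representation.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&obj);
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out.push_back(digits[p[i] >> 4]);    // high nibble
        out.push_back(digits[p[i] & 0x0f]);  // low nibble
    }
    assert(out.size() == 2 * sizeof(T));
    return out;
}
```

Because the bytes are emitted in memory order, the output for multi-byte integers differs between little-endian and big-endian platforms.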


(*) There are two exceptions to this rule: one for union members with a common initial sequence, and one for an array of unsigned char or std::byte that provides storage for another object.
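The second exception in the footnote, an `unsigned char` array providing storage for another object, can be illustrated with placement new. This is only a sketch under assumptions: the type `Point` and the function `sum_via_storage` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical example type.
struct Point {
    std::int32_t x;
    std::int32_t y;
};

// Hypothetical helper: create a Point inside a suitably aligned
// unsigned char array, use it, then end its lifetime.
inline std::int32_t sum_via_storage(std::int32_t x, std::int32_t y) {
    // The array overlaps the Point created in it; it provides the storage.
    alignas(Point) unsigned char storage[sizeof(Point)];
    Point* p = ::new (static_cast<void*>(storage)) Point{x, y};
    // Placement new constructs the object at the address it was given.
    assert(static_cast<void*>(p) == static_cast<void*>(storage));
    const std::int32_t result = p->x + p->y;
    p->~Point();  // trivial here, shown for completeness
    return result;
}
```

Note that the object is accessed only through the pointer returned by the new-expression, not by casting `storage` afterwards.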

Oliv
  • 1
    The pointer addition to `char*` may result in UB. See [this discussion](https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/std-discussion/JNp8xsgRbW4). – xskxzr Dec 21 '17 at 10:29
  • @xskxzr, I have also discussed this subject many times with some of the members of this group, and with others. In the C++ paragraph I cite there is the term *sequence*, and in the normative text there are no references to `memcpy`. The example you cite is not normative. – Oliv Dec 21 '17 at 10:42
  • @xskxzr, moreover you are making a mistake: sandthorn's intent is to read the object representation. The fact that you can copy the object representation to an array of char and back does not mean that the values stored in the array of char are actually the representation: it is allowed to be a bijective mapping, an isomorphism. – Oliv Dec 21 '17 at 10:46
  • 1
    "The address of an array and the first element of an array may be different" huh? – n. m. could be an AI Dec 21 '17 at 10:48
  • @n.m. Or maybe it is specified that an array's size is length*element_size? You are making me doubt; I am going to change it while I look for it. – Oliv Dec 21 '17 at 10:53
  • 1
    @n.m. The fact that the first element of an array is not pointer interconvertible with the array suggests this: http://eel.is/c++draft/basic.compound#4 – Oliv Dec 21 '17 at 11:00
  • 1
    The note says they are not pointer interconvertible even though they have the same address. Makes one wonder why, – n. m. could be an AI Dec 21 '17 at 11:03
  • @Oliv In the `lam(str3)` case, is `unsigned char` really needed, even though the last rule explicitly mentions "`char` or `unsigned char`"? – sandthorn Dec 21 '17 at 11:05
  • @sandthorn, But what is cited in xskxzr's answer is totally orthogonal to the concept of object representation. If you want to read the object representation, use `unsigned char`. – Oliv Dec 21 '17 at 11:09
  • 1
    @sandthorn, But I believe you should not worry about that. There may be a hole in the standard, essentially because there is no concept of byte value, and because an implementation that changed the value of the bytes when you copy them into an array of char, and then applied the inverse transformation, would be a clear declaration of "I reject any form of retro-compatibility". – Oliv Dec 21 '17 at 11:18
  • 1
    Your opinion is also included in that discussion. The key is that a sequence of `unsigned char` objects is not the same as an array object. In fact, the standard does not explicitly specify that such array object exists. Regarding the paragraph I quoted (and the words about `memcpy` in the C standard), the standard uses the word "copy". I believe this is exactly the literal meaning. – xskxzr Dec 21 '17 at 11:23
  • 1
    IIRC there is an open core issue about whether the whole _"array/sequence of `unsigned char`"_ thing is meant to imply that the bytes making up an object representation can be treated as any other array; right now, it seems pretty clearly not well-defined to do so. – underscore_d Dec 21 '17 at 11:25
  • If nothing is "well defined" now, could somebody please write an example that demonstrates the "legality" of "the last rule" of the strict aliasing rules (the `char` and `unsigned char` exemption)? I wonder how I can use the last exemption. – sandthorn Dec 21 '17 at 11:35
  • 2
    @sandthorn Unfortunately it seems that there is a hole in the C++ standard's set of laws. So you should apply the *jurisprudence*: common practice accepted by compilers. I believe the two answers and your third case are common practices. – Oliv Dec 21 '17 at 11:42
  • Relevant: https://stackoverflow.com/questions/47830449/can-you-do-arithmetic-on-a-char-pointing-at-another-object and https://stackoverflow.com/questions/47498585/is-adding-to-a-char-pointer-ub-when-it-doesnt-actually-point-to-a-char-arr Your third point is highly disputed – Passer By Dec 21 '17 at 15:51
  • I question your statement that `unsigned char[]` can have overlapping lifetime with another object that occupy the same space. Where was that mentioned? – Passer By Dec 21 '17 at 15:55
  • 1
    @PasserBy You'll find it in [intro.object](http://eel.is/c++draft/intro.object#3). I believe it is part of the cleanup of the C++ memory/object concepts that is currently under way, because I do not find this paragraph in older versions of the standard. – Oliv Dec 21 '17 at 16:03
  • @PasserBy, About this question I have my own inductive reasoning. If the intention was not to let users use this possibility, then, simply put, low-level system programming would not be possible. Otherwise, the simple fact that coders want it and compilers allow it should be considered a loud indicator that there is a necessity for it. It is not possible to think about doing any low-level system programming without such possibilities. Personally, I find the C++ language incoherent: the standard library cannot be implemented in terms of it. – Oliv Dec 21 '17 at 16:16
  • @Oliv I agree it is better if it is possible. However, that might simply mean the standard is broken instead of the standard saying its well-defined. – Passer By Dec 21 '17 at 16:18
  • @Oliv: It is possible for an implementation to be useful for many purposes without being suitable for low-level programming; further, support for low-level programming may be impractical in some cases. What is "broken" is the fact that even though the Standard only attempts to define behaviors which would be practical in all cases, rather than defining all behaviors needed to make a compiler useful for any particular purpose, compiler writers assume it fully defines everything programmers should need for every purpose. – supercat Feb 07 '18 at 21:29
  • @PasserBy: From what I can tell, the C and C++ standards all follow the C89 practice of focusing on mandating behaviors in cases where they would expect that compilers might behave differently without a mandate, largely ignoring cases which implementations for commonplace targets would process predictably whether required to do so or not. Such focus would not be a defect if compiler writers recognized the existence and importance of precedents not listed in the Standard, but given that compiler writers interpret the Standard's failure to recognize something as... – supercat Feb 07 '18 at 21:39
  • ...an indication that quality implementations should not be expected to support it, the Standard should either start addressing such things or else more clearly state that quality compilers should not claim to be suitable for particular purposes on particular platforms unless they honor precedents for similar purposes and platforms. I don't know that such a statement would fly politically, since it would suggest that compilers that make some aggressive "optimizations" are of low quality, but it should have been included from the beginning. – supercat Feb 07 '18 at 21:42
  • @supercat Finally, this is recognized as a language flaw: there is an open core language issue about accessibility to object representation, maybe n1710 if my memory is not wrong. – Oliv Feb 08 '18 at 08:00
  • @Oliv: I couldn't find anything looking for n1710. I think a major problem with the approach the Standard takes is that if focuses on objects rather than derivations of lvalues. I think it is reasonable for a compiler to regard any of the following as unsequenced *in the absence of any specific evidence that would imply sequencing could matter*: (a) accesses to objects of different types, (b) accesses to distinct parts of the same structure type, or (c) accesses to distinct parts of the same array type. If a pointer of any type is derived from an lvalue, however, ... – supercat Feb 15 '18 at 15:52
  • ...such derivation should create a window in which actions using the resulting pointer would be sequenced with regard to any other actions on that storage. Rules based on lvalues and sequencing could allow many optimizations that aren't currently allowed, while allowing many useful constructs which are presently forbidden. – supercat Feb 15 '18 at 15:58
  • @supercat My memory was approximate, I inverted 2 numbers, it was core issue [1701](http://www.open-std.org/JTC1/SC22/WG21/docs/cwg_active.html#1701). I believe what you propose would solve many issue related to object representation. Could you propose it? – Oliv Feb 15 '18 at 17:06
  • @Oliv: I don't know any means of officially "proposing" anything. I also have a strong suspicion that some compiler writers are emotionally heavily invested in the idea that they should never have been expected to recognize that code which casts a `float*` to a `uint32_t*` and accesses it might really be accessing something of type `float`, and would thus oppose anything that might suggest that they should always have recognized such things. – supercat Feb 15 '18 at 17:28
  • @supercat For sure, the larger an organisation is, the more difficult it is to move it. Moreover, creative ideas cannot come from the inside. This is well known now, and new organisations are nowadays more connected to their outside. I think this is the case for the C++ standard committee. On isocpp.org the proposal submission procedure is detailed. I do not believe it will be such a big change for compilers and I believe there is work in progress on the C++ object model, so they are certainly expecting new ideas. – Oliv Feb 16 '18 at 07:40
  • @Oliv: The N1701 636 seems hung up on the type of an "object", a fundamentally broken concept for C which is also pretty much broken for C++ PODS. In a good language, the things a compiler *knows* about should coincide with those it would have reason to *care* about. My main interest is with C rather than C++, but there's no good reason why things like PODS shouldn't behave the same on both. – supercat Feb 16 '18 at 15:42
  • @Oliv: Also, I think some compilers' designs have evolved in a way that would require major changes to support my proposal (or the corner cases required by the Standard); that should be resolved by having the Standard define macros that would indicate what kinds of aliasing a compiler supports. The approach presently used by gcc and clang may be superior for some kinds of programs, and the Standard should be changed to allow that approach to be used *with programs for which it is suitable*, but also to make clear that a program's incompatibility with that approach is not a defect. – supercat Feb 16 '18 at 17:16
  • @Oliv In C++23, we now have `std::start_lifetime_as` [P2590R2](https://wg21.link/P2590R2). I just wonder whether it would change anything in this scenario? Can we really quit that `memcpy` workaround for good? If possible, please update the answer for C++23. – sandthorn Oct 14 '22 at 09:59
  • @supercat Any opinion on the effect of `std::start_lifetime_as` [P2590R2](https://wg21.link/P2590R2) on this scenario? A full-blown answer is very welcome. – sandthorn Oct 14 '22 at 10:04
  • 1
    @sandthorn: I don't have time to look at it now, but abstractions based upon the notion of objects having lifetime separate from the allocations occupied thereby end up being needlessly restrictive, while introducing corner cases that are almost impossible to handle efficiently and soundly, compared with a model based on the lifetimes of *references* whose corner cases can be easily examined based on execution paths. If a reference is created, a compiler may treat accesses using it, or others derived from it, as unsequenced relative to some particular access if either (1) the compiler can... – supercat Oct 14 '22 at 14:32
  • 1
    ...identify all references that would be derived from the original on all paths between its formation and the use of the latter reference, and the latter is not among those, or (2) the compiler can identify all references upon which the latter object was based, all of them existed when the original was based, and the original is not among them. The vast majority of useful aliasing optimizations would fit one of those criteria (if not both), and they will be vastly easier to analyze than approaches that ascribe "objects" to storage. – supercat Oct 14 '22 at 14:36
  • 1
    @sandthorn: Looking through the proposal, I can see a fundamental problem: sound aliasing optimizations require the ability to identify not just the start of an "object's" lifetime, but also the end. If one has a construct that creates a restricted reference to a region of storage as a particular type, then with the semantics that within the lifetime of the reference, no region of storage that is modified within that lifetime may be accessed both by a pointer which is definitely based upon it, and by one which is definitely not based upon it, and this construct is used twice sequentially... – supercat Oct 14 '22 at 15:50
  • 1
    ...to create references with disjoint lifetimes, then a compiler will be able to know that it must clean up any pending operations on the old reference before scheduling any operations using the new one. If an object is created from some address, however, and another object is later created at an address that may or may not be the same, a compiler would have no way of knowing whether the creation of the new object might interact with any pending operations on the old one. – supercat Oct 14 '22 at 15:54
  • 1
    @sandthorn: A fundamental difference between the notion of restricted references versus objects is that if a restricted reference X that can access a region of storage is passed to a function, both the programmer and compiler would be entitled to assume that the reference can access the same things after the function returns as it could access before the call. – supercat Oct 14 '22 at 18:18