
I am learning C++ using the books listed here. In particular, I read here that:

If the value represented by a single hexadecimal escape sequence does not fit the range of values represented by the character type used in this string literal (char, char8_t (since C++20), char16_t, char32_t (since C++11), or wchar_t), the result is unspecified.

(emphasis mine)

This means that, in a system where char is signed, the result of '\xe4' will be unspecified. But here the person says that "it is implementation defined and not unspecified".

So, my question: is the behavior of the statements below unspecified or implementation-defined? That is, is this an error in cppreference's documentation, or have I misunderstood it?

char arr[] = {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}; //unspecified or implementation defined 
char ch = '\xef';                                              //unspecified or implementation defined
Jason
Alex

2 Answers

2

This can be either implementation-defined (as per C++17) or (probably) well-defined (as per C++23).

In C++17 (or earlier?), according to this Draft Standard:

5.13.3 Character literals        [lex.ccon]


8     … The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). …

However, from this Draft C++23 Standard (also §5.13.3, [lex.ccon]):

3.2.3     Otherwise, if the character-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the character-literal's type, then the value is the unique value of the character-literal's type T that is congruent to v modulo 2^N, where N is the width of T.

So, in your case, as long as the value of the escape sequence is representable by an unsigned char, there is neither undefined nor implementation-defined behaviour as of C++23. However, if that value is outside the range of that unsigned equivalent, then the literal is ill-formed:

3.2.4     Otherwise, the character-literal is ill-formed.


Note: This C++20 Draft Standard has the same clause as the above-cited C++17 version (although it's paragraph 7, rather than 8).

Adrian Mole
  • Does this mean the cppreference is wrong? Because they said that it is unspecified. – Alex Sep 17 '22 at 15:03
  • Well, I'm not going to argue with the *person* (as it happens, a Stack Overflow moderator who knows a thing or two about C++) who said so. – Adrian Mole Sep 17 '22 at 15:07
  • I am not asking you to argue with anyone. Note that at the end of my question I asked whether the [documentation](https://en.cppreference.com/w/cpp/language/escape#Notes) is wrong, because they (the cppreference people) said that the behaviour is unspecified. That is, I am not asking you to argue with the SO moderator; I am asking about the cppreference documentation. – Alex Sep 17 '22 at 15:16
  • Well, the answer to that is implicit in my answer, already. The fact that both Standards that I quote *specify* what happens means that, in neither case, is the behaviour *unspecified*. – Adrian Mole Sep 17 '22 at 15:20
1

This concern was resolved by the adoption of P2029R4 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) for C++23 at the October 2021 WG21 virtual plenary. See that paper for links to the relevant core issues.

The behavior is now well-defined. C++ now requires two’s complement representation for integer types (following the adoption of P1236 for C++20). When char is a signed type, the result is the (unsigned) value of the numeric escape converted to a signed type. Whether the result is negative depends on the range of values representable by char (char may be larger than 8 bits).

Tom Honermann