C++11 character literal '\xC4' standard type with UTF-8 execution character set?

Question

Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI which requires the char type be a signed 8-bit byte).

The letter Ä (A with umlaut) has the Unicode code point 0xC4 and a two-code-unit UTF-8 representation of {0xC3, 0x84}.
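The code point versus code unit distinction can be checked at compile time. The following is a minimal sketch, not part of the original question; the constant names are hypothetical, and it relies only on the C++11 `u8`, `u` and `U` literal prefixes and on universal-character-names.

```cpp
// Illustrative sketch; the constant names below are hypothetical.

// UTF-8 encoding of U+00C4: two code units plus the terminating NUL.
static_assert(sizeof(u8"\u00C4") == 3, "U+00C4 takes two UTF-8 code units");
static_assert(static_cast<unsigned char>(u8"\u00C4"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00C4"[1]) == 0x84, "continuation byte");

// UTF-16 and UTF-32 each hold U+00C4 in a single code unit whose value
// equals the code point.
constexpr char16_t a_umlaut_utf16 = u'\u00C4';
constexpr char32_t a_umlaut_utf32 = U'\u00C4';
```

The assertions are evaluated when the translation unit is compiled, so nothing needs to run to see the result.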

The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.

Is the compiler standard-compliant and ABI-compliant? What is your reasoning?

Relevant quotes from C++11 standard:

2.14.3.1

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.

2.14.3.4

The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
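To see what a particular compiler does with the literal, its type and value can be inspected directly. This is a minimal sketch, assuming only `<type_traits>`; the constant names are hypothetical, and the results are whatever the compiler in use chooses, not a statement of what the standard requires.

```cpp
#include <type_traits>

// Hypothetical names, for illustration only.

// True when the compiler gives '\xC4' type char, as 2.14.3p1 describes
// for a literal containing a single c-char.
constexpr bool literal_is_char =
    std::is_same<decltype('\xC4'), char>::value;

// True when the compiler gives '\xC4' type int, the behaviour described
// in the question.
constexpr bool literal_is_int =
    std::is_same<decltype('\xC4'), int>::value;

// The literal's value converted to int. With an 8-bit signed char this is
// expected to be negative (0xC4 - 256); with type int or an unsigned char
// it would be 0xC4.
constexpr int literal_value = '\xC4';
```

Printing the three constants, or using them in a `static_assert`, shows which interpretation the implementation picked.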

@RemusRusanu: "The hexadecimal digits are taken to specify the value of the desired character". I think the value of the desired character means its code point. It makes no sense to specify the "code unit of the desired character" (as you seem to be implying it should), because a character can have more than one code unit (and does in the case of a UTF-8 encoded Ä). – Andrew Tomazos, Feb 25 '13 at 01:08

score 2 · Accepted Answer · answered Feb 25 '13 at 00:48


§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.

rici


I believe that's the correct language for the case where someone puts the out-of-range value directly into a char literal, 'Ä' as a one-byte value, but what bearing does it have on '\xc7', which has an escape sequence not a c-char? – jthill Feb 25 '13 at 00:56
@jthill: a *c-char*, according to the grammar, includes an *escape-sequence*. So I think `\xC7` is a *c-char*. – rici Feb 25 '13 at 00:58
Yes, provided that "the hexadecimal digits are taken to specify the value of the desired character" is referring to the code point as the value, then the behaviour of `'\xC4'` and `'Ä'` should be identical. – Andrew Tomazos Feb 25 '13 at 01:04
@rici (duh) right. Don't know where I got the idea c-char doesn't include the escapes. – jthill Feb 25 '13 at 01:17
...and in both cases the type cannot be a single byte with an execution character set of UTF-8. – Andrew Tomazos Feb 25 '13 at 01:22

jthill · Answer 2 · 2013-02-24T22:55:25.510

1

From 2.14.3p4:

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char

x86 compilers historically (and as you point out, that practice is now an official standard of some sort) have signed chars. \xc7 is out of range for that, so the implementation is required to document the literal value it will produce.

It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
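The range argument can be sketched in code. The helpers below are hypothetical and only illustrate the arithmetic: `fits_in_char` tests a value against the implementation's range for `char`, and `as_signed_octet` shows how the bit pattern 0xC4 reads as an 8-bit two's-complement number. Whether a given compiler actually produces that value for the literal is implementation-defined, as the quoted paragraph says.

```cpp
#include <limits>

// Hypothetical helper: does `value` lie within the range of char
// on this implementation?
constexpr bool fits_in_char(long value) {
    return value >= std::numeric_limits<char>::min() &&
           value <= std::numeric_limits<char>::max();
}

// Hypothetical helper: reinterpret a value in 0..255 as an 8-bit
// two's-complement number.
constexpr int as_signed_octet(int value) {
    return value >= 0x80 ? value - 0x100 : value;
}
```

With an 8-bit signed `char`, `fits_in_char(0xC4)` is false because 196 exceeds 127, and `as_signed_octet(0xC4)` gives -60.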

edited Feb 24 '13 at 22:55

answered Feb 24 '13 at 22:44

jthill


To be pedantic: x86-64 C++ compilers are _required_ to have `char` be an 8-bit signed byte to be compliant with the ABI, it isn't just a historical trend. – Andrew Tomazos Feb 24 '13 at 22:48
Can you clarify what you mean by "promotes out-of-range char literals"? You mean promotes them to `int`? Are you saying this is or is not standard-compliant behaviour? – Andrew Tomazos Feb 24 '13 at 22:49
I don't believe they're required to do so by the C++ standard. Not every compiler is required to follow every standard. – jthill Feb 24 '13 at 22:50
No the _ABI_, not the C++ standard. The ABI is the `x86-64` standard. – Andrew Tomazos Feb 24 '13 at 22:50
Specifically this document: http://www.cs.tufts.edu/comp/40/readings/amd64-abi.pdf – Andrew Tomazos Feb 24 '13 at 22:52
So long as the implementation documents what values it will produce here, its behavior is compliant. – jthill Feb 24 '13 at 22:52

score 0 · Answer 3 · edited May 23 '17 at 11:43

You're mixing apples, oranges, pears and kumquats :)

Yes, '\xc4' is a legal character literal. Specifically, what the standard calls a "narrow character literal".

From the C++ standard:

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

This might help clarify:

Rules for C++ string literals escape character

This might also help, if you're not familiar with it:

The absolute minimum every software developer should know about Unicode

Here is another good, concise - and illuminating - reference:

IBM Developerworks: Character literals

I agree that `'\xC4'` is a legal character literal, however its representation in the execution character set of UTF-8 is 16 bits (0xC3, 0x84), and as such cannot fit in type `char`, which is 8 bits. So first of all, what do you propose the type of `'\xC4'` to be? – Andrew Tomazos, Feb 25 '13 at 00:57
Also I think the quote is not relevant: the character literal `'\xC4'` consists of 6 characters, namely `'`, `\`, `x`, `C`, `4`, and `'`, all of which are members of the _basic source character set_ (which consists of 96 characters). – Andrew Tomazos, Feb 25 '13 at 01:00
