4

I can't understand what this means in the C++ standard:

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

As I understand it, if the compiler sees a character not in the basic character set, it just replaces it with a sequence of characters in the format '\uNNNN' or '\UNNNNNNNN'. But I don't get how to obtain this NNNN or NNNNNNNN. So this is my question: how is the conversion done?

greensher
    The same way you'd perform any other character set conversion. Using an OS service or third-party library which is able to map the source character set to the target character set. You just look up in a table which unicode codepoint this character in the source character set corresponds to. – jalf Mar 09 '13 at 13:53
  • @jalf, could you give example, please? – greensher Mar 09 '13 at 14:06
  • It seems to mean the programmer enters \uXXXX and the compiler reads that for its (unicode) internal format. – QuentinUK Mar 09 '13 at 14:06
  • @QuentinUK, I think you're right – greensher Mar 09 '13 at 14:33

3 Answers

3

Note the preceding sentence which states:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.

That is, it's entirely up to the compiler how it actually interprets the characters or bytes that make up your file. In doing this interpretation, it must decide which of the physical characters belong to the basic source character set and which don't. If a character does not belong, then it is replaced with the universal character name (or at least, the effect is as if it had done).

The point of this is to reduce the source file down to a very small set of characters - there are only 96 characters in the basic source character set. Any character not in the basic source character set has been replaced by \, u or U, and some hexadecimal digits (0-F).

A universal character name is one of:

\uNNNN
\UNNNNNNNN

Where each N is a hexadecimal digit. The meaning of these digits is given in §2.3:

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed.

The ISO/IEC 10646 standard originated before Unicode and defined the Universal Character Set (UCS). It assigned code points to characters and specified how those code points should be encoded. The Unicode Consortium and the ISO group then joined forces to work on Unicode. The Unicode standard specifies much more than ISO/IEC 10646 does (algorithms, functional character specifications, etc.) but both standards are now kept in sync.

So you can think of the NNNN or NNNNNNNN as the Unicode code point for that character.
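For instance, you can see the code point directly by using a char32_t literal, whose value is exactly that code point. A minimal sketch, assuming the source file is saved in an encoding the compiler expects (e.g. UTF-8 with a typical GCC or Clang setup):

#include <cstdio>

int main() {
    char32_t c = U'€';                     // UTF-32 character literal; its value is the code point
    std::printf("U+%04X\n", (unsigned)c);  // prints U+20AC
}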

As an example, consider a line in your source file containing this:

const char* str = "Hellô";

Since ô is not in the basic source character set, that line is internally translated to:

const char* str = "Hell\u00F4";

This will give the same result.
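You can verify this equivalence yourself; a small sketch, assuming the file is saved in an encoding the compiler understands (e.g. UTF-8) and a UTF-8 execution character set:

#include <cstring>
#include <iostream>

int main() {
    const char* a = "Hellô";        // extended character written directly
    const char* b = "Hell\u00F4";   // the same character written as a universal-character-name
    std::cout << std::boolalpha
              << (std::strcmp(a, b) == 0) << '\n';  // prints true: both spellings yield identical bytes
}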

Note that a universal-character-name is only permitted in certain parts of your code: identifiers, character literals, and string literals.
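For illustration (valid C++11, though older compilers vary in how well they support universal-character-names in identifiers):

int \u00E9tat = 0;                  // identifier containing U+00E9 ('é')
char32_t degree = U'\u00B0';        // character literal designating U+00B0 ('°')
const char* senor = "Se\u00F1or";   // string literal containing U+00F1 ('ñ')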

Joseph Mansfield
    The question is: how does the implementation map between “ô” and “\u00F4”? After all, the compiler doesn’t see “ô”, it merely sees a byte outside the range 0–127. Which encoding is assumed? What happens to multibyte characters? As far as I see it ISO 10646 defines a mapping between characters and code points but *neither is presented to the compiler here*; and ISO 10646 doesn’t seem to define an encoding as far as I read its Wikipedia article. – Konrad Rudolph Mar 09 '13 at 14:56
  • e.g. the compiler checks if this code point belongs to the basic char set, and if not it replaces this code point = NNNN or NNNNNNNN with the sequence of code points \uNNNN or \UNNNNNNNN. – greensher Mar 09 '13 at 14:58
  • @KonradRudolph That's because we're ignoring the step *before* this step: "Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary." The translation is implementation-defined. – Joseph Mansfield Mar 09 '13 at 15:01
  • @sftrabbit No: *I’m* not ignoring it, *you* were. ;-) As far as I understand OP’s question, *this* was the only relevant point. – Konrad Rudolph Mar 09 '13 at 15:01
  • @sftrabbit, e.g. the compiler checks if this code point belongs to the basic char set, and if not it replaces this code point = NNNN or NNNNNNNN with the sequence of code points \uNNNN or \UNNNNNNNN? – greensher Mar 09 '13 at 15:04
  • @KonradRudolph Fair point - I added a bit to the top of my answer. – Joseph Mansfield Mar 09 '13 at 15:07
    @greensher No. The file is a text file made up of characters, not code points. The compiler decides whether a particular character belongs to the basic character set (in an implementation-defined way) and if it doesn't, replaces it with `\u` or `\U` followed by the code point that represents that character in the Universal Character Set. – Joseph Mansfield Mar 09 '13 at 15:09
2

But I don't get how to obtain this NNNN or NNNNNNNN. So this is my question: how to do conversion?

The mapping is implementation-defined (e.g. §2.3 footnote 14). For instance if I save the following file as Latin-1:

#include <iostream>

int main() {
    std::cout << "Hellö\n";
}

… and compile it with g++ on OS X, I get the following output after running it:

Hell�

… but if I had saved it as UTF-8 I would have gotten this:

Hellö

Because GCC assumes UTF-8 as the input encoding on my system.

Other compilers may perform different mappings.
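If you don't want to rely on the compiler's guess, GCC lets you state both encodings explicitly (GCC-specific flags; other compilers have their own equivalents, and hello.cpp is just a placeholder file name):

g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 hello.cpp -o hello

-finput-charset names the encoding of the source file, and -fexec-charset names the encoding used for narrow string and character literals in the compiled program.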

Konrad Rudolph
1

So, if your file is called Hello°¶.c, then when the compiler uses that name internally, e.g. if we do:

cout << __FILE__ << endl;

the compiler would translate Hello°¶.c to Hello\u00b0\u00b6.c.

However, when I just tried this with g++, it didn't do that...

But the assembler output contains:

.string "Hello\302\260\302\266.c"
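Those octal escapes are simply the UTF-8 bytes of the two code points: \302\260 is 0xC2 0xB0, i.e. U+00B0 ('°'), and \302\266 is 0xC2 0xB6, i.e. U+00B6 ('¶'). A quick sketch to check this (it assumes a UTF-8 execution character set, as on this system):

#include <cstdio>
#include <cstring>

int main() {
    const char a[] = "\302\260";   // the raw octal byte escapes from the assembler output
    const char b[] = "\u00b0";     // the same character as a universal-character-name
    std::printf("%d\n", std::strcmp(a, b) == 0);  // prints 1: identical byte sequences
}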
Mats Petersson
  • Try `cout << "Hello\u00b0\u00b6.c" << endl;` and you'll see the same string. I think you're forgetting that the next phases of translation will have to translate \u00b0\u00b6 back to the appropriate character set, which is UTF-8 on your system. –  Mar 09 '13 at 14:54
  • Same comment as on sftrabbit’s answer: how does the compiler get from “°” to “\u00b0”? – Konrad Rudolph Mar 09 '13 at 14:57
  • I expected that `g++ -E Hello°¶.c` would actually output a string with `\uXXXX` in it. It didn't - it had `"Hello°¶.c"` as the output. – Mats Petersson Mar 09 '13 at 14:57
  • @KonradRudolph: Well, there is a well defined UNICODE character set. It defines every character known to man (well, there are some contentious ones, but let's ignore that for now) in a 32-bit integer value. Now, how that is encoded when it's in a filename, or as part of a string literal, etc, etc is up to the environment in which the compiler works (e.g. if I use Xemacs on Linux, there is a defined system, but that may not work the same in Windows). The degree symbol is character number 176 in the Unicode list - which is `b0` in hex. – Mats Petersson Mar 09 '13 at 15:02
  • @MatsPetersson The C++ standard does not require the exact steps, it only requires that the end result is correct. If GCC does not bother to change `°` into `\u00b0` in contexts in which `\u00b0` would end up interpreted as `°` anyway, that's okay. There is no requirement for the `g++ -E` output to correspond exactly to the result of translation phase 4, and in some cases, it would even be a huge problem if it did. –  Mar 09 '13 at 15:03
  • @Mats Unicode is not an encoding. The question isn’t about the character set (well, maybe also), it’s about encoding. My point is that there’s in fact *no guarantee* that when I write “°” in the source code this will be converted to “\u00b0”. This depends on the file encoding and the encoding assumed by the compiler. *Only if those two match* will the result be “\u00b0” and this is far from given: hence OP’s question. – Konrad Rudolph Mar 09 '13 at 15:03
  • Ok, so what is the encoding that I see in my character map when I view the degree sign that says "U+00B0" if it's not the unicode code point for that character? Why does it use U? – Mats Petersson Mar 09 '13 at 15:09
  • @Mats No idea what the “U” stands for but this is irrelevant: “U+00B0” is a *code point*, not a byte sequence. It defines a character map, not an encoding. An encoding defines how code points are represented in byte sequences. – Konrad Rudolph Mar 09 '13 at 15:11
  • @MatsPetersson I think Konrad Rudolph's point, simplified, is this: put `°` in a text file. Save it as UTF-8. Open it as ISO-8859-1. Observe `Â°`. If a text editor can misinterpret `°` as `Â°`, there is no reason why a compiler cannot misinterpret it exactly the same way. The compiler does not *know* a file is saved as (for example) UTF-8, it has to guess. –  Mar 09 '13 at 15:12
  • Right, I'm with you now. So, yes, Unicode just defined a character map, and there are a variety of encodings, one of which is UTF-8 - I guess it's fairly easy to detect if it's UTF-16, since read as bytes, every other character is `0x00`. But anything else is "convention" and perhaps also looking at for example the system Locale. – Mats Petersson Mar 09 '13 at 15:16