Note the preceding sentence which states:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
That is, it's entirely up to the compiler how it actually interprets the characters or bytes that make up your file. In doing this interpretation, it must decide which of the physical characters belong to the basic source character set and which don't. If a character does not belong, then it is replaced with the universal character name (or at least, the effect is as if it had done).
The point of this is to reduce the source file down to a very small set of characters - there are only 96 characters in the basic source character set. Any character not in the basic source character set has been replaced by \
, u
or U
, and some hexadecimal digits (0
-F
).
A universal character name is one of:
\uNNNN
\UNNNNNNNN
Where each N
is a hexadecimal digit. The meaning of these digits is given in §2.3:
The character designated by the universal-character-name \UNNNNNNNN
is that character whose character short name in ISO/IEC 10646 is NNNNNNNN
; the character designated by the universal-character-name \uNNNN
is that character whose character short name in ISO/IEC 10646 is 0000NNNN
. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800
–0xDFFF
, inclusive), the program is ill-formed.
The ISO/IEC 10646 standard originated before Unicode and defined the Universal Character Set (UCS). It assigned code points to characters and specified how those code points should be encoded. The Unicode Consortium and the ISO group then joined forces to work on Unicode. The Unicode standard specifies much more than ISO/IEC 10646 does (algorithms, functional character specifications, etc.) but both standards are now kept in sync.
So you can think of the NNNN
or NNNNNNNN
as the Unicode code point for that character.
As an example, consider a line in your source file containing this:
const char* str = "Hellô";
Since ô is not in the basic source character set, that line is internally translated to:
const char* str = "Hell\u00F4";
This will give the same result.
There are only certain parts of your code that a universal-character-name is permitted: