Answers to your questions (same order):
Why choose? Xcode uses C99 in its default setup. Refer to section 6.4.3, Universal Character Names, of the C99 draft specification. See below.
More technically, the \U0001d11e in @"\U0001d11e" is the 32-bit Unicode code point for that character (U+1D11E, MUSICAL SYMBOL G CLEF) in the ISO/IEC 10646 character set.
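As a quick illustration (the variable name clef is mine), the compiler resolves the escape, and NSString ends up storing that code point as a UTF-16 surrogate pair:

    NSString *clef = @"\U0001d11e";
    // U+1D11E does not fit in a single UTF-16 unit, so -length reports 2,
    // and the two units are the surrogate pair 0xD834 0xDD1E.
    NSLog(@"length = %lu", (unsigned long)[clef length]);          // length = 2
    NSLog(@"units  = 0x%04X 0x%04X",
          (unsigned)[clef characterAtIndex:0],
          (unsigned)[clef characterAtIndex:1]);                    // 0xD834 0xDD1E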
I would not count on that behavior (non-ASCII characters typed directly into the source file) working. You should absolutely, positively, without question keep every character in your source file 7-bit ASCII. For string literals, use an escaped encoding or, preferably, a suitable external resource able to handle binary data.
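If you go the external-resource route, one sketch of what that can look like, using a hypothetical Localizable.strings key of my own choosing, is:

    // Localizable.strings (a UTF-8 or UTF-16 resource file) holds the
    // non-ASCII text, e.g. a hypothetical entry:
    //   "ClefMessage" = "May all your clefs be left like this: <G clef here>";
    //
    // The source file itself stays 7-bit ASCII and loads the text at run time.
    NSString *clefMessage = NSLocalizedString(@"ClefMessage", @"G clef example");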
Universal Character Names (from the WG14/N1256 C99 draft, which Clang follows fairly well):
Universal Character Names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).
Therefore, you can produce your character or string in a natural, mixed way:
    char *utf8CStr =
        "May all your CLEF's \xF0\x9D\x84\x9E be left like this: \U0001d11e";
    NSString *uni4 = [[NSString alloc] initWithUTF8String:utf8CStr];
The \Unnnnnnnn form lets you select any Unicode code point, and it is the same value as the "Unicode" field at the bottom left of the Character Viewer. Direct entry of \Unnnnnnnn in a C99 source file is handled appropriately by the compiler. Note that there are only two options: \unnnn, whose four hex digits (16 bits) cover the Basic Multilingual Plane, and \Unnnnnnnn, whose eight hex digits (32 bits) can name any Unicode code point. You must pad on the left with 0's so that \u gets exactly four digits and \U gets exactly eight.
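A small sketch of the two forms and the zero padding (the example strings are mine):

    // \u takes exactly four hex digits and reaches characters in the
    // Basic Multilingual Plane.
    NSString *eAcute = @"\u00e9";       // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    // Anything outside the BMP needs the eight-digit \U form, left-padded
    // with zeros until all eight digits are present.
    NSString *gClef  = @"\U0001d11e";   // U+1D11E MUSICAL SYMBOL G CLEF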
The form \xF0\x9D\x84\x9E in the same string literal is more interesting. It inserts the raw UTF-8 encoding of the same character directly into the bytes of the literal. By the time the string reaches the initWithUTF8String: method, both the \U escape and the hand-encoded bytes have ended up as the same encoded UTF-8.
It may, arguably, be a violation of 130 of section 5.1.1.2 (translation phases) to use raw bytes in this way. Given that a raw UTF-8 string in the source would be encoded similarly, I think you are OK.
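If you want to convince yourself that the two spellings collapse to the same character, here is a quick check (my own variable names, assuming ARC so the temporary string needs no explicit release):

    // The hand-encoded UTF-8 bytes and the compiler-resolved \U escape
    // decode to the same U+1D11E character.
    NSString *fromBytes  = [[NSString alloc] initWithUTF8String:"\xF0\x9D\x84\x9E"];
    NSString *fromEscape = @"\U0001d11e";
    NSLog(@"equal: %d", [fromBytes isEqualToString:fromEscape]);   // equal: 1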