21

Suppose I have the MUSICAL SYMBOL G CLEF symbol, 𝄞, that I wish to have in a string literal in my Objective-C source file.

The OS X Character Viewer says that the CLEF is UTF8 F0 9D 84 9E and Unicode 1D11E(D834+DD1E) in their terms.

After some futzing around, and using the ICU UNICODE Demonstration Page, I did get the following code to work:

NSString *uni=@"\U0001d11e";
NSString *uni2=[[NSString alloc] initWithUTF8String:"\xF0\x9D\x84\x9E"];
NSString *uni3=@"𝄞";
NSLog(@"unicode: %@ and %@ and %@",uni, uni2, uni3);

My questions:

  1. Is it possible to streamline the way I am doing UTF-8 literals? That seems kludgy to me.
  2. Is the @"\U0001d11e" part UTF-32?
  3. Why does cutting and pasting the CLEF from Character Viewer actually work? I thought Objective-C files had to be UTF-8?
Iulian Onofrei
the wolf

4 Answers

11
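  1. I would prefer the way you did it in uni3, but sadly that is not recommended. Failing that, I would prefer the method in uni to that in uni2. Another option would be [NSString stringWithFormat:@"%C%C", 0xd834, 0xdd1e]. Note that %C takes a single 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane such as U+1D11E has to be passed as its surrogate pair rather than as 0x1d11e.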
  2. It is a "universal character name", introduced in C99 (section 6.4.3) and imported into Objective-C as of OS X 10.5. Technically this doesn't have to give you UTF-8 (it's up to the compiler), but in practice UTF-8 is probably what you'll get.
  3. The encoding of the source code file is probably UTF-8, matching what the runtime expects, so everything happens to work. It's also possible the source file is UTF-16 or UTF-32 and the compiler is doing the Right Thing when compiling it. Nonetheless, Apple does not recommend this.
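
For anyone wondering where the D834+DD1E pair shown by the Character Viewer comes from, here is a minimal sketch of the UTF-16 surrogate calculation in plain C (which also compiles inside an Objective-C file). The function name to_surrogates is made up for this illustration and is not part of any Apple API.

```c
#include <stdint.h>

/* Split a code point above U+FFFF into its UTF-16 surrogate pair.
   Returns 1 on success, 0 if the code point is in the BMP or out of range.
   (Hypothetical helper, for illustration only.) */
int to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits    */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}
```

Calling to_surrogates(0x1D11E, &hi, &lo) yields 0xD834 and 0xDD1E, the two 16-bit code units that a UTF-16 based string stores for the clef.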
Anomie
8

Answers to your questions (same order):

  1. Why choose? Xcode uses C99 in default setup. Refer to the C0X draft specification 6.4.3 on Universal Character Names. See below.

  2. More technically, the @"\U0001d11e" is the 32-bit Unicode code point for that character in the ISO 10646 character set.

  3. I would not count on this behavior working. You should absolutely, positively, without question have all the characters in your source file be 7-bit ASCII. For string literals, use an escaped encoding (such as \U or \x) or, preferably, a suitable external resource able to handle binary data.

Universal Character Names (from the WG14/N1256 C0X Draft, which Clang follows fairly well):

Universal Character Names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).

Therefore, you can produce your character or string in a natural, mixed way:

char *utf8CStr = 
   "May all your CLEF's \xF0\x9D\x84\x9E be left like this: \U0001d11e";
NSString *uni4=[[NSString alloc] initWithUTF8String:utf8CStr];

The \Unnnnnnnn form allows you to select any Unicode code point, and this is the same value as the "Unicode" field at the bottom left of the Character Viewer. The direct entry of \Unnnnnnnn in the C99 source file is handled appropriately by the compiler. Note that there are only two options: \unnnn, which takes exactly four hex digits and so covers code points up to U+FFFF, or \Unnnnnnnn, which takes exactly eight hex digits and can name any Unicode code point. You need to pad the left with 0's if you are not using all 4 digits of \u or all 8 digits of \U.

The form of \xF0\x9D\x84\x9E in the same string literal is more interesting. This inserts the raw UTF-8 encoding of the same character. Once passed to the initWithUTF8String method, both the universal character name and the raw byte sequence end up as the same UTF-8 encoding.
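
To see why the two spellings agree, here is a rough sketch of the UTF-8 encoding step written out by hand in C. The helper name utf8_encode is invented for this example. In real code you would let the compiler or NSString do this for you.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point.
   (Hypothetical helper, for illustration only.) */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* lone surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Feeding it 0x1D11E produces the four bytes F0 9D 84 9E, which is exactly what the \x escapes spell out by hand.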

It may, arguably, be a violation of 130 of section 5.1.1.2 to use raw bytes in this way. Given that a raw UTF-8 string would be encoded similarly, I think you are OK.

Iulian Onofrei
dawg
  • It is certainly not a violation of 130 of section 5.1.1.2 to use raw bytes in that way. "token concatenation" refers to the ## operator used to paste together tokens in the preprocessor being used to paste together something like `\u` and `1234` to get `\u1234`, which has nothing to do with bytes within string literals being used to represent a UTF-8 character. – Anomie Apr 17 '11 at 02:07
  • I stated, >> I << think it is fine, and sounds like you do too. I did have someone so passionate that he turned red then blue stating that using multiple encodings in a single string was not OK and in fact a security risk. I pass on the warning mostly for his long ago fired memory... – dawg Apr 17 '11 at 02:35
  • Multiple encodings in a string will, without a doubt, screw up encoding auto-detectors, though. YMMV. – dawg Apr 17 '11 at 02:37
  • Poor style, sure. Security risk, maybe. Nothing at all to do with 130 of section 5.1.1.2 though. – Anomie Apr 17 '11 at 03:10
2
  1. You can write the clef character in your string literal, too:

    NSString *uni2=[[NSString alloc] initWithUTF8String:"𝄞"];
    
  2. The \U0001d11e matches the Unicode code point for the G clef character. The UTF-32 form of a character is the same as its code point, so you can think of it as UTF-32 if you want to. Here's a link to the Unicode tables for musical symbols.

  3. Your file probably is UTF-8. The G clef is a valid UTF-8 character - check out the output from hexdump for your file:

    00  4e 53 53 74 72 69 6e 67  20 2a 75 6e 69 33 3d 40  |NSString *uni3=@|
    10  22 f0 9d 84 9e 22 3b 0a  20 20 4e 53 4c 6f 67 28  |"....";.  NSLog(|
    

    As you can see, the correct UTF-8 representation of that character is in the file right where you'd expect it. It's probably safer to use one of your other methods and try to keep the source file in the ASCII range.
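
If you want to double-check the bytes from the hexdump yourself, this is a small sketch that decodes a 4-byte UTF-8 sequence back into its code point. The function name utf8_decode4 is hypothetical, and it deliberately handles only the 4-byte case that applies to the clef.

```c
#include <stdint.h>

/* Decode a 4-byte UTF-8 sequence into a code point.
   Returns 0 if the bytes are not a well-formed 4-byte sequence.
   (Hypothetical helper, for illustration only.) */
uint32_t utf8_decode4(const unsigned char s[4])
{
    uint32_t cp;
    int i;

    if ((s[0] & 0xF8) != 0xF0)             /* lead byte must be 11110xxx */
        return 0;
    for (i = 1; i < 4; i++) {
        if ((s[i] & 0xC0) != 0x80)         /* continuation bytes are 10xxxxxx */
            return 0;
    }
    cp = ((uint32_t)(s[0] & 0x07) << 18) |
         ((uint32_t)(s[1] & 0x3F) << 12) |
         ((uint32_t)(s[2] & 0x3F) << 6)  |
          (uint32_t)(s[3] & 0x3F);
    if (cp < 0x10000 || cp > 0x10FFFF)     /* reject overlong or out-of-range values */
        return 0;
    return cp;
}
```

Passing the bytes f0 9d 84 9e from the dump gives back 0x1D11E, the same code point used in the \U escape.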

Carl Norum
0

I created some utility classes to convert easily between Unicode code points, UTF-8 byte sequences and NSString. You can find the code on GitHub; maybe it is of some use to someone.

Almer Lucke